Crawl JS and CSS #454
Comments
Hi @Dooriin, the pages.jsonl file is meant to be an index of the HTML pages only, but you should be able to find everything that was crawled in the CDXJ indices. These will be within the WACZ file if you're using the […]. Let us know if that helps!
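To illustrate the suggestion above, here is a minimal sketch of scanning a CDXJ index for failed requests. It assumes the conventional CDXJ line layout (a SURT-style key, a timestamp, then a JSON block) and that the JSON block carries "url" and "status" fields; exact field names and types can vary by indexer, so treat this as a starting point rather than the tool's official API.

```python
import json

def parse_cdxj_line(line):
    # CDXJ format: "<SURT key> <timestamp> <JSON block>"
    key, timestamp, json_block = line.split(" ", 2)
    record = json.loads(json_block)
    record["timestamp"] = timestamp
    return record

def find_failures(cdxj_lines):
    # Collect records whose HTTP status is 4xx/5xx.
    # Some indexers write "status" as a string, so coerce to int.
    failures = []
    for line in cdxj_lines:
        record = parse_cdxj_line(line.strip())
        status = int(record.get("status", 0))
        if status >= 400:
            failures.append((status, record.get("url")))
    return failures
```

Reading the CDXJ out of the WACZ is just a matter of opening the archive with Python's zipfile module and iterating over the index file's lines before passing them to find_failures.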
@tw4l thank you for your reply! Is there a possibility to request this as a feature: just as we have a pages.jsonl file, could we also get a resources.jsonl file for assets such as JS, CSS, PDF, or any other type, and so we can filter it with flags perhaps? I am using this for testing purposes, so if one asset is a 404 then it has to be addressed immediately. I will be happy to support and contribute towards the project by sponsoring :) if that is something that can be achieved. Looking forward to hearing back!
Hi @Dooriin, sorry for the delayed response! This is something we're actually looking into now as we develop features around assisted crawl QA in Browsertrix Cloud. We have a PR merged in the […]. I'm wondering if that would help with your use case. It's possible that we could add an argument to the crawler to add these URLs to the […]
@Dooriin Can you explain more about what you're looking for? We will soon have the CDXJ index generated while the crawl is running, so you can also peek in the […]
Hi @ikreymer, what I was wondering is whether there's a way to add a few features, such as: getting all assets such as JS and CSS for each page and making sure they also return 200. I was curious if there's a way to add something like an array, for each page, of what assets it has. I am mostly referring to the crawls/collections/xxx/pages/pages.jsonl file. It would be great to have the parent URL next to the crawled URL.
This tool is really designed for archiving, not testing, and we have special formats intended for storing and replaying archived data at a later time. If your goal is just testing and ensuring correct status codes, I'd suggest using something like Playwright, which is designed specifically for that use case; see: Playwright Response Interception. You might find it easier to use for what you're trying to do.
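As a rough illustration of the Playwright approach suggested above, the sketch below listens for every network response a page makes and records any 4xx/5xx status. The page.on("response", ...) hook and response.status / response.url attributes are part of Playwright's Python sync API; the helper names and URL list here are hypothetical.

```python
def record_failure(response, failures):
    # Collect any response with an error status (4xx/5xx).
    if response.status >= 400:
        failures.append((response.status, response.url))

def crawl_and_check(urls):
    # Requires: pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    failures = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Every subresource (JS, CSS, images, ...) fires this event,
        # not just the top-level navigation.
        page.on("response", lambda r: record_failure(r, failures))
        for url in urls:
            page.goto(url)
        browser.close()
    return failures
```

Because the handler sees subresource responses too, a single goto per page is enough to health-check that page's JS and CSS assets.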
We do generate a […]
Closing as this has mostly been answered.
@ikreymer Thanks for your reply! The main reason I use this tool is to crawl through the application, as we have over 1,000 pages. I will keep that in mind!
Is there a way we can store JS, CSS, and/or any other assets besides HTML in the pages.jsonl file?
I am trying to crawl everything and get a list of all endpoints, so I can perform a health check on every single endpoint located on the web app.
I was looking through the flags and wasn't able to find anything related to the asset types to be crawled.
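For the health-check part of this question, a minimal sketch is below. It assumes pages.jsonl contains one JSON object per line, with a leading header record that has no "url" field (which matches the file's usual layout, though the exact fields may differ across crawler versions), and uses only the standard library to probe each URL.

```python
import json
import urllib.error
import urllib.request

def load_page_urls(path):
    # pages.jsonl: one JSON object per line; skip any record
    # (such as the header line) that has no "url" field.
    urls = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if "url" in record:
                urls.append(record["url"])
    return urls

def health_check(url):
    # Return the HTTP status for a single endpoint via a HEAD request.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx raise HTTPError; the code is still the status we want.
        return err.code
```

Running health_check over load_page_urls(...) and flagging anything >= 400 gives the "every endpoint returns 200" report described above, albeit only for the HTML pages that pages.jsonl indexes.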