Crawl JS and CSS #454

Closed
Dooriin opened this issue Jan 3, 2024 · 8 comments
Labels: question (Further information is requested)

Dooriin commented Jan 3, 2024

Is there a way to store JS, CSS, and/or any other assets besides HTML in the pages.jsonl file?

I am trying to crawl everything and get a list of all endpoints, so I can perform a health check on every single endpoint in the web app.

I looked through the flags and wasn't able to find anything related to the asset types to be crawled.

tw4l (Member) commented Jan 3, 2024

Hi @Dooriin, the pages.jsonl file is meant to be an index of the HTML pages only, but you should be able to find everything that was crawled in the CDXJ indices. These will be within the WACZ file if you're using the --generateWACZ flag, or can be generated separately with --generateCDX.

Let us know if that helps!
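
For illustration, a minimal sketch of scanning a CDXJ index (e.g. one produced with --generateCDX) for captures that did not return HTTP 200. The field names used here ("url", "status") follow the common CDXJ convention, but verify them against your own index files:

```python
import json
import sys

# Sketch only: scan a CDXJ index and print every capture that did not
# return HTTP 200. Each CDXJ line is "<surt> <timestamp> <json>".
with open(sys.argv[1]) as index:
    for line in index:
        brace = line.find("{")
        if brace == -1:
            continue  # skip any non-CDXJ header lines
        fields = json.loads(line[brace:])
        status = fields.get("status", "-")
        if status != "200":
            print(status, fields.get("url"))
```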

Dooriin (Author) commented Jan 3, 2024

@tw4l thank you for your reply!

Could I request this as a feature? Just as we have the pages.jsonl file, it would be great to also get a resources.jsonl file for assets such as JS, CSS, PDF, or any other type, perhaps filterable with flags.

I am using this for testing purposes, so if one asset returns a 404 it has to be addressed immediately.

I would be happy to support and contribute towards the project by sponsoring :) if that is something that can be achieved.

Looking forward to hearing back!

tw4l (Member) commented Jan 18, 2024

Hi @Dooriin, sorry for the delayed response! This is something we're actually looking into now as we develop features around assisted crawl QA in Browsertrix Cloud.

We have a PR merged in the dev-1.0.0 branch that lists out the page resources and their status codes as records inside the WARC files. You can see an example here: #457.

I'm wondering if that would help with your use case. It's possible we could add an argument to the crawler to add these URLs to pages.jsonl (or a similar resources.jsonl, as you suggest) if it'd be helpful to have them exposed at a higher level in the WACZ rather than only inside the WARC files; storing them in the WARC records is just what's convenient for how we plan to handle QA runs.

ikreymer (Member) commented

@Dooriin Can you explain more about what you're looking for? We will soon have the CDXJ index generated while the crawl is running, so you can also peek into the tmp-cdx directory to get a list of all the resources captured. We could also add more extended logging for each URL being retrieved, if you want to parse the container stdout; that is also doable.

Dooriin (Author) commented Apr 2, 2024

Hi @ikreymer, I am using your product as a testing tool: crawling through the application, getting all the URLs, and then making sure they are all functioning and returning 200.

What I was wondering is whether there's a way to add a few features, such as getting all assets (JS, CSS) for each page and making sure they also return 200. I was curious if there could be something like an array of the assets each page has.

I am mostly referring to the crawls/collections/xxx/pages/pages.jsonl file. It would be great to have the parent URL next to the crawled URL.

ikreymer (Member) commented Apr 2, 2024

> Hi @ikreymer, I am using your product as a testing tool: crawling through the application, getting all the URLs, and then making sure they are all functioning and returning 200.

This tool is really designed for archiving, not testing, and we have special formats intended for storing and replaying archived data at a later time. If your goal is just testing and ensuring correct status codes, I'd suggest using something like Playwright, which is designed specifically for that use case (see: Playwright Response Interception). You might find it easier to use for what you're trying to do.
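
As a minimal sketch of that approach using Playwright's Python bindings ("https://example.com" and the >= 400 threshold are placeholders to adapt to your own app):

```python
from playwright.sync_api import sync_playwright

# Sketch only: record the status of every response a page triggers
# (documents, JS, CSS, images, XHR, etc.) and report the failures.
failures = []

def record_failures(response):
    if response.status >= 400:
        failures.append((response.status, response.url))

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", record_failures)
    page.goto("https://example.com", wait_until="networkidle")
    browser.close()

for status, url in failures:
    print(status, url)
```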

> What I was wondering is whether there's a way to add a few features, such as getting all assets (JS, CSS) for each page and making sure they also return 200. I was curious if there could be something like an array of the assets each page has.

We do generate a urn:pageinfo:<url> record in the WARC file that contains all the resources on the page, but again, this is designed to be used as an archival format, not for testing.
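
If you do want to read those records anyway, here is one way using the warcio library. This is a sketch, not the project's documented API: the filename is a stand-in, and the payload field names ("urls", "status") are assumptions to verify against an actual record from your own crawl:

```python
import json
from warcio.archiveiterator import ArchiveIterator

# Sketch only: pull the urn:pageinfo:<url> records out of a WARC and
# print each page's resources with their status codes.
with open("crawl.warc.gz", "rb") as stream:  # stand-in filename
    for record in ArchiveIterator(stream):
        target = record.rec_headers.get_header("WARC-Target-URI") or ""
        if not target.startswith("urn:pageinfo:"):
            continue
        info = json.loads(record.content_stream().read())
        page_url = target[len("urn:pageinfo:"):]
        # "urls" / "status" are assumed field names -- inspect a real
        # record to confirm the actual payload layout.
        for resource_url, details in info.get("urls", {}).items():
            print(page_url, resource_url, details.get("status"))
```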

ikreymer added the question label Apr 3, 2024
ikreymer (Member) commented Apr 3, 2024

Closing as this has mostly been answered.

ikreymer closed this as completed Apr 3, 2024
Dooriin (Author) commented Apr 4, 2024

@ikreymer Thanks for your reply! The main reason I use this tool is to crawl through the application, as we have over 1,000 pages.

I will keep that in mind!
