PR-1, Scrapy Skeleton with default LinkSpider, Anonymizer and Markitdown utils and pipelines#545
Open
nausikt wants to merge 4 commits into archi-physics:dev from
Conversation
…gs, can scrapy check/crawl link.
…er to support generic markdown, twiki, discourse patterns.
This PR includes:
Prerequisite
for picking up the scrapy libs and bin
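Assuming the prerequisite here is simply getting the Scrapy distribution onto the PATH (the exact pinned dependencies for this repo are not shown in this excerpt), a typical setup might look like:

```shell
# Assumption: installing Scrapy provides the `scrapy` CLI entry point
# used below by `scrapy check` and `scrapy crawl`.
pip install scrapy
scrapy version   # sanity-check that the bin is picked up
```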
How to independently test the Scraping layer w/o Archi's Persistence layer
1) scrapy check <spider>
We should be able to tap into scraping land without having to touch data_manager or the Archi-deployment stuff, and be able to check our scraper contracts, which should give us a quick grasp of our parsing logic.
P.S. This is the Scrapy-native way to do quick unit testing of parsing logic.
P.S. The link scraper should cover 90% of use-cases; if there's nothing special in trivial sites/use-cases, we should be fine deriving from the LinkScraper mechanic, otherwise feel free to introduce new Spider logic.
For the given default link contracts (every annotation starts with `@url` underneath the `parse_*` fn definitions), the command will check all our contracts underneath the `parse_*` fns. In this case, it makes sure that our subsequent parsing logic extracts `url` and `title` well. P.S. Of course, in non-trivial cases, a contract like `@returns requests 1 105` (yield at least 1 and up to 105 subsequent links to follow) serves as a strict guarantee that our link-extraction logic narrows the tremendous list of (Request) links down to below 105; PR#2 is an example.
2) scrapy crawl <spider>
This is where scraper developers mostly spend their time: it's an e2e run that passes through the whole Scrapy life-cycle (middlewares, pipelines, default safeguard configurations, and the generic default LinkScrapers).
P.S. Basically, we can check/test the majority of scraper-specific things with `logger.debug(...)` and `logger.info(...)` here. So, in this case, it scraped https://quotes.toscrape.com/ (typically the toy website for scraping in general) well, as we expected. Of course, we can put the mit sources in as default tests/contracts as well.
Other than that, the Markdown/Anonymization pipelines can be tested in `logger.debug(...)` mode; let me know whether you can see your mit sources converted to markdown! P.S. Both anonymize and markitdown are ON by default (only in scrapy-standalone mode).
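To make the pipeline idea concrete, here is a minimal sketch of a Scrapy-style item pipeline in the spirit of the PR's Anonymizer. The regex-based email scrubbing and the `body`/`url` field names are assumptions for illustration, not the PR's actual implementation:

```python
import logging
import re

logger = logging.getLogger(__name__)


class AnonymizePipeline:
    """Scrapy item pipelines are plain classes exposing process_item()."""

    # Naive email pattern; illustrative only.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def process_item(self, item, spider=None):
        # Scrub email addresses from the scraped body before the item
        # moves further down the pipeline chain.
        body = item.get("body", "")
        item["body"] = self.EMAIL_RE.sub("<email>", body)
        logger.debug("anonymized item from %s", item.get("url"))
        return item
```

In a real project this would be enabled via `ITEM_PIPELINES` in settings.py, but it can also be exercised directly, e.g. `AnonymizePipeline().process_item({"body": "mail me at a@b.com"})`.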
[1] Please also consult the intuitive diagram below from time to time; in this PR we mostly focus on 1, 6, 7 and 8.