
PR-1, Scrapy Skeleton with default LinkSpider, Anonymizer and Markitdown utils and pipelines #545

Open
nausikt wants to merge 4 commits into archi-physics:dev from
nausikt:refactor/scrapers-to-scrapy/skeleton-and-link-spider

Conversation


@nausikt nausikt commented Apr 7, 2026

This PR includes:

  • a workable, standalone Scrapy structure.
  • a default base LinkScraper that follows Scrapy best practices/conventions.
  • a Scrapy-Archi boundary via Adapters.
  • safeguarding, very conservative default configurations.
  • a built-in Anonymization Pipeline.
  • a built-in Markitdown Pipeline.
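
The conservative defaults mentioned above might look something like the sketch below. The concrete values and setting names shipped in this PR may differ; these are illustrative assumptions based on standard Scrapy settings:

```python
# settings.py -- illustrative sketch of conservative defaults (values are assumptions)
BOT_NAME = "archi_scrapers"            # assumed project name

ROBOTSTXT_OBEY = True                  # respect robots.txt
CONCURRENT_REQUESTS = 2                # very conservative parallelism
DOWNLOAD_DELAY = 3                     # seconds between requests to the same site
DEPTH_LIMIT = 1                        # consistent with the "Reached max depth 1" log
AUTOTHROTTLE_ENABLED = True            # back off automatically under load
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```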

Prerequisite

To pick up the Scrapy libraries and bin entry points, install the package in editable mode:

pip install -e .

How to independently test the Scraping layer without Archi's Persistence layer

1) scrapy check <spider>

We should be able to tap into scraping land without touching data_manager or Archi-deployment concerns, and still check our scraper contracts, which gives us a quick grasp of our parsing logic.

P.S. This is Scrapy's native way to do quick unit testing of parsing logic.
P.S. The link scraper should cover ~90% of use-cases. If there is nothing special about a trivial site/use-case, deriving from the LinkScraper mechanic should be fine; otherwise, feel free to introduce new Spider logic.

> scrapy check link
...
----------------------------------------------------------------------
Ran 3 contracts in 4.691s

OK

These are the given default link contracts (every contract annotation block underneath a parse_* fn definition starts with @url):

    def parse(self, response: Response) -> Iterator[WebPageItem | Request]:
        """
        Extract one item per response, then yield follow Requests up to max_depth.

        @url https://quotes.toscrape.com/
        @returns items 1
        @returns requests 1
        @scrapes url title
        """
        ...

The command checks all contracts underneath the parse* fns.
In this case, it makes sure that our parsing logic:

  • yields at least 1 Item and 1 Request [1], and that the scraped item contains both url and title.
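
Conceptually, a @returns contract just runs the callback against the @url response and counts what it yields. A minimal, dependency-free sketch of that counting logic (DummyItem/DummyRequest are stand-ins for WebPageItem and scrapy.Request, not real Scrapy classes):

```python
def check_returns(output, kind, minimum, maximum=None):
    """Mimic the spirit of Scrapy's @returns contract: count yielded
    objects of one kind and verify the count is within [minimum, maximum]."""
    count = sum(1 for obj in output if isinstance(obj, kind))
    if maximum is None:
        maximum = float("inf")
    return minimum <= count <= maximum

# Stand-ins for WebPageItem and scrapy.Request:
class DummyItem: pass
class DummyRequest: pass

# A parse() output that satisfies "@returns items 1" and "@returns requests 1":
output = [DummyItem(), DummyRequest(), DummyRequest()]
assert check_returns(output, DummyItem, 1)
assert check_returns(output, DummyRequest, 1)
# And a stricter Twiki-style bound, "@returns requests 1 105":
assert check_returns(output, DummyRequest, 1, 105)
```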

P.S. Of course, in non-trivial cases like

  • Twiki, we can put a strict contract like @returns requests 1 105
    (yield at least 1 and up to 105 follow-up links)

as a strict contract making sure that our link-extraction logic narrows a tremendous list of links (Requests) down to below 105, in PR#2 as an example.
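
Beyond min/max bounds, Scrapy also supports custom contracts (subclasses of scrapy.contracts.Contract registered via the SPIDER_CONTRACTS setting). The core check such a contract's post_process() could run, e.g. bounding how many distinct domains the follow-up links span, is plain Python; the contract idea and limit below are illustrative assumptions, not part of this PR:

```python
from urllib.parse import urlparse

def within_domain_budget(urls, max_domains):
    """Core check a hypothetical custom contract could run in post_process():
    True iff the follow-up URLs span at most max_domains distinct hosts."""
    domains = {urlparse(u).netloc for u in urls}
    return len(domains) <= max_domains

links = [
    "https://quotes.toscrape.com/page/2/",
    "https://quotes.toscrape.com/tag/love/",
    "https://example.org/other",
]
assert within_domain_budget(links, 2)        # 2 distinct hosts <= 2
assert not within_domain_budget(links, 1)    # more than 1 host
```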

2) scrapy crawl <spider>

This is where scraper developers spend most of their time: an end-to-end run through the whole Scrapy life-cycle (middlewares, pipelines, default safeguard configurations, and the generic default LinkScraper).
P.S. Basically, we can check/test the majority of scraper-specific things here with logger.debug(...) / logger.info(...).

> scrapy crawl link
...
2026-04-08 15:11:00 [scrapy.core.engine] INFO: Spider opened
2026-04-08 15:11:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2026-04-08 15:11:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2026-04-08 15:11:03 [link] INFO: Extracted 47 links from https://quotes.toscrape.com/    #!!! 47 links extracted given all our logic
...
2026-04-08 15:26:03 [scrapy.extensions.logstats] INFO: Crawled 41 pages (at 20 pages/min), scraped 40 items (at 20 items/min)
2026-04-08 15:26:21 [link] INFO: Reached max depth 1.          #!!! ^------ the line above gives us an idea of the current scrape (download) rate!
2026-04-08 15:26:21 [scrapy.core.engine] INFO: Closing spider (finished)
2026-04-08 15:26:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 23396,
 'downloader/request_count': 56,
 'downloader/request_method_count/GET': 56,
 'downloader/response_bytes': 308413,
 'downloader/response_count': 56,
 'downloader/response_status_count/200': 47,
 'downloader/response_status_count/308': 8,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 1,
 'elapsed_time_seconds': 138.118437,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2026, 4, 8, 13, 26, 21, 231722, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 47,
 'items_per_minute': 20.434782608695652,
 'log_count/INFO': 99,
 'memusage/max': 525565952,
 'memusage/startup': 460193792,
 'request_depth_max': 1,
 'response_received_count': 48,
 'responses_per_minute': 20.869565217391305,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 55,
 'scheduler/dequeued/memory': 55,
 'scheduler/enqueued': 55,                # !!! <------- means 55 links to follow at the next depth level (requests enqueued given all parsing, link extraction, and default configs.)
 'scheduler/enqueued/memory': 55,
 'start_time': datetime.datetime(2026, 4, 8, 13, 24, 3, 113285, tzinfo=datetime.timezone.utc)}
2026-04-08 15:26:21 [scrapy.core.engine] INFO: Spider closed (finished)
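
The logged rates can also be sanity-checked by hand from the stats dump: items per minute is roughly item_scraped_count divided by elapsed minutes (small discrepancies versus the logstats line are expected, since logstats samples on an interval):

```python
# Values taken from the stats dump above
stats = {
    "item_scraped_count": 47,
    "elapsed_time_seconds": 138.118437,
}

items_per_minute = stats["item_scraped_count"] * 60 / stats["elapsed_time_seconds"]
print(round(items_per_minute, 1))  # ~20.4, in line with the logged rate
```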

So, in this case, it scraped https://quotes.toscrape.com/ (typically the toy website for scraping in general) just as we expected.
Of course, we can put MIT sources in as default tests/contracts as well.


Other than that, the Markitdown/Anonymization Pipelines can be tested in logger.debug(...) mode; let me know whether you can see your MIT sources converted to markdown!
P.S. Both anonymize and markitdown are ON by default (only in scrapy-standalone mode).
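
As an illustration of what an anonymization pipeline does: Scrapy pipelines are plain classes exposing a process_item(item, spider) method, so the shape can be sketched without Scrapy itself. The field names and the email regex below are assumptions for illustration, not the PR's actual implementation:

```python
import re

# Rough email pattern for illustration only
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class AnonymizerPipeline:
    """Scrub email addresses from an item's text before it moves on
    to the next pipeline (e.g. Markitdown)."""
    def process_item(self, item, spider):
        item["text"] = EMAIL_RE.sub("[email redacted]", item["text"])
        return item

pipeline = AnonymizerPipeline()
item = {"url": "https://example.org", "text": "Contact alice@example.org for details."}
out = pipeline.process_item(item, spider=None)
print(out["text"])  # -> Contact [email redacted] for details.
```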


[1] Please also consult the intuitive diagram below from time to time; in this PR we mostly focus on 1, 6, 7 and 8.

image
