
PR-1, Scrapy Skeleton with default LinkSpider, Anonymizer and Markitdown utils and pipelines #545

Open
nausikt wants to merge 4 commits into archi-physics:dev from
nausikt:refactor/scrapers-to-scrapy/skeleton-and-link-spider

Conversation


@nausikt nausikt commented Apr 7, 2026

This PR includes:

  • a workable, standalone Scrapy structure.
  • a default base LinkScraper that follows Scrapy best practices/conventions.
  • a Scrapy-Archi boundary via Adapters.
  • safeguarding, very conservative default configurations.
  • a built-in Anonymization Pipeline.
  • a built-in Markitdown Pipeline.
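
The conservative defaults mentioned above might look something like the sketch below. The concrete values and setting names shipped in this PR may differ; these are illustrative assumptions based on standard Scrapy settings:

```python
# settings.py -- illustrative sketch of conservative defaults (values are assumptions)
BOT_NAME = "archi_scrapers"            # assumed project name

ROBOTSTXT_OBEY = True                  # respect robots.txt
CONCURRENT_REQUESTS = 2                # very conservative parallelism
DOWNLOAD_DELAY = 3                     # seconds between requests to the same site
DEPTH_LIMIT = 1                        # consistent with the "Reached max depth 1" log
AUTOTHROTTLE_ENABLED = True            # back off automatically under load
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```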

Prerequisite

To pick up the Scrapy libraries and bin entry points, install the package in editable mode:

pip install -e .

How to independently test the Scraping layer without Archi's Persistence layer

1) scrapy check <spider>

We should be able to tap into scraping land without touching data_manager or Archi-deployment concerns, and still check our scraper contracts, which gives us a quick grasp of our parsing logic.

P.S. This is Scrapy's native way to do quick unit testing of parsing logic.
P.S. The link scraper should cover ~90% of use-cases. If there is nothing special about a trivial site/use-case, deriving from the LinkScraper mechanic should be fine; otherwise, feel free to introduce new Spider logic.

> scrapy check link
...
----------------------------------------------------------------------
Ran 3 contracts in 4.691s

OK

These are the given default link contracts (every contract annotation block underneath a parse_* fn definition starts with @url):

    def parse(self, response: Response) -> Iterator[WebPageItem | Request]:
        """
        Extract one item per response, then yield follow Requests up to max_depth.

        @url https://quotes.toscrape.com/
        @returns items 1
        @returns requests 1
        @scrapes url title
        """
        ...

The command checks all contracts underneath the parse* fns.
In this case, it makes sure that our parsing logic:

  • yields at least 1 Item and 1 Request [1], and that the scraped item contains both url and title.
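
Conceptually, a @returns contract just runs the callback against the @url response and counts what it yields. A minimal, dependency-free sketch of that counting logic (DummyItem/DummyRequest are stand-ins for WebPageItem and scrapy.Request, not real Scrapy classes):

```python
def check_returns(output, kind, minimum, maximum=None):
    """Mimic the spirit of Scrapy's @returns contract: count yielded
    objects of one kind and verify the count is within [minimum, maximum]."""
    count = sum(1 for obj in output if isinstance(obj, kind))
    if maximum is None:
        maximum = float("inf")
    return minimum <= count <= maximum

# Stand-ins for WebPageItem and scrapy.Request:
class DummyItem: pass
class DummyRequest: pass

# A parse() output that satisfies "@returns items 1" and "@returns requests 1":
output = [DummyItem(), DummyRequest(), DummyRequest()]
assert check_returns(output, DummyItem, 1)
assert check_returns(output, DummyRequest, 1)
# And a stricter Twiki-style bound, "@returns requests 1 105":
assert check_returns(output, DummyRequest, 1, 105)
```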

P.S. Of course, in non-trivial cases like

  • Twiki, we can put a strict contract like @returns requests 1 105
    (yield at least 1 and up to 105 follow-up links)

as a strict contract making sure that our link-extraction logic narrows a tremendous list of links (Requests) down to below 105, in PR#2 as an example.
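
Beyond min/max bounds, Scrapy also supports custom contracts (subclasses of scrapy.contracts.Contract registered via the SPIDER_CONTRACTS setting). The core check such a contract's post_process() could run, e.g. bounding how many distinct domains the follow-up links span, is plain Python; the contract idea and limit below are illustrative assumptions, not part of this PR:

```python
from urllib.parse import urlparse

def within_domain_budget(urls, max_domains):
    """Core check a hypothetical custom contract could run in post_process():
    True iff the follow-up URLs span at most max_domains distinct hosts."""
    domains = {urlparse(u).netloc for u in urls}
    return len(domains) <= max_domains

links = [
    "https://quotes.toscrape.com/page/2/",
    "https://quotes.toscrape.com/tag/love/",
    "https://example.org/other",
]
assert within_domain_budget(links, 2)        # 2 distinct hosts <= 2
assert not within_domain_budget(links, 1)    # more than 1 host
```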

2) scrapy crawl <spider>

This is where scraper developers spend most of their time: an end-to-end run through the whole Scrapy life-cycle (middlewares, pipelines, default safeguard configurations, and the generic default LinkScraper).
P.S. Basically, we can check/test the majority of scraper-specific things here with logger.debug(...) / logger.info(...).

> scrapy crawl link
...
2026-04-08 15:11:00 [scrapy.core.engine] INFO: Spider opened
2026-04-08 15:11:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2026-04-08 15:11:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2026-04-08 15:11:03 [link] INFO: Extracted 47 links from https://quotes.toscrape.com/    #!!! 47 links extracted given all our logic
...
2026-04-08 15:26:03 [scrapy.extensions.logstats] INFO: Crawled 41 pages (at 20 pages/min), scraped 40 items (at 20 items/min)
2026-04-08 15:26:21 [link] INFO: Reached max depth 1.          #!!! ^------ the line above gives us an idea of the current scrape (download) rate!
2026-04-08 15:26:21 [scrapy.core.engine] INFO: Closing spider (finished)
2026-04-08 15:26:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 23396,
 'downloader/request_count': 56,
 'downloader/request_method_count/GET': 56,
 'downloader/response_bytes': 308413,
 'downloader/response_count': 56,
 'downloader/response_status_count/200': 47,
 'downloader/response_status_count/308': 8,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 1,
 'elapsed_time_seconds': 138.118437,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2026, 4, 8, 13, 26, 21, 231722, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 47,
 'items_per_minute': 20.434782608695652,
 'log_count/INFO': 99,
 'memusage/max': 525565952,
 'memusage/startup': 460193792,
 'request_depth_max': 1,
 'response_received_count': 48,
 'responses_per_minute': 20.869565217391305,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 55,
 'scheduler/dequeued/memory': 55,
 'scheduler/enqueued': 55,                # !!! <------- means 55 links to follow at the next depth level (requests enqueued given all parsing, link extraction, and default configs.)
 'scheduler/enqueued/memory': 55,
 'start_time': datetime.datetime(2026, 4, 8, 13, 24, 3, 113285, tzinfo=datetime.timezone.utc)}
2026-04-08 15:26:21 [scrapy.core.engine] INFO: Spider closed (finished)
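
The logged rates can also be sanity-checked by hand from the stats dump: items per minute is roughly item_scraped_count divided by elapsed minutes (small discrepancies versus the logstats line are expected, since logstats samples on an interval):

```python
# Values taken from the stats dump above
stats = {
    "item_scraped_count": 47,
    "elapsed_time_seconds": 138.118437,
}

items_per_minute = stats["item_scraped_count"] * 60 / stats["elapsed_time_seconds"]
print(round(items_per_minute, 1))  # ~20.4, in line with the logged rate
```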

So, in this case, it scraped https://quotes.toscrape.com/ (typically the toy website for scraping in general) just as we expected.
Of course, we can put MIT sources in as default tests/contracts as well.


Other than that, the Markitdown/Anonymization Pipelines can be tested in logger.debug(...) mode; let me know whether you can see your MIT sources converted to markdown!
P.S. Both anonymize and markitdown are ON by default (only in scrapy-standalone mode).
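
As an illustration of what an anonymization pipeline does: Scrapy pipelines are plain classes exposing a process_item(item, spider) method, so the shape can be sketched without Scrapy itself. The field names and the email regex below are assumptions for illustration, not the PR's actual implementation:

```python
import re

# Rough email pattern for illustration only
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class AnonymizerPipeline:
    """Scrub email addresses from an item's text before it moves on
    to the next pipeline (e.g. Markitdown)."""
    def process_item(self, item, spider):
        item["text"] = EMAIL_RE.sub("[email redacted]", item["text"])
        return item

pipeline = AnonymizerPipeline()
item = {"url": "https://example.org", "text": "Contact alice@example.org for details."}
out = pipeline.process_item(item, spider=None)
print(out["text"])  # -> Contact [email redacted] for details.
```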


[1] Please also consult the intuitive diagram below from time to time; in this PR we mostly focus on 1, 6, 7 and 8.

image
