As we encountered #464, we need a unified, robust crawler that can be configured in a site-specific way, guided by battle-tested conventions from the Scrapy community that improve extensibility, reduce the risk of throttling, and support long-term maintainability.

Feel free to review the whole change at once, or the version split into two phases of multiple PRs as follows, which we can start merging safely.

Phase 1, Standalone Scrapy
A skeleton with clear conventions that is extensible, easy to maintain, and can run unit tests and e2e tests independently from Archi via native scrapy check/crawl. It provides the core reusable/shared Pipelines/Middlewares, a set of Scrapers (Spiders) such as Link, Twiki, Discourse, etc., and a ScrapyItem-to-ArchiScrapedResource adapter as the boundary. (Please also see the test results and details in each PR.)
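The adapter boundary can be sketched roughly as follows; the ArchiScrapedResource fields and item keys here are assumptions for illustration, not the actual interfaces:

```python
from dataclasses import dataclass

# Hypothetical shape of the Archi-side record; field names are illustrative,
# not the real ArchiScrapedResource definition.
@dataclass
class ArchiScrapedResource:
    url: str
    title: str
    content: str
    source_kind: str

def item_to_resource(item: dict) -> ArchiScrapedResource:
    # The adapter is the only place that knows both vocabularies:
    # Scrapy spiders yield plain dicts/Items, Archi consumes resources.
    return ArchiScrapedResource(
        url=item["url"],
        title=item.get("title", ""),
        content=item.get("markdown", ""),  # e.g. output of a MarkItDown pipeline
        source_kind=item.get("source_kind", "web"),
    )
```

Keeping this conversion in one function means neither side imports the other's types anywhere else.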
PR-2, TwikiSpider, Auth middlewares and CERN SSO
This PR includes two commits: [1] demonstrating and extending the TWiki integration, and [2] showing how a specific AuthProvider, such as SSO_CERN, can be cleanly integrated into the Scrapy framework without any Archi-specific context.
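One way the pluggable-provider pattern can keep SSO details out of the spiders is sketched below; the class and method names are illustrative, not the PR's actual code, and requests are modeled as plain dicts so the sketch stays independent of Scrapy:

```python
from abc import ABC, abstractmethod

class AuthProvider(ABC):
    """Site-agnostic authentication hook, free of Archi-specific context."""
    @abstractmethod
    def auth_headers(self) -> dict: ...

class StaticTokenProvider(AuthProvider):
    # Stand-in for a concrete provider such as a CERN SSO one.
    def __init__(self, token: str):
        self._token = token
    def auth_headers(self) -> dict:
        return {"Authorization": f"Bearer {self._token}"}

class AuthMiddleware:
    """Downloader-middleware-style hook: every outgoing request gets
    whatever credentials the configured provider supplies."""
    def __init__(self, provider: AuthProvider):
        self.provider = provider
    def process_request(self, request: dict) -> dict:
        request.setdefault("headers", {}).update(self.provider.auth_headers())
        return request
```

Swapping providers (CERN SSO, token, none) then touches only configuration, never the middleware or spider code.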
PR-3, SSO-based DiscourseSpider
This PR shows how a CERN SSO-based Discourse scraper can reuse the built-in AuthMiddleware, Anonymizer, and MarkItDown pipelines following the same pattern as the Link and Twiki scrapers, keeping the integration low-footprint while allowing site-specific policy customization and improving robustness.
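A Discourse spider's wiring to the shared components could then be pure configuration, roughly like the fragment below; the dotted module paths and setting names are hypothetical, not the real layout:

```python
# Hypothetical per-spider Scrapy settings; paths and priorities illustrative.
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "crawler.middlewares.AuthMiddleware": 543,
    },
    "ITEM_PIPELINES": {
        "crawler.pipelines.AnonymizerPipeline": 100,
        "crawler.pipelines.MarkItDownPipeline": 200,
    },
    # Site-specific policy stays in config, not in spider code.
    "AUTH_PROVIDER": "sso_cern",
}
```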
Phase 2, Wiring with Archi Interfaces: putting everything together
Start by decoupling GitScraper from the scraping layer, since it does not naturally fit the same crawling/scraping model as the other sources: it mostly clones a whole Git repo as local files, so it is better treated as a dedicated manager, like the file manager, instead.
Wire the Scrapy-to-Archi persistence layer through the Persistence service/pipeline and adapters.
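This wiring can be sketched as an item-pipeline-shaped hook that composes an adapter with a persistence callable; both arguments are assumed interfaces, not Archi's real ones:

```python
class PersistencePipeline:
    """Scrapy item-pipeline-shaped boundary: adapt each scraped item and
    hand the result to Archi's persistence service."""
    def __init__(self, adapt, save):
        self.adapt = adapt  # item -> ArchiScrapedResource adapter (assumed)
        self.save = save    # persistence service entry point (assumed)

    def process_item(self, item, spider=None):
        self.save(self.adapt(item))
        return item  # pass the item along, Scrapy-pipeline style
```

Because both collaborators are injected, the pipeline itself stays testable without Scrapy or Archi running.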
Refactor the source.links interfaces into sources.web, separating source kinds by spider; SSO becomes a configurable auth_provider instead.
Pass the Archi web-source params/args through to the spiders.
Add e2e tests for the data_manager service as a whole, with a basic-scraping deployment example.
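Put together, a sources.web entry and its hand-off to a spider might look like the sketch below; all field names are assumptions for illustration, not the real schema:

```python
# Hypothetical sources.web configuration: one entry per source kind (spider),
# with SSO reduced to a configurable auth_provider instead of hard-wired code.
sources_web = [
    {
        "spider": "discourse",
        "start_urls": ["https://discourse.example.org"],
        "auth_provider": "sso_cern",
    },
    {
        "spider": "twiki",
        "start_urls": ["https://twiki.example.org/bin/view"],
        "auth_provider": None,  # public site, no auth needed
    },
]

def spider_kwargs(source: dict) -> dict:
    # Flatten one web-source entry into the keyword args a spider receives
    # (the same shape that `scrapy crawl -a key=value` would pass).
    return {
        "start_urls": ",".join(source["start_urls"]),
        "auth_provider": source.get("auth_provider") or "",
    }
```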
P.S. Since I can't create branches on the upstream repo yet, the stacked PRs are currently opened against my fork and will be re-targeted as merging progresses.