As we encountered #464, we need a unified, robust crawler that can be configured in a site-specific way, guided by battle-tested conventions from the Scrapy community that improve extensibility, reduce the risk of throttling, and support long-term maintainability.

Feel free to review the whole change at once, or the version split into two phases of multiple PRs as follows, which we can start merging safely.

Phase 1, Standalone Scrapy
A skeleton with clear conventions that is extensible, easy to maintain, and can run unit tests and e2e tests independently from Archi via native scrapy check/crawl. It provides the core reusable/shared Pipelines/Middlewares, a set of Scrapers (Spiders) such as Link, Twiki, Discourse, etc., and a ScrapyItem-to-ArchiScrapedResource adapter as the boundary. (Please also see the test results and details in each PR.)
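The adapter boundary can be sketched roughly as follows; the ArchiScrapedResource fields and item keys here are assumptions for illustration, not the actual interfaces:

```python
from dataclasses import dataclass

# Hypothetical shape of the Archi-side record; field names are illustrative,
# not the real ArchiScrapedResource definition.
@dataclass
class ArchiScrapedResource:
    url: str
    title: str
    content: str
    source_kind: str

def item_to_resource(item: dict) -> ArchiScrapedResource:
    # The adapter is the only place that knows both vocabularies:
    # Scrapy spiders yield plain dicts/Items, Archi consumes resources.
    return ArchiScrapedResource(
        url=item["url"],
        title=item.get("title", ""),
        content=item.get("markdown", ""),  # e.g. output of a MarkItDown pipeline
        source_kind=item.get("source_kind", "web"),
    )
```

Keeping this conversion in one function means neither side imports the other's types anywhere else.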
PR-2, TwikiSpider, Auth middlewares and CERN SSO
This PR includes two commits: [1] demonstrating and extending the TWiki integration, and [2] showing how a specific AuthProvider, such as SSO_CERN, can be cleanly integrated into the Scrapy framework without any Archi-specific context.
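One way the pluggable-provider pattern can keep SSO details out of the spiders is sketched below; the class and method names are illustrative, not the PR's actual code, and requests are modeled as plain dicts so the sketch stays independent of Scrapy:

```python
from abc import ABC, abstractmethod

class AuthProvider(ABC):
    """Site-agnostic authentication hook, free of Archi-specific context."""
    @abstractmethod
    def auth_headers(self) -> dict: ...

class StaticTokenProvider(AuthProvider):
    # Stand-in for a concrete provider such as a CERN SSO one.
    def __init__(self, token: str):
        self._token = token
    def auth_headers(self) -> dict:
        return {"Authorization": f"Bearer {self._token}"}

class AuthMiddleware:
    """Downloader-middleware-style hook: every outgoing request gets
    whatever credentials the configured provider supplies."""
    def __init__(self, provider: AuthProvider):
        self.provider = provider
    def process_request(self, request: dict) -> dict:
        request.setdefault("headers", {}).update(self.provider.auth_headers())
        return request
```

Swapping providers (CERN SSO, token, none) then touches only configuration, never the middleware or spider code.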
PR-3, SSO-based DiscourseSpider
This PR shows how a CERN SSO-based Discourse scraper can reuse the built-in AuthMiddleware, Anonymizer, and MarkItDown pipelines following the same pattern as the Link and Twiki scrapers, keeping the integration low-footprint while allowing site-specific policy customization and improving robustness.
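A Discourse spider's wiring to the shared components could then be pure configuration, roughly like the fragment below; the dotted module paths and setting names are hypothetical, not the real layout:

```python
# Hypothetical per-spider Scrapy settings; paths and priorities illustrative.
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "crawler.middlewares.AuthMiddleware": 543,
    },
    "ITEM_PIPELINES": {
        "crawler.pipelines.AnonymizerPipeline": 100,
        "crawler.pipelines.MarkItDownPipeline": 200,
    },
    # Site-specific policy stays in config, not in spider code.
    "AUTH_PROVIDER": "sso_cern",
}
```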
Phase 2, Wiring with Archi Interfaces: putting everything together
Start by decoupling GitScraper from the scraping layer, since it does not naturally fit the same crawling/scraping model as the other sources: it mostly clones a whole Git repo as local files, so it is better treated as a dedicated manager, like the file manager, instead.
Wire the Scrapy-to-Archi persistence layer through the Persistence service/pipeline and adapters.
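This wiring can be sketched as an item-pipeline-shaped hook that composes an adapter with a persistence callable; both arguments are assumed interfaces, not Archi's real ones:

```python
class PersistencePipeline:
    """Scrapy item-pipeline-shaped boundary: adapt each scraped item and
    hand the result to Archi's persistence service."""
    def __init__(self, adapt, save):
        self.adapt = adapt  # item -> ArchiScrapedResource adapter (assumed)
        self.save = save    # persistence service entry point (assumed)

    def process_item(self, item, spider=None):
        self.save(self.adapt(item))
        return item  # pass the item along, Scrapy-pipeline style
```

Because both collaborators are injected, the pipeline itself stays testable without Scrapy or Archi running.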
Refactor the source.links interfaces into sources.web, separating source kinds by spider; SSO becomes a configurable auth_provider instead.
Pass the Archi web-source params/args through to the spiders.
Add e2e tests for the data_manager service as a whole, with a basic-scraping deployment example.
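Put together, a sources.web entry and its hand-off to a spider might look like the sketch below; all field names are assumptions for illustration, not the real schema:

```python
# Hypothetical sources.web configuration: one entry per source kind (spider),
# with SSO reduced to a configurable auth_provider instead of hard-wired code.
sources_web = [
    {
        "spider": "discourse",
        "start_urls": ["https://discourse.example.org"],
        "auth_provider": "sso_cern",
    },
    {
        "spider": "twiki",
        "start_urls": ["https://twiki.example.org/bin/view"],
        "auth_provider": None,  # public site, no auth needed
    },
]

def spider_kwargs(source: dict) -> dict:
    # Flatten one web-source entry into the keyword args a spider receives
    # (the same shape that `scrapy crawl -a key=value` would pass).
    return {
        "start_urls": ",".join(source["start_urls"]),
        "auth_provider": source.get("auth_provider") or "",
    }
```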
P.S. Since I can't create branches on the upstream repo yet, the stacked PRs are currently opened against my fork and will be re-targeted as merging progresses.