Migrate Scrapers to Scrapy #546

@nausikt

Description

As we encountered in #464:

We need a unified, robust crawler that can be configured per site, guided by years of battle-tested conventions from the Scrapy community that improve extensibility, reduce the risk of throttling, and support long-term maintainability.

Feel free to review this as a whole, or consider the split version in two phases of multiple PRs as follows, which we can start merging safely.

Phase 1, Standalone Scrapy

with core reusable/shared Pipelines/Middlewares, a set of Scrapers (Spiders), e.g. Link, Twiki, Discourse, etc., and a ScrapyItem-to-ArchiScrapedResource adapter as a boundary. (Please see also the test results and details in each PR.)
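A minimal sketch of what such an adapter boundary could look like. The `ArchiScrapedResource` field names and the item keys here are assumptions for illustration, not the actual Archi types; Scrapy spiders can also yield plain dict-like items, which is what the adapter consumes:

```python
from dataclasses import dataclass


@dataclass
class ArchiScrapedResource:
    # Hypothetical Archi-side type; the real field names may differ
    url: str
    title: str
    content: str


def item_to_resource(item: dict) -> ArchiScrapedResource:
    """Adapter boundary: map a scraped item (Scrapy spiders may yield
    plain dicts) onto the Archi persistence type, keeping the Scrapy
    side and the Archi side decoupled."""
    return ArchiScrapedResource(
        url=item["url"],
        title=item.get("title", ""),
        content=item.get("body", ""),
    )
```

Keeping the conversion in one function like this means Phase 2 only has to wire this single seam into the persistence pipeline, rather than touching every spider.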

Phase 2, Wiring with Archi Interfaces, putting everything together

  • PR-4, wire the standalone Scrapy scrapers into the refactored Archi interfaces:
  • Start by decoupling GitScraper from the scraping layer, since it does not naturally fit the same crawling/scraping model as the other sources (it mostly clones a whole git repo as local files, so it is better treated as a dedicated manager, like the file manager).
  • Wire the Scrapy-to-Archi persistence layer through the persistence service/pipeline and adapters.
  • Refactor the source.links interfaces into sources.web, separating source kinds by spider; SSO becomes a configurable auth_provider instead.
  • Pass Archi web source params/args to the spiders.
  • Add e2e tests for the data_manager service as a whole, with a basic-scraping deployment example.
  • PR-5, an upcoming second version based on Liv's IndicoScraper.
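A rough sketch of how a sources.web entry with a configurable auth_provider might be passed down to a spider, per the bullets above. All names here (`WebSource`, `AuthProvider`, the spider kwargs) are assumptions to illustrate the shape of the wiring, not the actual Archi interfaces:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AuthProvider:
    # Hypothetical auth config: SSO becomes one provider kind among
    # others instead of a hard-coded code path
    kind: str = "none"            # e.g. "sso", "basic", "none"
    login_url: Optional[str] = None


@dataclass
class WebSource:
    # Hypothetical sources.web entry; source kinds are separated by spider
    name: str
    start_urls: list = field(default_factory=list)
    spider: str = "link"          # e.g. "link", "twiki", "discourse"
    auth: AuthProvider = field(default_factory=AuthProvider)


def spider_kwargs(src: WebSource) -> dict:
    """Translate an Archi web source into keyword args for its spider,
    so per-site config stays on the Archi side of the boundary."""
    return {
        "start_urls": src.start_urls,
        "auth_kind": src.auth.kind,
        "login_url": src.auth.login_url,
    }
```

The kwargs dict maps naturally onto Scrapy's spider arguments (the values a crawler passes to a spider's constructor), which keeps site-specific configuration out of the spider code itself.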

P.S. Since I can’t create branches on the upstream repo yet, the stacked PRs are currently opened against my fork first and will be re-targeted as merging progresses.
