
[Ref] Integrate Scrapy-based scrapers into Archi interfaces#547

Open
nausikt wants to merge 55 commits intoarchi-physics:devfrom
nausikt:ref/scrapers-to-scrapy

Conversation


@nausikt nausikt commented Apr 8, 2026

Resolves #546 and #464

There's only one culprit for SSO with Playwright + Chromium to work as expected. Please beware:

  • On macOS with an ARM-based CPU, please make sure Rosetta is enabled for your Podman machine, or available on your laptop and used by the container.

Otherwise, it may fall back to a QEMU-emulated Chromium instance, which causes race conditions!

  • Other setups (Linux, x86_64 production machines, VMs) should work fine.

End-to-End Integrated Archi Infrastructure Test

The main caveat is when seed URLs mix SSO-protected and public pages, such as on TWiki.
In those cases, please list the SSO-protected seed URLs first, followed by the remaining public/SSO-mixed URLs.

        twiki:
            auth_provider_name: cern_sso   # can be safely removed if we crawl only public pages.
            anonymize_data: true
            markitdown: true
            urls: # If any TWiki URLs are SSO-protected, please list them first for efficiency and robustness
                - https://twiki.cern.ch/twiki/bin/view/CMS/HeavyIons # SSO-protected TWiki pages.
                - https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuide # public TWiki seed urls.
archi create --name basic-scraping --config examples/deployments/basic-scraping/config.yaml -e .env --podman --services chatbot --verbosity=4 --force

Please feel free to adjust examples/deployments/basic-scraping/config.yaml as you see fit. Currently, it might take a very long time to pass through every web source.

  • LinkSpider for MIT sources

    may take less than 1 hr.

  • BUT... TWiki HeavyIons might take

    at least (300 + 100 from CRAB)++ docs x 60 seconds (crawl delay)
    ~= 24,000 seconds (6-7 hrs)

  • Discourse may take

    10 seconds (crawl delay) each x (47 pages + at least 500 docs)
    ~= 1 hr 40 mins

  • the new GitManager, as I've tested on dmwm/CRABServer and dmwm/CRABClient:

    the 2 repos have ~400 files;
    it takes less than an hr and works as efficiently as in the past.
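For reference, the rough arithmetic behind these estimates can be checked quickly (doc counts are the approximate lower bounds quoted above):

```python
# Back-of-the-envelope crawl-time estimates from the figures above.
twiki_docs = 300 + 100          # HeavyIons pages + CRAB docs (approximate)
twiki_delay = 60                # seconds of crawl delay per doc
twiki_seconds = twiki_docs * twiki_delay
print(twiki_seconds / 3600)     # roughly 6.7 hours

discourse_items = 47 + 500      # category pages + docs (at least)
discourse_delay = 10
print(discourse_items * discourse_delay / 60)  # roughly 91 minutes
```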

Expected Result
[1] All comprehensive sources are ingested, which might take hours to finish.
[Screenshot 2026-04-08 at 18:53:22]
[2] Also, most of our docs should be in anonymized Markdown format right away.
P.S. Anonymized as well as the NLP module and our heuristic-based (hard-coded) per-source name-replacement patterns allow.

Standalone Scrapy Test

More details are coming soon...

nausikt added 30 commits April 7, 2026 01:03
…C via parsers, introduce extension points parse_item, parse_follow_links, default implementation, toscrape and twiki example.
…-world twiki crawling contracts with safest default values.
…ace & instantiations to GitManager and Scrapy's ScraperManager.
…s, refactor interface, GitManager works, link, twiki public + sso sources example.
… if not main thread, bugs scraper_manager not aware of enabled flag.
@nausikt nausikt mentioned this pull request Apr 8, 2026
Author

nausikt commented Apr 8, 2026

Note that we still lack the mit_sso auth provider, which will be added ASAP.

For now, we can work around this by testing MIT sources and CERN sources in a mutually exclusive manner.

Collaborator

pmlugato commented Apr 8, 2026

Hi @nausikt, thanks for this PR! Looking into it and testing now. One thing I noticed right away: please add any new packages to the requirements files. Having them in the pyproject is fine for now for testing, but this way the images will be updated accordingly once merged to main.

Author

nausikt commented Apr 8, 2026

Oh dear, let me fix it now!

Comment on lines +44 to +51
urls:
  - https://ppc.mit.edu/news/
max_depth: 2
max_pages: 100
delay: 10
markitdown: true
input_lists:
  - examples/deployments/basic-scraping/miscellanea.list
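For context, merging inline `urls` with urls read from `.list` files could look like this minimal sketch (the helper name `collect_seeds` and the file handling are assumptions, not the PR's actual implementation):

```python
# Hypothetical sketch: combine inline urls and .list files into one seed list.
from pathlib import Path

def collect_seeds(urls, input_lists):
    """Merge inline urls with urls read from .list files (one URL per line)."""
    seeds = list(urls)
    for list_path in input_lists:
        for line in Path(list_path).read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#"):  # skip blanks and comments
                seeds.append(line)
    return seeds
```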
Collaborator


So you can pass urls either directly in the config or via a .list file, as before? For the latter case, the file can contain both public and SSO urls, without any need for a prefix, right?

Author


Exactly, Pietro! For all scrapers (spiders), by design.

Comment on lines +92 to +95
git:
  urls:
    - https://github.com/dmwm/CRABServer
    - https://github.com/dmwm/CRABClient
Collaborator

@pmlugato pmlugato Apr 8, 2026


Unlike web.links, do web.git and web.twiki only take the urls directly in the config? If you could also support an input_list here, it would be nice for people who want to keep it separate.

Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in the same list as other links, with the twiki options under web.links.twiki? To support different delay times more easily, maybe?

Comment: I think it's better to have, e.g., links.list and git.list, as it would be in this case, than how we had it before with the prefixes, so separate input lists are good.

Author


Unlike web.links, do web.git and web.twiki only take the urls directly in the config? If you could also support an input_list here, it would be nice for people who want to keep it separate.

For web.git, yes, for now it only takes urls. Roger that, will support input_list there!

Author


web.links, web.twiki and web.discourse support both urls and input_lists.

Author

@nausikt nausikt Apr 8, 2026


Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in the same list as other links, with the twiki options under web.links.twiki? To support different delay times more easily, maybe?

@pmlugato Got your point! Agreed that the UX can be better by not taking a separate list; will converge! 🫡

For clarity: only one web.links.input_lists as the main portal, with site/scraper-specific settings aside under the web.links.<site|spider> pattern.

    web:
        links:
          ### Global default Link configs go here, implicitly set
          max_depth: 2
          max_pages: 100
          delay: 10
          markitdown: true
          input_lists:  # Only 1 portal; we pour every link here... as long as it fits the scraper's nature.
          - examples/deployments/basic-scraping/miscellanea.list
          ### Site/spider-specific (non-list) configuration goes below.
          # <spider/site>:
          twiki:
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              keywords:

Then, behind the scenes, Archi's scraper_manager is automatically aware of each Spider and its config via a newly introduced registry instead.

DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": TWikiSpider,
    "cms-talk.web.cern.ch": DiscourseSpider,
    "indico.cern.ch": IndicoSpider,
}
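As an illustration of how scraper_manager could use such a registry to classify a flat url list, here is a minimal sketch: the spider classes are stubbed as strings, and `resolve_spider` plus the `LinkSpider` fallback are hypothetical, not this PR's actual code.

```python
# Sketch: pick a spider for a url by its host, falling back to the generic
# link spider for anything not in the registry.
from urllib.parse import urlparse

DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": "TWikiSpider",
    "cms-talk.web.cern.ch": "DiscourseSpider",
    "indico.cern.ch": "IndicoSpider",
}

def resolve_spider(url, default="LinkSpider"):
    """Resolve a spider by the url's host; unknown hosts get the default."""
    host = urlparse(url).netloc
    return DOMAIN_SPIDER_REGISTRY.get(host, default)
```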

Is this what you had in mind? But this registry has to be configured by users somewhere anyhow 🤔

Otherwise, how can scraper_manager resolve a link and know its source_kind from a flat list of urls? I don't like the prefix approach, though! IMO it deteriorates the UX.

Author

@nausikt nausikt Apr 8, 2026


How about just adding a domain key for scraper_manager to pick up as the registry!

          links:
              input_lists:
              - ....
          twiki:
              domain: "twiki.cern.ch"
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              domain: "cms-talk.web.cern.ch"
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              domain: "indico.cern.ch"
              keywords:
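A minimal sketch of how scraper_manager could derive the registry from these domain keys, so flat url lists need no prefixes. Only the YAML key names come from this thread; the dict shape and helper are hypothetical.

```python
# Sketch: build the domain -> spider-key mapping from the parsed web config.
web_config = {
    "links": {"input_lists": ["miscellanea.list"], "delay": 10},
    "twiki": {"domain": "twiki.cern.ch", "delay": 60},
    "discourse": {"domain": "cms-talk.web.cern.ch", "delay": 10},
    "indico": {"domain": "indico.cern.ch"},
}

def build_domain_registry(web):
    """Map each declared domain to its spider key; sections without a
    domain (like 'links') stay out of the registry and act as the default."""
    registry = {}
    for spider_key, sub in web.items():
        if isinstance(sub, dict) and "domain" in sub:
            registry[sub["domain"]] = spider_key
    return registry
```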

Author

@nausikt nausikt Apr 8, 2026


This way we give users the best UX: spider resolution is transparent and has the least footprint in the user's config.

With probably only one caveat[?] for now: we can't have 2 twiki/discourse/indico source instances at the same time.

          indico:
              domain: ["indico.cern.ch", "indico.mit.edu"]
              keywords: ...

Author


Comment: I think it's better to have, e.g., links.list and git.list, as it would be in this case, than how we had it before with the prefixes, so separate input lists is good.

Ack on the comment.

BTW, about naming:

  1. would you prefer me to rename git.input_lists -> git.list as well? or avoid the word list and keep it the same?
  2. would you like me to completely drop git.urls and web.links.urls and just switch to the input_lists style, to avoid confusing users? i.e., preserve only 1 standard way?
  • Although, personally, web.urls is fairly convenient for peeking at/editing everything in Archi at a glance in the same config.yaml; it may just be better UX for debugging things.

Comment on lines +271 to +283
if not isinstance(sources_section, dict):
    continue
web = sources_section.get("web", {}) or {}
if not isinstance(web, dict):
    continue
for spider_key, sub in web.items():
    if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS:
        continue
    if not isinstance(sub, dict):
        continue
    wlists = sub.get("input_lists") or []
    if isinstance(wlists, list):
        collected.extend(wlists)
Author

@nausikt nausikt Apr 8, 2026


@pmlugato input_lists are understood by all scrapers (Spiders) here, and no prefix like sso-, eos-, or git-* is required in any urls.

For SSO-protected urls, though, we must explicitly provide the right auth_provider_name: cern_sso|.... ourselves. And the SSO-protected urls should come first; by design, though, this should not be a caveat, and URLs should work in any order!
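Since the ordering only matters for efficiency, a config loader could also normalize it automatically. A minimal sketch, assuming SSO status is known per url (the helper and the predicate are hypothetical, not part of this PR):

```python
# Hypothetical helper: reorder seed urls so SSO-protected ones come first.
# SSO status cannot be derived from the url alone (a domain can host both
# public and protected pages), so a caller-supplied predicate decides.
def sso_first(urls, is_sso):
    """Stable-sort seeds: SSO-protected urls first, public urls after,
    preserving the relative order within each group."""
    return sorted(urls, key=lambda u: not is_sso(u))
```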

Collaborator

pmlugato commented Apr 8, 2026

@nausikt thanks a lot for this PR, it looks really nice! I have tested it with a few different source types and configurations, and everything seems to be working smoothly, which is great! Thanks a lot also for the active iteration offline :)

I would be happy to merge this soon into dev. Maybe we can have one more person look at it, but that can also be done after the fact...

One request before doing so: if you could write some nice documentation about all of this in docs/, it would be great, including deprecating the old version of things there. Once that's done, I think we should be almost good to go into dev.

Thanks a lot again for all the hard work!

Author

nausikt commented Apr 8, 2026

@pmlugato Thanks to you guys for having me on board!

One last thing! I've summarized all my micro-responses for you to finalize here, before I dive into converging on your review & the docs.

    web:
        links:
          ### Global default Link configs go here, implicitly set
          max_depth: 2
          max_pages: 100
          delay: 10
          markitdown: true
          # Only 1 input_lists portal; we pour every link here.
          input_lists:
          - examples/deployments/basic-scraping/miscellanea.list
          ### Site/spider-specific (non-list) configuration goes below.
          # <spider/site>:
          twiki:
              domain: "twiki.cern.ch"
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              domain: "cms-talk.web.cern.ch"
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              domain: "indico.cern.ch"
              keywords:
  1. Is this the ideal you had in mind?
  2. I will deprecate <manager>.urls and stick with the web.links.input_lists / git.input_lists pattern/style.
  3. Shall we reduce it to web.input_lists? There's no need for the redundant links level in web.links (I highly encourage this; nesting/indentation has a significant chance of confusing users, so this should be better UX).

Author

nausikt commented Apr 8, 2026

Agreed with what we discussed offline! Converging...
