
[Ref] Integrate Scrapy-based scrapers into Archi interfaces#547

Open
nausikt wants to merge 55 commits intoarchi-physics:devfrom
nausikt:ref/scrapers-to-scrapy

Conversation


@nausikt nausikt commented Apr 8, 2026

Resolves #546 and #464

There's only one culprit for SSO with Playwright + Chromium to work as expected. Please beware:

  • On macOS with an ARM-based CPU, please make sure Rosetta is enabled for your Podman machine, or available on your laptop and used by the container.

Otherwise, it may fall back to a QEMU-emulated Chromium instance, which causes race conditions!

  • Other setups (Linux, x86_64 production machines, VMs) should work fine.

End-to-End Integrated Archi Infrastructure Test

The main caveat is when seed URLs mix SSO-protected and public pages, such as on TWiki.
In those cases, please list the SSO-protected seed URLs first, followed by the remaining public/SSO-mixed URLs.

        twiki:
            auth_provider_name: cern_sso   # can be safely removed if we crawl only public pages.
            anonymize_data: true
            markitdown: true
            urls: # If any TWiki URLs are SSO-protected, please list them first for efficiency and robustness
                - https://twiki.cern.ch/twiki/bin/view/CMS/HeavyIons # SSO-protected TWiki pages.
                - https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuide # public TWiki seed urls.
archi create --name basic-scraping --config examples/deployments/basic-scraping/config.yaml -e .env --podman --services chatbot --verbosity=4 --force

Please feel free to adjust examples/deployments/basic-scraping/config.yaml as you see fit. Currently, it might take a very long time to pass through every web source.

  • LinkSpider for MIT sources

    may take less than 1 hr.

  • BUT... TWiki HeavyIons might take

    at least (300 + 100 from CRAB)++ docs x 60 seconds (crawl delay)
    ~= 24,000 seconds (6-7 hrs)

  • Discourse may take

    10 seconds (crawl delay) each x (47 pages + at least 500 docs)
    ~= 1 hr 40 mins

  • the new GitManager, as I've tested on dmwm/CRABServer and dmwm/CRABClient:

    the 2 repos have ~400 files;
    it takes less than an hr and works as efficiently as in the past.
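For reference, the rough arithmetic behind these estimates can be checked quickly (doc counts are the approximate lower bounds quoted above):

```python
# Back-of-the-envelope crawl-time estimates from the figures above.
twiki_docs = 300 + 100          # HeavyIons pages + CRAB docs (approximate)
twiki_delay = 60                # seconds of crawl delay per doc
twiki_seconds = twiki_docs * twiki_delay
print(twiki_seconds / 3600)     # roughly 6.7 hours

discourse_items = 47 + 500      # category pages + docs (at least)
discourse_delay = 10
print(discourse_items * discourse_delay / 60)  # roughly 91 minutes
```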

Expected Result
[1] All comprehensive sources are ingested, which might take hours to finish.
[Screenshot 2026-04-08 at 18:53:22]
[2] Also, most of our docs should be in anonymized Markdown format right away.
P.S. Anonymized as well as the NLP module and our heuristic-based (hard-coded) per-source name-replacement patterns allow.

Standalone Scrapy Test

More details are coming soon...

nausikt added 30 commits April 7, 2026 01:03
…C via parsers, introduce extension points parse_item, parse_follow_links, default implementation, toscrape and twiki example.
…-world twiki crawling contracts with safest default values.
…ace & instantiations to GitManager and Scrapy's ScraperManager.
…s, refactor interface, GitManager works, link, twiki public + sso sources example.
… if not main thread, bugs scraper_manager not aware of enabled flag.
@nausikt nausikt mentioned this pull request Apr 8, 2026
Author

nausikt commented Apr 8, 2026

Note that we still lack the mit_sso auth provider, which will be added ASAP.

For now, we can work around this by testing MIT sources and CERN sources in a mutually exclusive manner.

Collaborator

pmlugato commented Apr 8, 2026

Hi @nausikt, thanks for this PR! Looking into it and testing now. One thing I noticed right away: please add any new packages to the requirements files. Having them in the pyproject is fine for now for testing, but this way the images will be updated accordingly once merged to main.

Author

nausikt commented Apr 8, 2026

Oh dear, let me fix it now!

Comment on lines +44 to +51
urls:
  - https://ppc.mit.edu/news/
max_depth: 2
max_pages: 100
delay: 10
markitdown: true
input_lists:
  - examples/deployments/basic-scraping/miscellanea.list
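For context, merging inline `urls` with urls read from `.list` files could look like this minimal sketch (the helper name `collect_seeds` and the file handling are assumptions, not the PR's actual implementation):

```python
# Hypothetical sketch: combine inline urls and .list files into one seed list.
from pathlib import Path

def collect_seeds(urls, input_lists):
    """Merge inline urls with urls read from .list files (one URL per line)."""
    seeds = list(urls)
    for list_path in input_lists:
        for line in Path(list_path).read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#"):  # skip blanks and comments
                seeds.append(line)
    return seeds
```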
Collaborator


So you can pass urls either directly in the config or via a .list file, as before? For the latter case, the file can contain both public and SSO urls, without any need for a prefix, right?

Author


Exactly, Pietro! For all scrapers (spiders), by design.

Comment on lines +92 to +95
git:
  urls:
    - https://github.com/dmwm/CRABServer
    - https://github.com/dmwm/CRABClient
Collaborator

@pmlugato pmlugato Apr 8, 2026


Unlike web.links, do web.git and web.twiki only take the urls directly in the config? If you could also support an input_list here, it would be nice for people who want to keep it separate.

Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in the same list as other links, with the twiki options under web.links.twiki? To support different delay times more easily, maybe?

Comment: I think it's better to have, e.g., links.list and git.list, as it would be in this case, than how we had it before with the prefixes, so separate input lists are good.

Author


Unlike web.links, do web.git and web.twiki only take the urls directly in the config? If you could also support an input_list here, it would be nice for people who want to keep it separate.

For web.git, yes, for now it only takes urls. Roger that, will support input_list there!

Author


web.links, web.twiki and web.discourse support both urls and input_lists.

Author

@nausikt nausikt Apr 8, 2026


Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in the same list as other links, with the twiki options under web.links.twiki? To support different delay times more easily, maybe?

@pmlugato Got your point! Agreed that the UX can be better by not taking a separate list; will converge! 🫡

For clarity: only one web.links.input_lists as the main portal, with site/scraper-specific settings aside under the web.links.<site|spider> pattern.

    web:
        links:
          ### Global default Link configs go here, implicitly set
          max_depth: 2
          max_pages: 100
          delay: 10
          markitdown: true
          input_lists:  # Only 1 portal; we pour every link here... as long as it fits the scraper's nature.
          - examples/deployments/basic-scraping/miscellanea.list
          ### Site/spider-specific (non-list) configuration goes below.
          # <spider/site>:
          twiki:
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              keywords:

Then, behind the scenes, Archi's scraper_manager is automatically aware of each Spider and its config via a newly introduced registry instead.

DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": TWikiSpider,
    "cms-talk.web.cern.ch": DiscourseSpider,
    "indico.cern.ch": IndicoSpider,
}
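As an illustration of how scraper_manager could use such a registry to classify a flat url list, here is a minimal sketch: the spider classes are stubbed as strings, and `resolve_spider` plus the `LinkSpider` fallback are hypothetical, not this PR's actual code.

```python
# Sketch: pick a spider for a url by its host, falling back to the generic
# link spider for anything not in the registry.
from urllib.parse import urlparse

DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": "TWikiSpider",
    "cms-talk.web.cern.ch": "DiscourseSpider",
    "indico.cern.ch": "IndicoSpider",
}

def resolve_spider(url, default="LinkSpider"):
    """Resolve a spider by the url's host; unknown hosts get the default."""
    host = urlparse(url).netloc
    return DOMAIN_SPIDER_REGISTRY.get(host, default)
```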

Is this what you had in mind? But this registry has to be configured by users somewhere anyhow 🤔

Otherwise, how can scraper_manager resolve a link and know its source_kind from a flat list of urls? I don't like the prefix approach, though! IMO it deteriorates the UX.

Author

@nausikt nausikt Apr 8, 2026


How about just adding a domain key for scraper_manager to pick up as the registry!

          links:
              input_lists:
              - ....
          twiki:
              domain: "twiki.cern.ch"
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              domain: "cms-talk.web.cern.ch"
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              domain: "indico.cern.ch"
              keywords:
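A minimal sketch of how scraper_manager could derive the registry from these domain keys, so flat url lists need no prefixes. Only the YAML key names come from this thread; the dict shape and helper are hypothetical.

```python
# Sketch: build the domain -> spider-key mapping from the parsed web config.
web_config = {
    "links": {"input_lists": ["miscellanea.list"], "delay": 10},
    "twiki": {"domain": "twiki.cern.ch", "delay": 60},
    "discourse": {"domain": "cms-talk.web.cern.ch", "delay": 10},
    "indico": {"domain": "indico.cern.ch"},
}

def build_domain_registry(web):
    """Map each declared domain to its spider key; sections without a
    domain (like 'links') stay out of the registry and act as the default."""
    registry = {}
    for spider_key, sub in web.items():
        if isinstance(sub, dict) and "domain" in sub:
            registry[sub["domain"]] = spider_key
    return registry
```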

Author

@nausikt nausikt Apr 8, 2026


This way we give users the best UX: spider resolution is transparent and has the least footprint in the user's config.

With probably only one caveat[?] for now: we can't have 2 twiki/discourse/indico source instances at the same time.

          indico:
              domain: ["indico.cern.ch", "indico.mit.edu"]
              keywords: ...

Author


Comment: I think it's better to have, e.g., links.list and git.list, as it would be in this case, than how we had it before with the prefixes, so separate input lists is good.

Ack on the comment.

BTW, about naming:

  1. would you prefer me to rename git.input_lists -> git.list as well? or avoid the word list and keep it the same?
  2. would you like me to completely drop git.urls and web.links.urls and just switch to the input_lists style, to avoid confusing users? i.e., preserve only 1 standard way?
  • Although, personally, web.urls is fairly convenient for peeking at/editing everything in Archi at a glance in the same config.yaml; it may just be better UX for debugging things.

Comment on lines +271 to +283
if not isinstance(sources_section, dict):
    continue
web = sources_section.get("web", {}) or {}
if not isinstance(web, dict):
    continue
for spider_key, sub in web.items():
    if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS:
        continue
    if not isinstance(sub, dict):
        continue
    wlists = sub.get("input_lists") or []
    if isinstance(wlists, list):
        collected.extend(wlists)
Author

@nausikt nausikt Apr 8, 2026


@pmlugato input_lists are understood by all scrapers (Spiders) here, and no prefix like sso-, eos-, or git-* is required in any urls.

For SSO-protected urls, though, we must explicitly provide the right auth_provider_name: cern_sso|.... ourselves. And the SSO-protected urls should come first; by design, though, this should not be a caveat, and URLs should work in any order!
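Since the ordering only matters for efficiency, a config loader could also normalize it automatically. A minimal sketch, assuming SSO status is known per url (the helper and the predicate are hypothetical, not part of this PR):

```python
# Hypothetical helper: reorder seed urls so SSO-protected ones come first.
# SSO status cannot be derived from the url alone (a domain can host both
# public and protected pages), so a caller-supplied predicate decides.
def sso_first(urls, is_sso):
    """Stable-sort seeds: SSO-protected urls first, public urls after,
    preserving the relative order within each group."""
    return sorted(urls, key=lambda u: not is_sso(u))
```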

Collaborator

pmlugato commented Apr 8, 2026

@nausikt thanks a lot for this PR, it looks really nice! I have tested it with a few different source types and configurations, and everything seems to be working smoothly, which is great! Thanks a lot also for the active iteration offline :)

I would be happy to merge this soon into dev. Maybe we can have one more person look at it, but that can also be done after the fact...

One request before doing so: if you could write some nice documentation about all of this in docs/, it would be great, including deprecating the old version of things there. Once that's done, I think we should be almost good to go into dev.

Thanks a lot again for all the hard work!

Author

nausikt commented Apr 8, 2026

@pmlugato Thanks to you guys for having me on board!

One last thing! I've summarized all my micro-responses for you to finalize here, before I dive into converging on your review & the docs.

    web:
        links:
          ### Global default Link configs go here, implicitly set
          max_depth: 2
          max_pages: 100
          delay: 10
          markitdown: true
          # Only 1 input_lists portal; we pour every link here.
          input_lists:
          - examples/deployments/basic-scraping/miscellanea.list
          ### Site/spider-specific (non-list) configuration goes below.
          # <spider/site>:
          twiki:
              domain: "twiki.cern.ch"
              delay: 60  # <---- overrides global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              domain: "cms-talk.web.cern.ch"
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will stay at this level as well.
              domain: "indico.cern.ch"
              keywords:
  1. Is this the ideal you had in mind?
  2. I will deprecate <manager>.urls and stick with the web.links.input_lists / git.input_lists pattern/style.
  3. Shall we reduce it to web.input_lists? There's no need for the redundant links level in web.links (I highly encourage this; nesting/indentation has a significant chance of confusing users, so this should be better UX).

Author

nausikt commented Apr 8, 2026

Agreed with what we discussed offline! Converging...
