[Ref] Integrate Scrapy-based scrapers into Archi interfaces #547
base: dev
examples/deployments/basic-scraping/config.yaml (new file):

```yaml
# Basic configuration file for an Archi deployment
# with a chat app interface, agent, and
# PostgreSQL with pgvector for document storage.
# The LLM is used through an existing Ollama server.
#
# Run with:
#   archi create --name my-archi-scraping --config examples/deployments/basic-scraping/config.yaml --services chatbot --hostmode

# Deployment example for CERN data sources:
# Twiki (with optional SSO) + public links + Git repos
#
# Required env vars for SSO:
#   SSO_USERNAME=xxx SSO_PASSWORD=yyy

name: my_archi

services:
  data_manager:
    port: 7872
  chat_app:
    agent_class: CMSCompOpsAgent
    agents_dir: examples/agents
    default_provider: local
    default_model: qwen3:32b
    providers:
      local:
        enabled: true
        base_url: http://submit76.mit.edu:7870  # make sure this matches your Ollama server URL!
        mode: ollama
        default_model: "qwen3:32b"  # make sure this matches a model you have downloaded locally with Ollama
        models:
          - "qwen3:32b"
    trained_on: "My data"
    port: 7868
    external_port: 7868
  vectorstore:
    backend: postgres  # PostgreSQL with pgvector (only supported backend)

data_manager:
  embedding_name: HuggingFaceEmbeddings
  sources:
    web:
      link:
        urls:
          - https://ppc.mit.edu/news/
        max_depth: 2
        max_pages: 100
        delay: 10
        markitdown: true
        input_lists:
          - examples/deployments/basic-scraping/miscellanea.list
      twiki:
        auth_provider_name: cern_sso  # remove if crawling public pages only
        anonymize_data: true
        urls:  # for now, if you have an SSO-protected Twiki, put it first for efficiency and robustness
          - https://twiki.cern.ch/twiki/bin/view/CMS/HeavyIons      # SSO-protected Twiki pages
          - https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuide  # public Twiki seed URLs
        allow:
          - ".*CRAB3.*"
          - ".*SWGuide.*"
          - ".*WorkBook.*"
          - ".*Crab.*"
          - ".*Crab3.*"
          # Crawled all possible HeavyIons + a little bit of CRAB
          # - ".*HeavyIons.*"
          # - ".*HICollisions.*"
          # - ".*HIRel.*"
        deny:
          - ".*WorkBook.*"
        max_depth: 2
        max_pages: 1000
        markitdown: true
        delay: 60
      discourse:  # we should get roughly 500-800+ anonymized markdown discussions
        auth_provider_name: cern_sso
        base_url: https://cms-talk.web.cern.ch
        delay: 10
        max_pages: 1000
        anonymize_data: true
        markitdown: true
        category_paths:
          - /c/offcomp/comptools/87
          # - /c/offcomp/ais/150
        keywords:
          - "Stefano Belforte"
          - "Katy Ellis"
          - "Krittin Phornsiricharoenphant"
          - "Vijay Chakravarty"
          - "Dario Mapelli"
          - "Thanayut Seethongchuen"
      git:
        urls:
          - https://github.com/dmwm/CRABServer
          - https://github.com/dmwm/CRABClient
```
Comment on lines +92 to +95

Collaborator: Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki URLs live in the same list as the other links, with the twiki options nested under it? I think it's better to have, e.g., one list of URLs with per-site options.
Author: @pmlugato Got your point! Agreed that the UX is better without a separate list; will converge on that. 🫡 For clarity, there would be only one `web:` section:

```yaml
web:
  links:
    ### Global default link configs go here, implicitly set here.
    max_depth: 2
    max_pages: 100
    delay: 10
    markitdown: true
    input_lists:  # Only one portal; we pour every link in here, as long as it fits the scraper's nature.
      - examples/deployments/basic-scraping/miscellanea.list
  ### Site/spider-specific (non-list) configuration goes below.
  # <spider/site>:
  twiki:
    delay: 60  # <---- overrides the global default
    deny: ....
    allow: ....
    # .... w/o list
  discourse:
    category_paths:
      - ....
    delay: 10
    keywords: ...
  indico:  # <--- the upcoming IndicoSpider will live at this level as well.
    keywords:
```

Then, behind the scenes, the Archi scraper_manager becomes aware of each Spider and its config via a newly introduced registry:

```python
DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": TWikiSpider,
    "cms-talk.web.cern.ch": DiscourseSpider,
    "indico.cern.ch": IndicoSpider,
}
```

Is this what you had in mind? This registry has to be configured by users somewhere anyhow, though 🤔 Otherwise, how can scraper_manager resolve a link and know its source_kind from a flat list of URLs? I don't like the prefix approach either! IMO it deteriorates the UX.
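The lookup such a registry enables might be sketched as follows. This is a hypothetical illustration: the spider classes are empty stand-ins for the real Scrapy spiders, and `resolve_spider` with its `LinkSpider` fallback is not code from this PR.

```python
from urllib.parse import urlparse

# Stand-ins for the real Scrapy spider classes (hypothetical).
class TWikiSpider: pass
class DiscourseSpider: pass
class IndicoSpider: pass
class LinkSpider: pass  # generic fallback for plain web pages

DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": TWikiSpider,
    "cms-talk.web.cern.ch": DiscourseSpider,
    "indico.cern.ch": IndicoSpider,
}

def resolve_spider(url: str) -> type:
    """Resolve a spider class from a URL's hostname, falling back to LinkSpider."""
    host = urlparse(url).hostname or ""
    return DOMAIN_SPIDER_REGISTRY.get(host, LinkSpider)
```

With this, a flat `input_lists` file needs no prefixes: each URL's hostname alone decides which spider handles it.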
Author: How about just adding a `domain:` key under each spider section?

```yaml
web:
  links:
    input_lists:
      - ....
  twiki:
    domain: "twiki.cern.ch"
    delay: 60  # <---- overrides the global default
    deny: ....
    allow: ....
    # .... w/o list
  discourse:
    domain: "cms-talk.web.cern.ch"
    category_paths:
      - ....
    delay: 10
    keywords: ...
  indico:  # <--- the upcoming IndicoSpider will live at this level as well.
    domain: "indico.cern.ch"
    keywords:
```
Author: This way we give users the best UX: spider resolution is transparent and has the smallest footprint in the user's config. The only caveat for now is that we can't have two twiki/discourse/indico source instances at the same time; a multi-domain form could lift that later:

```yaml
indico:
  domain: ["indico.cern.ch", "indico.mit.edu"]
  keywords: ...
```
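A sketch of how scraper_manager could derive the domain-to-spider mapping from such a config. The helper names `build_domain_map` and `source_kind` are hypothetical illustrations of the proposal, not code from this PR.

```python
from urllib.parse import urlparse

def build_domain_map(web_config: dict) -> dict:
    """Map hostnames to spider keys using the per-spider `domain:` entries."""
    domain_map = {}
    for spider_key, sub in web_config.items():
        # "links" holds global defaults and input_lists, not a spider section.
        if spider_key == "links" or not isinstance(sub, dict):
            continue
        domains = sub.get("domain")
        if isinstance(domains, str):
            domains = [domains]  # accept a single domain or a list of them
        for domain in domains or []:
            domain_map[domain] = spider_key
    return domain_map

def source_kind(url: str, domain_map: dict) -> str:
    """Classify a URL from a flat input list; unmatched hosts are plain links."""
    return domain_map.get(urlparse(url).hostname or "", "link")
```

Nothing in the user's config beyond the `domain:` keys is needed; the registry is built at load time rather than hardcoded.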
Author: Ack for the comment. BTW, about naming,
examples/deployments/basic-scraping/config.yaml (continued):

```yaml
utils:
  anonymizer:
    nlp_model: en_core_web_sm
```
examples/deployments/basic-scraping/miscellanea.list (new file):

```text
# PPC
https://ppc.mit.edu/blog/2016/05/08/hello-world/
https://ppc.mit.edu/
https://ppc.mit.edu/christoph-paus/
https://ppc.mit.edu/dmytro-kovalskyi/
https://ppc.mit.edu/gomez-ceballos/
https://ppc.mit.edu/blog/2024/11/23/lhc-finishes-a-record-year/
https://ppc.mit.edu/blog/2024/12/02/felicidades-cecilia/
https://ppc.mit.edu/blog/2015/05/21/clipboard/
https://ppc.mit.edu/blog/2025/01/12/published-first-diboson-paper-using-run-3-lhc-data/
https://ppc.mit.edu/blog/2025/01/23/student-fcc-workshop-at-mit-v3-2025/
https://ppc.mit.edu/blog/2025/01/23/new-chill-in-middleton/
https://ppc.mit.edu/blog/2025/01/24/first-linux-server-installation-for-david-and-pietro/
https://ppc.mit.edu/blog/2025/01/26/from-cern-to-mit-for-the-fcc-workshop/
https://ppc.mit.edu/publications/
https://ppc.mit.edu/blog/2025/02/08/detailed-schedule-for-the-european-strategy/
https://ppc.mit.edu/blog/2025/02/14/first-cms-week-in-2025/
https://ppc.mit.edu/blog/2025/02/18/exploring-the-higgs-boson-in-our-latest-result/
https://ppc.mit.edu/blog/2025/02/04/news-from-the-chamonix-meeting/
https://ppc.mit.edu/blog/2025/02/11/cms-data-archival-at-mit/
https://ppc.mit.edu/blog/2025/03/28/cern-gets-support-from-canada/
https://ppc.mit.edu/blog/2025/04/08/breakthrough-prize-in-physics-2025/
https://ppc.mit.edu/blog/2025/04/04/the-fcc-at-cern-a-feasibly-circular-collider/
https://ppc.mit.edu/blog/2025/04/08/cleo-reached-magic-issue-number-5000/
https://ppc.mit.edu/blog/2025/04/14/maximizing-cms-competitive-advantage/
https://ppc.mit.edu/blog/2025/04/25/sueps-at-aps-march-april-meeting/
https://ppc.mit.edu/blog/2025/04/18/round-three/
https://ppc.mit.edu/blog/2025/04/14/first-beams-with-a-splash-in-2025/
https://ppc.mit.edu/blog/2025/05/27/fcc-weak-in-vienna-building-our-future/
https://ppc.mit.edu/blog/2025/06/04/new-paper-on-arxiv-submit-a-physics-analysis-facility-at-mit/
https://ppc.mit.edu/blog/2025/06/16/summer-cms-week-2025/
https://ppc.mit.edu/blog/2025/05/05/cms-records-first-2025-high-energy-collisions/
https://ppc.mit.edu/blog/2025/06/17/long-term-vision-for-particle-physics-from-the-national-academies/
https://ppc.mit.edu/blog/2025/06/20/conclusion-of-junes-cern-council-session-has-major-consequences-for-cms/
https://ppc.mit.edu/blog/2025/06/20/highest-pileup-recorded-at-cms-last-night/
https://ppc.mit.edu/blog/2025/06/25/selfie-station-at-wilson-hall/
https://ppc.mit.edu/mariarosaria-dalfonso/
https://ppc.mit.edu/kenneth-long-2/
https://ppc.mit.edu/blog/2025/06/27/open-symposium-on-the-european-strategy-for-particle-physics/
https://ppc.mit.edu/blog/2025/07/03/bridging-physics-and-computing-throughput-computing-2025/
https://ppc.mit.edu/pietro-lugato-2/
https://ppc.mit.edu/luca-lavezzo/
https://ppc.mit.edu/zhangqier-wang-2/
https://ppc.mit.edu/blog/2025/07/14/welcome-our-first-ever-in-house-masters-student/
# A2
https://ppc.mit.edu/a2/
# Personnel
https://people.csail.mit.edu/kraska
https://physics.mit.edu/faculty/christoph-paus
```
Scrapy project configuration (new file):

```ini
[settings]
default = src.data_manager.collectors.scrapers.settings
```
ConfigurationManager changes (diff):

```diff
@@ -13,6 +13,8 @@
 STATIC_FIELDS = ['global', 'services']

+_WEB_TOP_LEVEL_STATIC_KEYS = ["enabled", "visible"]
+
 class ConfigurationManager:
     """Manages archi configuration loading and validation"""

@@ -266,10 +268,19 @@ def _collect_input_lists(self) -> None:
         for conf in self.configs:
             data_manager = conf.get('data_manager', {})
             sources_section = data_manager.get('sources', {}) or {}
-            links_section = sources_section.get('links', {}) if isinstance(sources_section, dict) else {}
-            lists = links_section.get('input_lists') or []
-            if isinstance(lists, list):
-                collected.extend(lists)
+            if not isinstance(sources_section, dict):
+                continue
+            web = sources_section.get("web", {}) or {}
+            if not isinstance(web, dict):
+                continue
+            for spider_key, sub in web.items():
+                if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS:
+                    continue
+                if not isinstance(sub, dict):
+                    continue
+                wlists = sub.get("input_lists") or []
+                if isinstance(wlists, list):
+                    collected.extend(wlists)
```
Comment on lines +271 to +283

Author: @pmlugato `input_lists` are understood by all scrapers (Spiders) here, and no prefix like sso-, eos-, or git-* is required in any URLs. For SSO-protected URLs, though, we must explicitly provide the right auth provider.
```diff
         self.input_list = sorted(set(collected)) if collected else []

     def get_enabled_sources(self) -> List[str]:
```
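The new traversal in `_collect_input_lists` can be exercised on its own. The sketch below mirrors the diff's logic as a free function (the class method in the PR stores the result on `self.input_list` instead of returning it):

```python
_WEB_TOP_LEVEL_STATIC_KEYS = ["enabled", "visible"]

def collect_input_lists(configs: list) -> list:
    """Gather `input_lists` from every spider section under data_manager.sources.web."""
    collected = []
    for conf in configs:
        sources_section = conf.get("data_manager", {}).get("sources", {}) or {}
        if not isinstance(sources_section, dict):
            continue
        web = sources_section.get("web", {}) or {}
        if not isinstance(web, dict):
            continue
        for spider_key, sub in web.items():
            # Skip non-spider flags like "enabled"/"visible" and malformed sections.
            if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS or not isinstance(sub, dict):
                continue
            wlists = sub.get("input_lists") or []
            if isinstance(wlists, list):
                collected.extend(wlists)
    return sorted(set(collected)) if collected else []
```

Note that, unlike the old code, every spider section (link, twiki, discourse, git, ...) may now carry its own `input_lists`, and the result is deduplicated and sorted.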
Collaborator: So you can either pass URLs directly in the config, or via a `.list` file with URLs as before? For the latter case, the file can hold both public and SSO URLs, without any need for a prefix, right?

Author: Exactly, Pietro! That holds for all scrapers (Spiders) by design.