55 commits
47dcb94
scrapy project scaffolding for backend revamp, trivial twiki spider.
nausikt Mar 22, 2026
1a6f6fc
key scrapy setting with safe defaults.
nausikt Mar 22, 2026
15f5b0c
scrapy check twiki, scrapy's magic to e2e test against any Item contr…
nausikt Mar 22, 2026
9313696
explicitly set RFPDupeFilter, proper Scrapy-Archi Item definitions.
nausikt Mar 22, 2026
1ac3468
added trivial LinkScraper implementation in scrapy, scrapers.utils, s…
nausikt Mar 22, 2026
9035fe6
Unit-testable parser practice with a trivial real Twiki parser offlin…
nausikt Mar 22, 2026
8e9ac37
scrapers resource adapter, scrapy Item -> Archi's ScrapedResource.
nausikt Mar 22, 2026
94ea4da
preserve source_type=web for now, rearrange resource_adapter & unit-t…
nausikt Mar 22, 2026
1c666aa
generic items, adapters, pipelines which encourage OCP.
nausikt Mar 29, 2026
b7fc00f
generic LinkSpider for subclassing, clear Open/Closed boundaries + So…
nausikt Mar 30, 2026
dfce70d
refactor LinkSpider, accepted/accounted real Twiki-usecases configura…
nausikt Mar 30, 2026
649b5c7
cleaner by put the scrapy contracts under parse, scrapy checkable on …
nausikt Mar 30, 2026
daba5c2
refactored how to proper crawler engine settings, safe default values…
nausikt Mar 30, 2026
2ab074a
cleanest way to normalize url via LinkExtractor's process_value, Real…
nausikt Mar 30, 2026
b4e5192
add more generic twiki default_deny patterns.
nausikt Mar 30, 2026
742811a
refactored AuthProvider, Middlewares, with clear OCP, SoC boundary, w…
nausikt Mar 31, 2026
7fa9073
decouple git collection from scrapers; add GitResource and GitManager.
nausikt Mar 31, 2026
6128af4
add scraper_manager at collectors level, e2e wired from legacy interf…
nausikt Mar 31, 2026
c4da024
remove deprecated scrapers and dead code before interface revision.
nausikt Mar 31, 2026
b62c273
introduce generic link item parser, removed toscrape example.
nausikt Mar 31, 2026
08d1ddc
base LinkSpider support allowed_domains.
nausikt Mar 31, 2026
355c128
dynamic scraper source loader, shared PersistencePipeline among scrap…
nausikt Mar 31, 2026
6da57f8
revert to proper default w/o sso-enforced.
nausikt Mar 31, 2026
38ca452
scrapers params will be under source.web, fix bugs.
nausikt Mar 31, 2026
9903f21
migrate to new interface.
nausikt Mar 31, 2026
88f0fee
add basic-scraping example.
nausikt Mar 31, 2026
f8c8ee3
backward-compatible workaround; Git credential should be optional.
nausikt Mar 31, 2026
70efb66
bug fixes, spider_loader can't find scrapy settings, sso-scraping dep…
nausikt Mar 31, 2026
ad7e7b3
fix bugs twiki non-text response leak, playwright handler can't start…
nausikt Apr 1, 2026
be9b125
turn-on/test twiki.
nausikt Apr 1, 2026
42893f5
fix bug incorrectly copy/resolve input_lists from container weblists …
nausikt Apr 1, 2026
e56fa7f
test only twiki, no longer need the backward-compatible workaround.
nausikt Apr 1, 2026
34bf2e3
sources.links no longer exists, new sources.web have no required fields.
nausikt Apr 1, 2026
e0a8df0
git and sso as well no longer depend on sources.links.
nausikt Apr 1, 2026
7acda5d
fix twiki is too strict for user, has to always set enabled=True;
nausikt Apr 1, 2026
4b7d499
less noisy, more clean & robust twiki generic parsed response.
nausikt Apr 1, 2026
25af09c
no links,pdfs has been discard, set title, much robust, collect body …
nausikt Apr 1, 2026
3549679
ignored old doc and archives format, more robust body extractor.
nausikt Apr 1, 2026
536b2db
fix scrapy will add `.` for us!
nausikt Apr 1, 2026
c399241
informative logging about follow_links.
nausikt Apr 1, 2026
2142948
clean basic-scraping config example.
nausikt Apr 2, 2026
d1356d9
[Discourse] support recursion/iterator-based scraper, with cern_sso
nausikt Apr 2, 2026
3f12867
[Discourse] refined interfaces and config example.
nausikt Apr 2, 2026
621a5a3
[Discourse] ScraperManager now support iterative-based Spider, no dep…
nausikt Apr 2, 2026
c559664
[Discourse] bring back full example
nausikt Apr 2, 2026
995f242
[Discourse] workaround store rss as html, only best support format by…
nausikt Apr 2, 2026
7a1c6f5
[Discourse] scraped resource url better have no .rss
nausikt Apr 2, 2026
c75e8e5
scrapers support built-in anonymization
nausikt Apr 2, 2026
8aec27e
[Anonymizer][Discourse, Twiki] cover markups html, rss as much as pos…
nausikt Apr 4, 2026
8aa441d
[Markitdown] support straight forward markitdown with second pass ano…
nausikt Apr 5, 2026
0656666
realistic, comprehensive test configurations.
nausikt Apr 8, 2026
c0bd84b
fix renaming missing from ref PR.
nausikt Apr 8, 2026
456ab3d
fix markitdown dep was missing.
nausikt Apr 8, 2026
ace75e1
remove noises from local deployments.
nausikt Apr 8, 2026
6b9dfd5
moved deps to proper requirements.txt
nausikt Apr 8, 2026
98 changes: 98 additions & 0 deletions examples/deployments/basic-scraping/config.yaml
@@ -0,0 +1,98 @@
# Basic configuration file for an Archi deployment
# with a chat app interface, agent, and
# PostgreSQL with pgvector for document storage.
# The LLM is used through an existing Ollama server.
#
# run with:
# archi create --name my-archi-scraping --config examples/deployments/basic-scraping/config.yaml --services chatbot --hostmode

# Deployment example for CERN data sources:
# Twiki (with optional SSO) + public links + Git repos
#
# Required env vars for SSO:
# SSO_USERNAME=xxx SSO_PASSWORD=yyy

name: my_archi

services:
data_manager:
port: 7872
chat_app:
agent_class: CMSCompOpsAgent
agents_dir: examples/agents
default_provider: local
default_model: qwen3:32b
providers:
local:
enabled: true
base_url: http://submit76.mit.edu:7870 # make sure this matches your ollama server URL!
mode: ollama
default_model: "qwen3:32b" # make sure this matches a model you have downloaded locally with ollama
models:
- "qwen3:32b"
trained_on: "My data"
port: 7868
external_port: 7868
vectorstore:
backend: postgres # PostgreSQL with pgvector (only supported backend)

data_manager:
embedding_name: HuggingFaceEmbeddings
sources:
web:
link:
urls:
- https://ppc.mit.edu/news/
max_depth: 2
max_pages: 100
delay: 10
markitdown: true
input_lists:
- examples/deployments/basic-scraping/miscellanea.list
Comment on lines +44 to +51
Collaborator

so you can both pass urls directly in the config, or via a .list file with urls as before? For the latter case, the file can have both public and sso, without any need for a prefix, right?

Author


Exactly, Pietro! That holds for all scrapers (spiders) by design.

twiki:
auth_provider_name: cern_sso # remove if crawling public pages only
anonymize_data: true
urls: # for now, if you have SSO-protected twiki pages, put them first for efficiency and robustness.
- https://twiki.cern.ch/twiki/bin/view/CMS/HeavyIons # sso-protected twiki pages.
- https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuide # public twiki seed urls.
allow:
- ".*CRAB3.*"
- ".*SWGuide.*"
- ".*WorkBook.*"
- ".*Crab.*"
- ".*Crab3.*"
# Crawled all possible HeavyIons + a little bit of CRAB
# - ".*HeavyIons.*"
# - ".*HICollisions.*"
# - ".*HIRel.*"
deny:
- ".*WorkBook.*"
max_depth: 2
max_pages: 1000
markitdown: true
delay: 60
discourse: # expect roughly 500-800+ anonymized markdown discussions.
auth_provider_name: cern_sso
base_url: https://cms-talk.web.cern.ch
delay: 10
max_pages: 1000
anonymize_data: true
markitdown: true
category_paths:
- /c/offcomp/comptools/87
# - /c/offcomp/ais/150
keywords:
- "Stefano Belforte"
- "Katy Ellis"
- "Krittin Phornsiricharoenphant"
- "Vijay Chakravarty"
- "Dario Mapelli"
- "Thanayut Seethongchuen"
git:
urls:
- https://github.com/dmwm/CRABServer
- https://github.com/dmwm/CRABClient
Comment on lines +92 to +95
Collaborator

@pmlugato pmlugato Apr 8, 2026

unlike web.links, web.git and web.twiki only take the urls directly in the config? if you can also support an input_list here, would be nice for people who want to keep it separate.

Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in same list as other links and twiki options would be under web.links.twiki? to support different delay times more easily maybe?

Comment: I think it's better to have, e.g., links.list and git.list, as it would be in this case, than how we had it before with the prefixes, so separate input lists is good.

Author

unlike web.links, web.git and web.twiki only take the urls directly in the config? if you can also support an input_list here, would be nice for people who want to keep it separate.

For web.git, yes, it currently only takes urls. Roger that, I will support input_list there too!

Author

web.link, web.twiki and web.discourse support both urls and input_list.

Author

@nausikt nausikt Apr 8, 2026

Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in same list as other links and twiki options would be under web.links.twiki? to support different delay times more easily maybe?

@pmlugato Got your point! I agree the UX can be better without the separate list; will converge on that!🫡

For clarity: a single web.links.input_lists serves as the main portal, with site/scraper-specific settings under the web.links.<site|spider> pattern.

    web:
        links:
          ### Global Default Link configs go here, implicitly set here
          max_depth: 2
          max_pages: 100
          delay: 10
          markitdown: true
          input_lists:  # Only one portal; every link goes here... as long as it fits the scraper's nature.
          - examples/deployments/basic-scraping/miscellanea.list
          ### Site/Spider non-list-related/specific configuration goes below.
          # <spider/site>:
          twiki:
              delay: 60  # <---- override global
              deny: ....
              allow: ....
              # .... w/o list
          discourse: 
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will live at this level as well.
              keywords:

Then, behind the scenes, Archi's scraper_manager is automatically aware of each Spider and its config via the newly introduced registry instead.

DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": TWikiSpider,
    "cms-talk.web.cern.ch": DiscourseSpider,
    "indico.cern.ch": IndicoSpider,
}

Is this what you had in mind? This registry still has to be configured by users somewhere, though🤔

Otherwise, how can scraper_manager resolve a link and infer its source_kind from a flat list of urls? I don't like the prefix approach, though! IMO it deteriorates the UX.
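For concreteness, the domain-based resolution I have in mind would look roughly like this sketch (the string spider names and the LinkSpider fallback are illustrative assumptions, not the actual project code, which would map to Spider classes):

```python
from urllib.parse import urlparse

# Illustrative registry: hostnames map to spider names. In the real code
# these would be Spider classes; strings keep the sketch self-contained.
DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": "TWikiSpider",
    "cms-talk.web.cern.ch": "DiscourseSpider",
    "indico.cern.ch": "IndicoSpider",
}

def resolve_spider(url: str, default: str = "LinkSpider") -> str:
    """Resolve which spider should handle a URL from its hostname,
    falling back to the generic link spider for unknown domains."""
    host = urlparse(url).hostname or ""
    return DOMAIN_SPIDER_REGISTRY.get(host, default)
```

Any URL from a flat input list then resolves transparently, with no prefix in the list file itself.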

Author

@nausikt nausikt Apr 8, 2026

How about just adding a domain key for scraper_manager to pick up as the registry!

          links:
              input_lists:
              - ....
          twiki:
              domain: "twiki.cern.ch"
              delay: 60  # <---- override global
              deny: ....
              allow: ....
              # .... w/o list
          discourse:
              domain: "cms-talk.web.cern.ch"
              category_paths:
                - ....
              delay: 10
              keywords: ...
          indico:   # <--- the upcoming IndicoSpider will live at this level as well.
              domain: "indico.cern.ch"
              keywords:

Author

@nausikt nausikt Apr 8, 2026

This way we give users the best UX: spider resolution is transparent and has the least footprint in the user's config.

The only caveat for now, probably, is that we can't have two twiki/discourse/indico source instances at the same time.

          indico:
              domain: "indico.cern.ch", "indico.mit.edu"
              keywords: ...
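
If `domain` accepted either a string or a list, building the registry from the config could be sketched as follows (the key names and config shape are assumptions, not the current implementation):

```python
# Hypothetical sketch: turn the per-spider `domain` keys under sources.web
# into a hostname -> spider-name registry. A list value lets one spider
# (e.g. indico) serve several hosts, which would lift the caveat above.
def build_registry(web_cfg: dict) -> dict:
    registry = {}
    for spider_name, sub in web_cfg.items():
        if not isinstance(sub, dict):
            continue  # skip scalar top-level keys like enabled/visible
        domains = sub.get("domain")
        if isinstance(domains, str):
            domains = [domains]  # normalize a single domain to a list
        for domain in domains or []:
            registry[domain] = spider_name
    return registry
```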

Author

Comment: I think it's better to have, e.g., links.list and git.list, as it would be in this case, than how we had it before with the prefixes, so separate input lists is good.

Ack for the comment,

BTW, about naming,

  1. would you prefer me to rename git.input_lists -> git.list as well, or avoid the word "list" and keep it the same?
  2. would you like me to completely drop git.urls and web.links.urls and switch to the input_lists style only, to avoid confusing users and preserve a single standard way?
  • Although, personally, web.urls is fairly convenient: I can peek at and edit everything in Archi at a glance in the same config.yaml, so it may just be better UX for debugging things.

utils:
anonymizer:
nlp_model: en_core_web_sm
49 changes: 49 additions & 0 deletions examples/deployments/basic-scraping/miscellanea.list
@@ -0,0 +1,49 @@
# PPC
https://ppc.mit.edu/blog/2016/05/08/hello-world/
https://ppc.mit.edu/
https://ppc.mit.edu/christoph-paus/
https://ppc.mit.edu/dmytro-kovalskyi/
https://ppc.mit.edu/gomez-ceballos/
https://ppc.mit.edu/blog/2024/11/23/lhc-finishes-a-record-year/
https://ppc.mit.edu/blog/2024/12/02/felicidades-cecilia/
https://ppc.mit.edu/blog/2015/05/21/clipboard/
https://ppc.mit.edu/blog/2025/01/12/published-first-diboson-paper-using-run-3-lhc-data/
https://ppc.mit.edu/blog/2025/01/23/student-fcc-workshop-at-mit-v3-2025/
https://ppc.mit.edu/blog/2025/01/23/new-chill-in-middleton/
https://ppc.mit.edu/blog/2025/01/24/first-linux-server-installation-for-david-and-pietro/
https://ppc.mit.edu/blog/2025/01/26/from-cern-to-mit-for-the-fcc-workshop/
https://ppc.mit.edu/publications/
https://ppc.mit.edu/blog/2025/02/08/detailed-schedule-for-the-european-strategy/
https://ppc.mit.edu/blog/2025/02/14/first-cms-week-in-2025/
https://ppc.mit.edu/blog/2025/02/18/exploring-the-higgs-boson-in-our-latest-result/
https://ppc.mit.edu/blog/2025/02/04/news-from-the-chamonix-meeting/
https://ppc.mit.edu/blog/2025/02/11/cms-data-archival-at-mit/
https://ppc.mit.edu/blog/2025/03/28/cern-gets-support-from-canada/
https://ppc.mit.edu/blog/2025/04/08/breakthrough-prize-in-physics-2025/
https://ppc.mit.edu/blog/2025/04/04/the-fcc-at-cern-a-feasibly-circular-collider/
https://ppc.mit.edu/blog/2025/04/08/cleo-reached-magic-issue-number-5000/
https://ppc.mit.edu/blog/2025/04/14/maximizing-cms-competitive-advantage/
https://ppc.mit.edu/blog/2025/04/25/sueps-at-aps-march-april-meeting/
https://ppc.mit.edu/blog/2025/04/18/round-three/
https://ppc.mit.edu/blog/2025/04/14/first-beams-with-a-splash-in-2025/
https://ppc.mit.edu/blog/2025/05/27/fcc-weak-in-vienna-building-our-future/
https://ppc.mit.edu/blog/2025/06/04/new-paper-on-arxiv-submit-a-physics-analysis-facility-at-mit/
https://ppc.mit.edu/blog/2025/06/16/summer-cms-week-2025/
https://ppc.mit.edu/blog/2025/05/05/cms-records-first-2025-high-energy-collisions/
https://ppc.mit.edu/blog/2025/06/17/long-term-vision-for-particle-physics-from-the-national-academies/
https://ppc.mit.edu/blog/2025/06/20/conclusion-of-junes-cern-council-session-has-major-consequences-for-cms/
https://ppc.mit.edu/blog/2025/06/20/highest-pileup-recorded-at-cms-last-night/
https://ppc.mit.edu/blog/2025/06/25/selfie-station-at-wilson-hall/
https://ppc.mit.edu/mariarosaria-dalfonso/
https://ppc.mit.edu/kenneth-long-2/
https://ppc.mit.edu/blog/2025/06/27/open-symposium-on-the-european-strategy-for-particle-physics/
https://ppc.mit.edu/blog/2025/07/03/bridging-physics-and-computing-throughput-computing-2025/
https://ppc.mit.edu/pietro-lugato-2/
https://ppc.mit.edu/luca-lavezzo/
https://ppc.mit.edu/zhangqier-wang-2/
https://ppc.mit.edu/blog/2025/07/14/welcome-our-first-ever-in-house-masters-student/
# A2
https://ppc.mit.edu/a2/
# Personnel
https://people.csail.mit.edu/kraska
https://physics.mit.edu/faculty/christoph-paus
3 changes: 3 additions & 0 deletions requirements/requirements-base.txt
@@ -86,3 +86,6 @@ aiohttp==3.9.5
nltk==3.9.1
sentence-transformers==5.1.2
rank_bm25==0.2.2
Scrapy==2.14.2
playwright==1.58.0
markitdown==0.1.5
2 changes: 2 additions & 0 deletions scrapy.cfg
@@ -0,0 +1,2 @@
[settings]
default = src.data_manager.collectors.scrapers.settings
5 changes: 2 additions & 3 deletions src/bin/service_data_manager.py
@@ -74,9 +74,8 @@ def trigger_update() -> None:

schedule_map: Dict[str, Callable[[Optional[str]], None]] = {
"local_files": lambda last_run=None: data_manager.localfile_manager.schedule_collect_local_files(data_manager.persistence, last_run=last_run),
"links": lambda last_run=None: data_manager.scraper_manager.schedule_collect_links(data_manager.persistence, last_run=last_run),
"git": lambda last_run=None: data_manager.scraper_manager.schedule_collect_git(data_manager.persistence, last_run=last_run),
"sso": lambda last_run=None: data_manager.scraper_manager.schedule_collect_sso(data_manager.persistence, last_run=last_run),
"web": lambda last_run=None: data_manager.scraper_manager.schedule_collect(last_run=last_run),
"git": lambda last_run=None: data_manager.git_manager.schedule_collect_git(data_manager.persistence, last_run=last_run),
"jira": lambda last_run=None: data_manager.ticket_manager.schedule_collect_jira(data_manager.persistence, last_run=last_run),
"redmine": lambda last_run=None: data_manager.ticket_manager.schedule_collect_redmine(data_manager.persistence, last_run=last_run),
}
19 changes: 15 additions & 4 deletions src/cli/managers/config_manager.py
@@ -13,6 +13,8 @@

STATIC_FIELDS = ['global', 'services']

_WEB_TOP_LEVEL_STATIC_KEYS = ["enabled", "visible"]

class ConfigurationManager:
"""Manages archi configuration loading and validation"""

@@ -266,10 +268,19 @@ def _collect_input_lists(self) -> None:
for conf in self.configs:
data_manager = conf.get('data_manager', {})
sources_section = data_manager.get('sources', {}) or {}
links_section = sources_section.get('links', {}) if isinstance(sources_section, dict) else {}
lists = links_section.get('input_lists') or []
if isinstance(lists, list):
collected.extend(lists)
if not isinstance(sources_section, dict):
continue
web = sources_section.get("web", {}) or {}
if not isinstance(web, dict):
continue
for spider_key, sub in web.items():
if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS:
continue
if not isinstance(sub, dict):
continue
wlists = sub.get("input_lists") or []
if isinstance(wlists, list):
collected.extend(wlists)
Comment on lines +271 to +283
Author

@nausikt nausikt Apr 8, 2026

@pmlugato input_lists are understood by all scrapers (Spider) here and no prefix like sso-, eos-, or git-* is required in any urls.

For SSO-protected urls, though, we must explicitly provide the right auth_provider_name: cern_sso|.... ourselves. The SSO-protected urls currently have to come first, but by design this should not be a caveat: URLs should work in any order!
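For reviewers, the collection logic in this hunk boils down to roughly the following standalone sketch (simplified, not the exact project code):

```python
# Simplified sketch of _collect_input_lists: gather `input_lists` from every
# spider subsection under sources.web, skipping the static top-level keys.
_WEB_TOP_LEVEL_STATIC_KEYS = ["enabled", "visible"]

def collect_input_lists(sources_section: dict) -> list:
    collected = []
    web = sources_section.get("web", {}) or {}
    if not isinstance(web, dict):
        return []
    for spider_key, sub in web.items():
        if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS or not isinstance(sub, dict):
            continue
        wlists = sub.get("input_lists") or []
        if isinstance(wlists, list):
            collected.extend(wlists)
    return sorted(set(collected))  # deduplicated, stable order
```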

self.input_list = sorted(set(collected)) if collected else []

def get_enabled_sources(self) -> List[str]:
18 changes: 8 additions & 10 deletions src/cli/source_registry.py
@@ -23,11 +23,9 @@ def __init__(self) -> None:
def _register_defaults(self) -> None:
self.register(
SourceDefinition(
name="links",
description="Basic HTTP/HTTPS link scraping from input lists",
required_config_fields=[
"data_manager.sources.links.input_lists",
],
name="web",
description="Basic HTTP/HTTPS Scrapy web sources, seeded from urls and/or input_lists",
required_config_fields=[],
)
)
self.register(
@@ -36,17 +34,17 @@ def _register_defaults(self) -> None:
description="SSO-backed web crawling",
required_secrets=["SSO_USERNAME", "SSO_PASSWORD"],
required_config_fields=[
"data_manager.sources.links.selenium_scraper.selenium_class",
"data_manager.sources.web",
],
depends_on=["links"],
depends_on=["web"],
)
)
self.register(
SourceDefinition(
name="git",
description="Git repository scraping for MkDocs-based documentation",
required_secrets=["GIT_USERNAME", "GIT_TOKEN"],
depends_on=["links"],
description="Git repository scraping for MkDocs-based documentation. Optional GIT_USERNAME/GIT_TOKEN for private repos.",
required_secrets=[], # was ["GIT_USERNAME", "GIT_TOKEN"]
depends_on=[], # no longer depends on links or web; considered a standalone manager.
)
)
self.register(
92 changes: 62 additions & 30 deletions src/cli/templates/base-config.yaml
@@ -188,40 +188,72 @@ data_manager:
{%- for path in paths %}
- {{ path }}
{%- endfor %}
links:
base_source_depth: {{ data_manager.sources.links.base_source_depth | default(1, true) }}
max_pages: {{ data_manager.sources.links.max_pages | default(null, true) }}
enabled: {{ data_manager.sources.links.enabled | default(true, true) }}
visible: {{ data_manager.sources.links.visible | default(true, true) }}
schedule: '{{ data_manager.sources.links.schedule | default("", true) }}'
input_lists:
{%- set link_lists = data_manager.sources.links.input_lists | default([], true) %}
{%- for input_list in link_lists %}
- {{ input_list }}
{%- endfor %}
html_scraper:
reset_data: {{ data_manager.sources.links.html_scraper.reset_data | default(true, true) }}
verify_urls: {{ data_manager.sources.links.html_scraper.verify_urls | default(false, true) }}
enable_warnings: {{ data_manager.sources.links.html_scraper.enable_warnings | default(false, true) }}
selenium_scraper:
enabled: {{ data_manager.sources.links.selenium_scraper.selenium_scraper.enabled | default(false, True) }}
visible: {{ data_manager.sources.links.selenium_scraper.selenium_scraper.visible | default(false, true) }}
use_for_scraping: {{ data_manager.sources.links.selenium_scraper.use_for_scraping | default(false, true) }}
selenium_class: {{ data_manager.sources.links.selenium_scraper.selenium_class | default('CERNSSOScraper', true) }}
selenium_url: {{ data_manager.sources.links.selenium_scraper.selenium_url | default('null', true) }}
selenium_class_map:
CERNSSOScraper:
class: {{ data_manager.sources.links.selenium_scraper.selenium_class_map.CERNSSOScraper.class | default('CERNSSOScraper', true) }}
kwargs:
headless: {{ data_manager.sources.links.selenium_scraper.selenium_class_map.CERNSSOScraper.kwargs.headless | default(true, true) }}
web:
enabled: {{ data_manager.sources.web.enabled | default(true, true) }}
visible: {{ data_manager.sources.web.visible | default(true, true) }}
link:
enabled: {{ data_manager.sources.web.link.enabled | default(true, true) }}
auth_provider_name: {{ data_manager.sources.web.link.auth_provider_name | default("", true) }}
schedule: '{{ data_manager.sources.web.link.schedule | default("", true) }}'
max_depth: {{ data_manager.sources.web.link.max_depth | default(3, true) }}
max_pages: {{ data_manager.sources.web.link.max_pages | default(null, true) }}
delay: {{ data_manager.sources.web.link.delay | default(1, true) }}
allow: {{ data_manager.sources.web.link.allow | default([], true) | tojson }}
deny: {{ data_manager.sources.web.link.deny | default([], true) | tojson }}
anonymize_data: {{ data_manager.sources.web.link.anonymize_data | default(false, true) }}
markitdown: {{ data_manager.sources.web.link.markitdown | default(false, true) }}
input_lists:
{%- for l in data_manager.sources.web.link.input_lists | default([], true) %}
- {{ l }}
{%- endfor %}
urls:
{%- for u in data_manager.sources.web.link.urls | default([], true) %}
- {{ u }}
{%- endfor %}
twiki:
enabled: {{ data_manager.sources.web.twiki.enabled | default(true, true) }}
auth_provider_name: {{ data_manager.sources.web.twiki.auth_provider_name | default("", true) }}
schedule: '{{ data_manager.sources.web.twiki.schedule | default("", true) }}'
max_depth: {{ data_manager.sources.web.twiki.max_depth | default(2, true) }}
max_pages: {{ data_manager.sources.web.twiki.max_pages | default(100, true) }}
delay: {{ data_manager.sources.web.twiki.delay | default(60, true) }}
allow: {{ data_manager.sources.web.twiki.allow | default([], true) | tojson }}
deny: {{ data_manager.sources.web.twiki.deny | default([], true) | tojson }}
anonymize_data: {{ data_manager.sources.web.twiki.anonymize_data | default(false, true) }}
markitdown: {{ data_manager.sources.web.twiki.markitdown | default(false, true) }}
input_lists:
{%- for list in data_manager.sources.web.twiki.input_lists | default([], true) %}
- {{ list }}
{%- endfor %}
urls:
{%- for url in data_manager.sources.web.twiki.urls | default([], true) %}
- {{ url }}
{%- endfor %}
discourse:
enabled: {{ data_manager.sources.web.discourse.enabled | default(true, true) }}
auth_provider_name: {{ data_manager.sources.web.discourse.auth_provider_name | default("cern_sso", true) }}
schedule: '{{ data_manager.sources.web.discourse.schedule | default("", true) }}'
max_pages: {{ data_manager.sources.web.discourse.max_pages | default(500, true) }}
delay: {{ data_manager.sources.web.discourse.delay | default(10, true) }}
anonymize_data: {{ data_manager.sources.web.discourse.anonymize_data | default(false, true) }}
markitdown: {{ data_manager.sources.web.discourse.markitdown | default(false, true) }}
base_url: {{ data_manager.sources.web.discourse.base_url | default("https://cms-talk.web.cern.ch", true) }}
keywords:
{%- for keyword in data_manager.sources.web.discourse.keywords | default([], true) %}
- {{ keyword }}
{%- endfor %}
category_paths:
{%- for category_path in data_manager.sources.web.discourse.category_paths | default([], true) %}
- {{ category_path }}
{%- endfor %}
git:
enabled: {{ data_manager.sources.git.enabled | default(true, true) }}
visible: {{ data_manager.sources.git.visible | default(true, true) }}
schedule: '{{ data_manager.sources.git.schedule | default("", true) }}'
sso:
enabled: {{ data_manager.sources.sso.enabled | default(true, true) }}
visible: {{ data_manager.sources.sso.visible | default(true, true) }}
schedule: '{{ data_manager.sources.sso.schedule | default("", true) }}'
urls:
{%- for u in data_manager.sources.git.urls | default([], true) %}
- {{ u }}
{%- endfor %}
jira:
enabled: {{ data_manager.sources.jira.enabled | default(true, true) }}
url: {{ data_manager.sources.jira.url | default('', true) }}
4 changes: 4 additions & 0 deletions src/cli/templates/dockerfiles/Dockerfile-data-manager
@@ -35,6 +35,10 @@ COPY pyproject.toml pyproject.toml
COPY weblists weblists
RUN pip install --upgrade pip && pip install .

# Chromium for Python Playwright (CERN SSO in Scrapy auth middleware).
RUN python -m playwright install-deps chromium \
&& python -m playwright install chromium

RUN chmod g+rx /root; chmod -R g+w /root/archi/src/interfaces

ARG APP_VERSION=unknown