[Ref] Integrate Scrapy-based scrapers into Archi interfaces#547
nausikt wants to merge 55 commits into archi-physics:dev
Conversation
…C via parsers, introduce extension points parse_item, parse_follow_links, default implementation, toscrape and twiki example.
…ble crawler args.
…both toscrape, twiki examples.
… for toscrape, twiki.
…-world twiki crawling contracts with safest default values.
…orks in HeavyIon use-cases.
…ace & instantiations to GitManager and Scrapy's ScraperManager.
…ers, wired to legacy interface.
…s, refactor interface, GitManager works, link, twiki public + sso sources example.
… if not main thread; bug: scraper_manager not aware of enabled flag.
…as doms convert to markitdown later.
… vectorstore manager.
…sible; should refactor later to have structured/unstructured redaction handled more separately.
Note that we still lack the … For now, we can work around this by testing MIT sources and CERN sources in a mutually exclusive manner.
Hi @nausikt, thanks for this PR! Looking into it and testing now -- one thing I noticed right away: please add any new packages to the …
Oh dear, let me fix it now!
```yaml
urls:
  - https://ppc.mit.edu/news/
max_depth: 2
max_pages: 100
delay: 10
markitdown: true
input_lists:
  - examples/deployments/basic-scraping/miscellanea.list
```
So you can either pass urls directly in the config, or via a .list file with urls as before? For the latter case, the file can contain both public and sso URLs, without any need for a prefix, right?
Exactly, Pietro! That holds for all scrapers (spiders) by design.
```yaml
git:
  urls:
    - https://github.com/dmwm/CRABServer
    - https://github.com/dmwm/CRABClient
```
Unlike `web.links`, do `web.git` and `web.twiki` only take the urls directly in the config? If you could also support an `input_list` here, it would be nice for people who want to keep it separate.
Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in the same list as other links, with the twiki options under `web.links.twiki`? To support different delay times more easily, maybe?
Comment: I think it's better to have, e.g., links.list and git.list, as it would be in this case, than how we had it before with the prefixes, so separate input lists is good.
> unlike `web.links`, `web.git` and `web.twiki` only take the urls directly in the config? if you can also support an `input_list` here, would be nice for people who want to keep it separate.
For `web.git`, yes, for now it only takes urls. Roger that, will support `input_list` there!
`web.links`, `web.twiki` and `web.discourse` support both `urls` and `input_list`.
> Also, is there a reason twiki takes a separate list? I understand the specific configuration options it needs, but why can't the twiki urls live in the same list as other links and twiki options would be under `web.links.twiki`? to support different delay times more easily maybe?
@pmlugato Got your point! Agreed that the UX can be better without a separate list; will converge! 🫡
For clarity: only one `web.links.input_lists` as the main portal, with site/scraper-specific asides under the `web.links.<site|spider>` pattern.
```yaml
web:
  links:
    ### Global default link configs go here, implicitly set.
    max_depth: 2
    max_pages: 100
    delay: 10
    markitdown: true
    input_lists:  # Only one portal; we pour every link in here, as long as it fits the scraper's nature.
      - examples/deployments/basic-scraping/miscellanea.list
    ### Site/spider-specific (non-list) configuration goes below.
    # <spider/site>:
    twiki:
      delay: 60  # <---- overrides global
      deny: ....
      allow: ....
      # .... w/o list
    discourse:
      category_paths:
        - ....
      delay: 10
      keywords: ...
    indico:  # <--- the upcoming IndicoSpider will live at this level as well.
      keywords:
```
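To make the override semantics concrete, here is a minimal sketch (the helper name and key set are assumptions, not Archi's actual code) of how a spider's effective settings could be resolved by overlaying its own section on the global `web.links` defaults, so e.g. twiki's `delay: 60` wins over the global `delay: 10`:

```python
# Assumed set of global, non-spider keys under web.links.
GLOBAL_KEYS = {"max_depth", "max_pages", "delay", "markitdown"}

def effective_settings(links_cfg, spider):
    """Overlay a spider's section on the global link defaults."""
    base = {k: v for k, v in links_cfg.items() if k in GLOBAL_KEYS}
    base.update(links_cfg.get(spider, {}))
    return base

links_cfg = {"delay": 10, "max_depth": 2, "markitdown": True,
             "twiki": {"delay": 60}}
print(effective_settings(links_cfg, "twiki"))
# {'delay': 60, 'max_depth': 2, 'markitdown': True}
```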
Then, behind the scenes, Archi's scraper_manager is automatically aware of each Spider and its config via a newly introduced registry instead.
```python
DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": TWikiSpider,
    "cms-talk.web.cern.ch": DiscourseSpider,
    "indico.cern.ch": IndicoSpider,
}
```

Is this what you had in mind? But this registry has to be configured by users somewhere anyhow 🤔
Otherwise, how can scraper_manager resolve a link and know its source_kind from a flat list of urls? But I don't like the prefix approach! IMO it deteriorates UX.
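One possible answer, sketched here with assumed names (string spider names instead of the real classes): map each URL's hostname through the registry and fall back to a generic link spider, so no `sso-`/`git-` prefixes are needed in the flat url list:

```python
from urllib.parse import urlparse

# Hypothetical registry; the real one maps domains to Spider classes.
DOMAIN_SPIDER_REGISTRY = {
    "twiki.cern.ch": "TWikiSpider",
    "cms-talk.web.cern.ch": "DiscourseSpider",
    "indico.cern.ch": "IndicoSpider",
}

def resolve_spider(url, default="LinkSpider"):
    """Pick a spider for a URL by its hostname, defaulting to the generic one."""
    host = urlparse(url).hostname or ""
    return DOMAIN_SPIDER_REGISTRY.get(host, default)

print(resolve_spider("https://twiki.cern.ch/twiki/bin/view/CMS/WebHome"))  # TWikiSpider
print(resolve_spider("https://ppc.mit.edu/news/"))  # LinkSpider
```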
How about just adding a domain key for scraper_manager to pick up as the registry!
```yaml
links:
  input_lists:
    - ....
  twiki:
    domain: "twiki.cern.ch"
    delay: 60  # <---- overrides global
    deny: ....
    allow: ....
    # .... w/o list
  discourse:
    domain: "cms-talk.web.cern.ch"
    category_paths:
      - ....
    delay: 10
    keywords: ...
  indico:  # <--- the upcoming IndicoSpider will live at this level as well.
    domain: "indico.cern.ch"
    keywords:
```
This way we give users the best UX: spider resolution is transparent and has the least footprint in the user's config, with probably only one caveat[?] for now: we can't have two twiki/discourse/indico source instances at the same time.
```yaml
indico:
  domain: ["indico.cern.ch", "indico.mit.edu"]
  keywords: ...
```
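The domain-key idea can be sketched as follows (a hypothetical helper, not Archi's actual code; the static-key set is an assumption): scraper_manager walks the `web.links` sub-sections and builds the registry from each section's `domain`, which also makes the one-instance-per-spider caveat visible, since each section contributes exactly one domain entry:

```python
# Assumed global/static keys that are not spider sections.
STATIC_KEYS = {"max_depth", "max_pages", "delay", "markitdown", "input_lists", "urls"}

def build_domain_registry(links_cfg):
    """Map each spider section's domain to the section name."""
    registry = {}
    for name, sub in links_cfg.items():
        if name in STATIC_KEYS or not isinstance(sub, dict):
            continue
        domain = sub.get("domain")
        if domain:
            registry[domain] = name
    return registry

links_cfg = {
    "input_lists": ["miscellanea.list"],
    "delay": 10,
    "twiki": {"domain": "twiki.cern.ch", "delay": 60},
    "discourse": {"domain": "cms-talk.web.cern.ch", "delay": 10},
}
print(build_domain_registry(links_cfg))
# {'twiki.cern.ch': 'twiki', 'cms-talk.web.cern.ch': 'discourse'}
```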
> Comment: I think it's better to have, e.g., `links.list` and `git.list`, as it would be in this case, than how we had it before with the prefixes, so separate input lists is good.
Ack for the comment.

BTW, about naming:

- Would you prefer me to rename `git.input_lists` -> `git.list` as well? Or avoid the word `list` and keep it the same?
- Would you like me to completely drop `git.urls` and `web.links.urls` and just switch to the `input_lists` style, to avoid confusing users and preserve only one standard way?
- Although, personally, `web.urls` is fairly convenient for me to peek at/edit everything in Archi at a glance in the same `config.yaml`, but that may just be better UX for debugging things.
```python
if not isinstance(sources_section, dict):
    continue
web = sources_section.get("web", {}) or {}
if not isinstance(web, dict):
    continue
for spider_key, sub in web.items():
    if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS:
        continue
    if not isinstance(sub, dict):
        continue
    wlists = sub.get("input_lists") or []
    if isinstance(wlists, list):
        collected.extend(wlists)
```
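For reference, the fragment above can be exercised in isolation; this is a self-contained sketch (the contents of `_WEB_TOP_LEVEL_STATIC_KEYS` are an assumption here) that gathers `input_lists` from every web spider sub-section while skipping the shared top-level keys:

```python
# Assumed static keys shared at the web.links level (not spider sections).
_WEB_TOP_LEVEL_STATIC_KEYS = {"urls", "max_depth", "max_pages", "delay",
                              "markitdown", "input_lists"}

def collect_input_lists(sources_sections):
    """Collect input_lists paths from all web spider sub-sections."""
    collected = []
    for sources_section in sources_sections:
        if not isinstance(sources_section, dict):
            continue
        web = sources_section.get("web", {}) or {}
        if not isinstance(web, dict):
            continue
        for spider_key, sub in web.items():
            if spider_key in _WEB_TOP_LEVEL_STATIC_KEYS:
                continue
            if not isinstance(sub, dict):
                continue
            wlists = sub.get("input_lists") or []
            if isinstance(wlists, list):
                collected.extend(wlists)
    return collected

sources = [{"web": {"delay": 10,
                    "twiki": {"input_lists": ["twiki.list"]},
                    "git": {"input_lists": ["git.list"]}}}]
print(collect_input_lists(sources))  # ['twiki.list', 'git.list']
```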
@pmlugato `input_lists` are understood by all scrapers (Spiders) here, and no prefix like `sso-`, `eos-`, or `git-*` is required in any urls.
For SSO-protected urls, though, we must explicitly provide the right `auth_provider_name: cern_sso|....` ourselves, and the SSO-protected url has to come first; by design, though, this should not be a caveat: URLs should work in any order!
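The interim ordering requirement could be enforced with a tiny helper like the following (a hypothetical sketch, not Archi's API; the SSO domain list is an assumption) that simply moves SSO-protected seeds to the front so authentication is triggered before public pages are crawled:

```python
def order_seeds(urls, sso_domains=("twiki.cern.ch",)):
    """Return urls with SSO-protected seeds first, preserving relative order."""
    sso = [u for u in urls if any(d in u for d in sso_domains)]
    public = [u for u in urls if u not in sso]
    return sso + public

seeds = ["https://ppc.mit.edu/news/",
         "https://twiki.cern.ch/twiki/bin/view/CMS/WebHome"]
print(order_seeds(seeds))
```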
@nausikt thanks a lot for this PR, it looks really nice! I have tested it with a few different source types and configurations and everything seems to be working smoothly, which is great! Thanks a lot also for the active iteration offline :) I would be happy to merge this soon into dev; maybe we can have one more person look at it, but that can also be done after the fact... One request before doing so: if you could write some nice documentation about all of this in the … Thanks a lot again for all the hard work!
@pmlugato Thanks to you guys for having me on board! One last thing: I've summarized all my micro-responses for you to finalize here.

```yaml
web:
  links:
    ### Global default link configs go here, implicitly set.
    max_depth: 2
    max_pages: 100
    delay: 10
    markitdown: true
    # Only one input_lists portal; we pour every link in here.
    input_lists:
      - examples/deployments/basic-scraping/miscellanea.list
    ### Site/spider-specific (non-list) configuration goes below.
    # <spider/site>:
    twiki:
      domain: "twiki.cern.ch"
      delay: 60  # <---- overrides global
      deny: ....
      allow: ....
      # .... w/o list
    discourse:
      domain: "cms-talk.web.cern.ch"
      category_paths:
        - ....
      delay: 10
      keywords: ...
    indico:  # <--- the upcoming IndicoSpider will live at this level as well.
      domain: "indico.cern.ch"
      keywords:
```
Agreed with what we discussed offline! Converging...
Resolves #546, #464
There's only one culprit for SSO Playwright + Chromium to work as expected; please beware of ...
End-to-End Integrated Archi Infrastructure Test
The main caveat is when seed URLs mix SSO-protected and public pages, such as TWiki.
In those cases, please list the SSO-protected seed URLs first, followed by the remaining public/SSO-mixed URLs.
Please feel free to adjust `example/deployment/basic-scraping/config.yaml` as you see fit. Currently, it might take a very long time to pass through every web source, e.g. `dmwm/CRABServer` and `DMWM/CRABClient`.

Expected Result


[1] All comprehensive sources are ingested, which might take hours to finish.
[2] Also, most of our docs should be in anonymized, Markdown format right away.
P.S. Anonymized as well as the NLP module and our heuristic-based (hard-coded) per-source name-replacement patterns allow.
Standalone Scrapy Test
More details are coming soon...