Commit 217b654

Remove ^ from Regex for Wiki links
The links being scraped are now absolute (full) URLs, so the rule should look for links that contain `/wiki/` rather than links that start with `/wiki/`. Example: https://en.wikipedia.org/wiki/Benevolent_dictator_for_life rather than /wiki/Benevolent_dictator_for_life
1 parent 80c343b commit 217b654

1 file changed, +1 -1 lines changed

Chapter05_Scrapy/wikiSpider/wikiSpider/articlesMoreRules.py

@@ -6,7 +6,7 @@ class ArticleSpider(CrawlSpider):
     allowed_domains = ['wikipedia.org']
     start_urls = ['https://en.wikipedia.org/wiki/Benevolent_dictator_for_life']
     rules = [
-        Rule(LinkExtractor(allow='^(/wiki/)((?!:).)*$'), callback='parse_items', follow=True, cb_kwargs={'is_article': True}),
+        Rule(LinkExtractor(allow='(/wiki/)((?!:).)*$'), callback='parse_items', follow=True, cb_kwargs={'is_article': True}),
         Rule(LinkExtractor(allow='.*'), callback='parse_items', cb_kwargs={'is_article': False})
     ]
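
The effect of dropping the `^` anchor can be checked outside Scrapy with the standard `re` module. This is a minimal sketch, not part of the commit: it assumes that `LinkExtractor` tests each `allow` pattern with search semantics (a match anywhere in the URL), and the helper `is_article_link` and the `Category:Python` URL are hypothetical names introduced only for illustration.

```python
import re

# Patterns copied from the changed line: before and after the commit.
OLD_PATTERN = re.compile(r'^(/wiki/)((?!:).)*$')
NEW_PATTERN = re.compile(r'(/wiki/)((?!:).)*$')


def is_article_link(pattern, url):
    """Hypothetical helper: True if the pattern is found anywhere in the URL.

    Uses re.search, on the assumption that this mirrors how the
    `allow` argument is applied to extracted links.
    """
    return pattern.search(url) is not None


# URLs from the commit message, plus one hypothetical namespaced page.
absolute_url = 'https://en.wikipedia.org/wiki/Benevolent_dictator_for_life'
relative_url = '/wiki/Benevolent_dictator_for_life'
namespaced_url = 'https://en.wikipedia.org/wiki/Category:Python'

for name, pattern in (('old', OLD_PATTERN), ('new', NEW_PATTERN)):
    for url in (absolute_url, relative_url, namespaced_url):
        print(name, is_article_link(pattern, url), url)
```

Under that assumption, the old pattern rejects the absolute URL because the string starts with `https`, not `/wiki/`, while the new pattern accepts it. Both patterns still reject URLs with a colon after `/wiki/`, since `((?!:).)*$` requires every remaining character up to the end of the string to be something other than `:`.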
