Scraper & differ notes
- https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified
  - Check in which cases this works to tell whether a website has been updated at all (could be a simple flag to skip a website); see the conditional-request sketch after this list
- https://realpython.com/beautiful-soup-web-scraper-python/
  - Tutorial on web scraping in Python using Beautiful Soup
- https://www.selenium.dev/documentation/
  - For JS-heavy websites we might need to drive a real browser for scraping; Selenium allows this (see the sketch after this list)
  - Can we assume that all content we're scraping is static?
  - Could we run into trouble because of JavaScript?
    - I.e., some sites may require JS to even reach the content; we need to check this for all websites
- A bank has one big PDF with different types of policies, but later splits it into different documents. How do we manage this on our side?
- A policy is moved to a different location and we catch an exception during scraping. Do we alert BankTrack in this case?
- Banks have documents for different kinds of policies, e.g. an energy policy and a human rights policy. How do we match the old energy policy to the new energy policy for the diff check?
- Can we mimic a Google search with a set of keywords? Or is this out of scope?
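A minimal sketch of the `Last-Modified` check mentioned above, using a conditional request; the URL is a placeholder, and not every server sends this header, so we would need a fallback to a full scrape:

```python
# Minimal sketch of a Last-Modified check; not every server sends the
# header, so fall back to a full scrape when it is absent.
import requests

url = "https://example-bank.com/policy.pdf"  # placeholder

head = requests.head(url)
last_modified = head.headers.get("Last-Modified")

if last_modified:
    # Ask the server to answer 304 Not Modified if nothing changed since.
    response = requests.get(url, headers={"If-Modified-Since": last_modified})
    if response.status_code == 304:
        print("Unchanged - skip this website")
```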
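And a minimal Selenium sketch for JS-heavy pages, assuming Selenium 4+ with Chrome available locally; the rendered HTML could then be handed to Beautiful Soup just like a plain `requests` response:

```python
# Minimal sketch: render a JS-heavy page in headless Chrome and grab the
# resulting HTML. Assumes Selenium 4+, which manages the driver itself.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example-bank.com/policies")  # placeholder URL
    html = driver.page_source  # HTML after JavaScript has executed
finally:
    driver.quit()
```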
Using the `requests` and Beautiful Soup libraries you can parse the HTML page of a website. It is important to leave some time between requests (e.g. wait 1-2 seconds) so as not to overload the server.
```python
import requests
from bs4 import BeautifulSoup

url = "https://example-bank.com/policies"  # placeholder target page
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
```
To get a list of all the URLs on a web page:

```python
url_list = []
for link in soup.find_all("a", href=True):
    url_list.append(link.get("href"))
```
This can be expanded with some logic to filter out certain URLs, or to keep only URLs that lead to a PDF, as in the sketch below.
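A minimal sketch of such a filter, assuming PDF links can be recognised by a `.pdf` suffix (PDFs served from extension-less URLs would be missed) and reusing `url` and `url_list` from above:

```python
from urllib.parse import urljoin

pdf_urls = [
    urljoin(url, href)  # resolve relative links against the page URL
    for href in url_list
    if href.lower().endswith(".pdf")
]
```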
Google site search: for this you need the base URL of the bank's website and a search word. This returns a list of URLs, roughly what you would find on the first page of a Google search (the number of URLs it returns can be adjusted).
```python
from googlesearch import search

base_url = "example-bank.com"   # placeholder
search_word = "climate policy"  # placeholder

# search() yields result URLs lazily, so wrap it in list() to collect them
results = list(search(f"site:{base_url} {search_word}"))
```
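Note that `googlesearch` here is one of the unofficial Google-scraping packages on PyPI (e.g. `googlesearch-python`), not an official API; Google rate-limits automated queries, so this needs pauses between calls and may break without notice.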