Scraper & differ notes
- https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified
  - Check in which cases this works to tell whether a website has been updated at all (could be a simple flag to skip a website); see the conditional-request sketch after this list
- https://realpython.com/beautiful-soup-web-scraper-python/
  - Tutorial on web scraping in Python using Beautiful Soup
- https://www.selenium.dev/documentation/
  - For JS-heavy websites we might need to drive a real browser for scraping; Selenium allows this (see the sketch after this list)
  - Can we assume that all content we're scraping is static?
  - Could we run into trouble because of JavaScript?
    - I.e., some sites may require JS to even reach the content; we need to check this for all websites
- A bank has one big PDF with different types of policies, but later splits it into different documents. How do we manage this on our side?
- A policy is moved to a different location and we catch an exception during scraping. Do we alert BankTrack in this case?
- Banks have documents for different kinds of policies, e.g. an energy policy and a human rights policy. How do we match the old energy policy to the new energy policy for the diff check?
- Can we mimic a Google search with a set of keywords? Or is this out of scope?
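A minimal sketch of the `Last-Modified` check mentioned above, using a conditional request; the URL is a placeholder, and not every server sends this header, so we would need a fallback to a full scrape:

```python
# Minimal sketch of a Last-Modified check; not every server sends the
# header, so fall back to a full scrape when it is absent.
import requests

url = "https://example-bank.com/policy.pdf"  # placeholder

head = requests.head(url)
last_modified = head.headers.get("Last-Modified")

if last_modified:
    # Ask the server to answer 304 Not Modified if nothing changed since.
    response = requests.get(url, headers={"If-Modified-Since": last_modified})
    if response.status_code == 304:
        print("Unchanged - skip this website")
```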
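And a minimal Selenium sketch for JS-heavy pages, assuming Selenium 4+ with Chrome available locally; the rendered HTML could then be handed to Beautiful Soup just like a plain `requests` response:

```python
# Minimal sketch: render a JS-heavy page in headless Chrome and grab the
# resulting HTML. Assumes Selenium 4+, which manages the driver itself.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example-bank.com/policies")  # placeholder URL
    html = driver.page_source  # HTML after JavaScript has executed
finally:
    driver.quit()
```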
Using the `requests` and Beautiful Soup libraries you can parse the HTML page of a website. It is important to leave some time between requests (e.g. wait 1-2 seconds) so as not to overload the server.
```python
import requests
from bs4 import BeautifulSoup

url = "https://example-bank.com/policies"  # placeholder target page
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
```
To get a list of all the URLs on a web page:

```python
url_list = []
for link in soup.find_all("a", href=True):
    url_list.append(link.get("href"))
```
This can be expanded with some logic to filter out certain URLs, or to keep only URLs that lead to a PDF, as in the sketch below.
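A minimal sketch of such a filter, assuming PDF links can be recognised by a `.pdf` suffix (PDFs served from extension-less URLs would be missed) and reusing `url` and `url_list` from above:

```python
from urllib.parse import urljoin

pdf_urls = [
    urljoin(url, href)  # resolve relative links against the page URL
    for href in url_list
    if href.lower().endswith(".pdf")
]
```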
Google site search: for this you need the base URL of the bank's website and a search word. This returns a list of URLs, roughly what you would find on the first page of a Google search (the number of URLs it returns can be adjusted).
```python
from googlesearch import search

base_url = "example-bank.com"   # placeholder
search_word = "climate policy"  # placeholder

# search() yields result URLs lazily, so wrap it in list() to collect them
results = list(search(f"site:{base_url} {search_word}"))
```
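Note that `googlesearch` here is one of the unofficial Google-scraping packages on PyPI (e.g. `googlesearch-python`), not an official API; Google rate-limits automated queries, so this needs pauses between calls and may break without notice.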