Scraper & differ notes

Notes

Static vs. dynamic sites

  • Can we assume that all content we're scraping is static?
  • Could we run into trouble because of JavaScript?
    • i.e., some sites may require JS to even reach the content - we need to check this for every website (a rough check is sketched below)
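
One rough way to check this (a sketch, with a hypothetical URL and phrase; not a definitive test): fetch the page with plain requests and look for a phrase we know appears on the rendered page. If it is missing from the raw HTML, the site probably needs JavaScript.

import requests

def needs_javascript(url, known_phrase):
    # If a phrase visible on the rendered page is absent from the raw
    # HTML, the site likely injects its content with JavaScript.
    html = requests.get(url, timeout=10).text
    return known_phrase not in html

# Hypothetical example values:
# needs_javascript("https://example-bank.com/policies", "Energy Policy")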

Potential edge cases to keep in mind

  • A bank has one big PDF containing different types of policies, but later splits it into separate documents. How do we manage this on our side?
  • A policy is moved to a different location and we catch an exception during scraping. Do we alert Bank Track in this case? (A sketch of catching such failures follows this list.)
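
A minimal sketch of catching such failures so they can be flagged (the function name is hypothetical, and the print is a stand-in for whatever alerting we decide on):

import requests

def fetch_policy(url):
    # Return the document bytes, or None so the caller can flag the URL
    # (e.g. for a Bank Track alert) instead of crashing the scrape run.
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raises on 404 etc., e.g. a moved policy
    except requests.RequestException as err:
        print(f"Could not fetch {url}: {err}")  # stand-in for real alerting
        return None
    return response.content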

Thinking out loud

  • Banks have documents for different kinds of policies, e.g. an energy policy and a human rights policy. So how do we match the old energy policy to the new energy policy for a diff check? (See the diff sketch after this list.)
  • Can we mimic a Google search with a set of keywords? Or is this out of scope?
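
Once two versions of the same policy are paired up, the diff itself is straightforward; a minimal sketch using Python's standard difflib (the pairing step is the open question above):

import difflib

def policy_diff(old_text, new_text):
    # Unified diff between two versions of the same policy document.
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="old_policy",
        tofile="new_policy",
    )
    return "".join(diff)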

Web scraping with Python

Using the requests and Beautiful Soup libraries you can parse the HTML page of a website. It is important to leave some time between requests, e.g. 1-2 seconds, so as not to overload the server.

import requests
from bs4 import BeautifulSoup

url = "https://example.com/policies"  # placeholder; the page to scrape

page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

To get a list of all the URLs on a web page:

url_list = []
for link in soup.find_all("a", href=True):  # only anchor tags that have an href
    url_list.append(link["href"])

This can be expanded with logic to filter out certain URLs or to keep only URLs that lead to a PDF, as sketched below.
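
For example, continuing from the soup and url above, a sketch that keeps only PDF links (the .pdf suffix check is a heuristic; some PDFs sit behind URLs without the extension):

from urllib.parse import urljoin

pdf_urls = [
    urljoin(url, link["href"])  # resolve relative links against the page URL
    for link in soup.find_all("a", href=True)
    if link["href"].lower().endswith(".pdf")
]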

Google site search: for this you need the base URL of the bank's website and a search word. This returns a list of URLs - the same URLs you would find on the first page of a Google search - and the number of URLs it returns can be adjusted.

from googlesearch import search

# base_url and search_word as chosen for the bank in question.
results = list(search(f"site:{base_url} {search_word}"))
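
Note that the parameter controlling the number of results depends on which googlesearch distribution is installed (e.g. num_results= in googlesearch-python versus stop= in the older google package), so check the installed version before relying on either.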