Erik's manual scraping process

Tim Loderhose edited this page Nov 23, 2021 · 1 revision

I (Tim) wrote these (mostly chronological) notes as Erik was showcasing his process.


BankTrack bank profiles

  • Visit a BankTrack bank profile, e.g. banktrack.org/bank/bank_of_america
  • Open the Policies tab (below About)
  • Investment policies are all listed
    • they also appear in the Documents tab, since policies are often stored as documents
    • Policies are added here (in documents), tagged as 'csr policy'
    • When an old one is obsolete, it is tagged as 'out of date'
    • Also added to BT share (a virtual harddrive)
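
The tagging convention above could be modelled roughly as follows. This is only a sketch: the `PolicyDocument` class, its fields, and the `add_policy` helper are hypothetical names for illustration, not part of the BankTrack website, and matching old and new versions by bank and title is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date

# Tag names as quoted in the notes above.
CSR_POLICY = "csr policy"
OUT_OF_DATE = "out of date"


@dataclass
class PolicyDocument:
    """Hypothetical record for one policy document on a bank profile."""
    bank: str
    title: str
    published: date
    tags: set = field(default_factory=lambda: {CSR_POLICY})


def add_policy(documents, new_doc):
    """Append new_doc and tag older versions of the same policy as out of date.

    Assumption: two documents are versions of the same policy when bank and
    title match exactly. In practice this is a manual judgment.
    """
    for doc in documents:
        same_policy = doc.bank == new_doc.bank and doc.title == new_doc.title
        if same_policy and doc.published < new_doc.published:
            doc.tags.add(OUT_OF_DATE)
    documents.append(new_doc)
    return documents
```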

Searching and updating the profiles:

  • Bank of America

    • 60-70% of investment policies link to a CSR web page
      • otherwise 'unavailable' will be stated
    • Clicks on link (BoA)
    • CSR framework available as PDF (mostly PDF, sometimes Word, sometimes HTML)
      • already tracked
    • using google site search
      • google: human rights site:bankofamerica.com
        • finds pdf document, downloads to get document date
      • google: policy site:bankofamerica.com
        • ...
  • ANZ

    • clicks link in BankTrack investment profile
    • finds list of policy links
    • clicks all PDFs, finds dates
    • finds updated Energy policy
    • downloads to BT share
    • adds to banktrack website
    • marks old energy policy as 'out of date'
    • google: human rights site:anz.com
  • Bank of China

    • clicks on link from investment profile
    • ! link not found
    • goes to the bank's own website (English version), linked from BankTrack
    • Clicks About us
    • scrolls, finds nothing new
    • Clicks on Investor relations
    • 'they probably don't have any policies, and the ones listed are from the subsidiary' - Bank of China Hong Kong
    • google: policy site:boc.cn - nothing first page, doesn't check second
    • google: sustainability site:boc.cn - nothing first page, doesn't check second
    • Erik has local contacts in China to check local sites
      • but they didn't appear to have linked anything for the Bank of China
  • CaixaBank

    • clicks on BT link
    • principles linking to document already logged
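
The "google: <term> site:<domain>" searches in the walkthroughs above all follow one pattern, which can be sketched as below. The function names are hypothetical, and the default term list only collects the terms Erik used in these notes ("policy", "human rights", "sustainability"); it is not an official or exhaustive list.

```python
from urllib.parse import urlencode

# Search terms seen in the walkthroughs above; not an exhaustive list.
DEFAULT_TERMS = ("policy", "human rights", "sustainability")


def site_search_query(term, domain):
    """Build a query restricted to one domain, e.g. 'policy site:anz.com'."""
    return f"{term} site:{domain}"


def site_search_url(term, domain):
    """Build a Google search URL for the site-restricted query."""
    params = urlencode({"q": site_search_query(term, domain)})
    return "https://www.google.com/search?" + params


def queries_for_bank(domain, terms=DEFAULT_TERMS):
    """Return one site-restricted query per search term for a bank's domain."""
    return [site_search_query(term, domain) for term in terms]
```

For example, `queries_for_bank("boc.cn")` yields the two Bank of China queries listed above plus a "human rights" query.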

Bank Policies

  • BT website is divided into 'banks and climate', etc.
  • Banks are scored on a points-based scale
  • output of our tool would be reviewed by e.g. the climate team

'I do site search when desperate'

  • usually when no information found easily
  • when the bank website is not well structured or not informative about policies

What do you do when the link in the investment profile doesn't exist?

  • Answer: site search

Are document origins (URLs) stored?

  • Answer: no, documents are downloaded to BT share
  • if the document is HTML, the page is linked instead

You say "sometimes I do this" - there is no protocol?

  • Answer: not really, intuition is built, but search is not exhaustive
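
The storage rule from the answers above (HTML pages are linked, everything else is downloaded to BT share) could be expressed as a small helper. The function name and return strings are hypothetical, and deciding by the HTTP `Content-Type` header is an assumption; the notes do not say how the format is determined.

```python
def storage_action(content_type):
    """Decide how to record a policy document, given its Content-Type header.

    Assumption: the format is taken from the header value, e.g.
    'text/html; charset=utf-8' or 'application/pdf'.
    """
    # Drop parameters such as '; charset=utf-8' and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return "link to page"
    # PDF, Word, and other file formats are saved as files.
    return "download to BT share"
```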