Manual scraping of Top 60 banks

Tim Loderhose edited this page Dec 7, 2021 · 3 revisions

Table of Contents

  1. Citi
    1. Links
    2. Files
  2. Crédit Agricole
    1. Links
    2. Files
  3. China Construction Bank
    1. Links
    2. Files
  4. China CITIC Bank
    1. Links
    2. Files
  5. Commerzbank
    1. Links
    2. Files
  6. Bank of Montreal
    1. Links
    2. Files
  7. Barclays
    1. Links
    2. Files
  8. BBVA
    1. Links
    2. Files
  9. BNP Paribas
    1. Links
    2. Files
  10. Banco Santander
    1. Links
    2. Files
  11. General comments

Citi

Fairly standard HTML, no excessive JS. Definitely scrapable with bs4. Main document type: pdf

Under the sustainability page, the Policies/Evolution/Net-zero 'tabs' are not tabs at all, but load a new .htm view of the site.

Links

Citi uses relative links.

Links are discoverable in HTML with standard tags:

<a href="/citi/sustainability/data/Environmental-and-Social-Policy-Framework.pdf">Environmental and Social Policy Framework</a>
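As a rough sketch of what "discoverable with standard tags" means in practice, the snippet below collects every anchor whose href ends in .pdf. It uses the standard library's html.parser as a stand-in for bs4 so that it runs without extra installs; the names PdfLinkParser and extract_pdf_links are made up for this illustration.

```python
from html.parser import HTMLParser


class PdfLinkParser(HTMLParser):
    """Collect (href, link text) pairs for anchors whose href points at a PDF."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently being read, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Ignore any "#page=XX" fragment when testing the extension.
            if href and href.split("#")[0].lower().endswith(".pdf"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_pdf_links(html):
    """Return all (href, text) pairs for PDF links found in an HTML string."""
    parser = PdfLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

The hrefs come back exactly as written in the page, so relative links still need to be resolved against the page URL.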

Links are often suffixed with #page=XX, so we'll have to take care to remove those.
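A minimal sketch of that clean-up, assuming we know the URL of the page a link was found on: resolve the relative href against it and drop the fragment. The function name normalize_link is hypothetical, and the page number in the usage line is only an example.

```python
from urllib.parse import urldefrag, urljoin


def normalize_link(page_url, href):
    """Resolve href against the page it was found on and drop any #fragment."""
    absolute = urljoin(page_url, href)
    url, _fragment = urldefrag(absolute)
    return url


print(normalize_link(
    "https://www.citigroup.com/citi/sustainability/",
    "/citi/sustainability/data/Environmental-and-Social-Policy-Framework.pdf#page=12",
))
```

Normalising before storing also means the same pdf linked with two different #page values ends up under one URL.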

Outdated links on the main page, e.g. https://www.citigroup.com/citi/sustainability/ has a link that reads "Read our most recent ESG Report", but it links to the 2019 report, not the 2020 one (which is discoverable through https://www.citigroup.com/citi/investor/corporate_governance.html#Environmental-and-Social-Information).

Useful sites: https://www.citigroup.com/citi/investor/corporate_governance.html - Citi Policies

  • Be careful when scraping for pdf files, as there are e.g. CoCs (Codes of Conduct) in many languages on this site.
  • HTML includes note on when documents were created/last updated/revised

Files

Many pages here link to the same pdfs, so we should definitely check not to download the same document several times.
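One possible way to do that check, sketched under the assumption that URLs have already been normalised: remember every URL we requested, and additionally hash the downloaded bytes so that the same pdf served from two different URLs is still recognised. DownloadRegistry and its method names are invented for this example.

```python
import hashlib


class DownloadRegistry:
    """Remember which URLs and which file contents have already been stored."""

    def __init__(self):
        self._seen_urls = set()
        self._seen_hashes = {}  # sha256 hex digest -> first URL with that content

    def should_fetch(self, url):
        """Return False if this exact (already normalised) URL was requested before."""
        if url in self._seen_urls:
            return False
        self._seen_urls.add(url)
        return True

    def register_content(self, url, data):
        """Return the URL that first delivered identical bytes, or None if new."""
        digest = hashlib.sha256(data).hexdigest()
        first_url = self._seen_hashes.get(digest)
        if first_url is None:
            self._seen_hashes[digest] = url
        return first_url
```

The hash only catches byte-identical files; a re-exported pdf with the same text would still count as new.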

The ESP framework pdf has good pdf metadata we can use, specifically with created/modified at tags.
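The created/modified values live in the pdf's /CreationDate and /ModDate entries, which use the pdf date syntax (D:YYYYMMDDHHmmSS plus an optional timezone offset). The sketch below only covers turning such a string into a datetime; reading the raw value out of the file is left to whichever pdf library we end up using. parse_pdf_date is a hypothetical helper and the sample date in the usage line is made up.

```python
import re
from datetime import datetime, timedelta, timezone

_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Z])?(?P<tz_hour>\d{2})?'?(?P<tz_minute>\d{2})?'?$"
)


def parse_pdf_date(value):
    """Convert a pdf date string such as "D:20210311093000+01'00'" to a datetime.

    Returns None when the string does not follow the pdf date syntax.
    Missing month/day default to 1, missing time fields to 0.
    """
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    parts = match.groupdict()
    tzinfo = None  # stays naive when the string carries no timezone
    if parts["sign"] == "Z":
        tzinfo = timezone.utc
    elif parts["sign"] in ("+", "-"):
        offset = timedelta(hours=int(parts["tz_hour"] or 0),
                           minutes=int(parts["tz_minute"] or 0))
        tzinfo = timezone(-offset if parts["sign"] == "-" else offset)
    try:
        return datetime(
            int(parts["year"]), int(parts["month"] or 1), int(parts["day"] or 1),
            int(parts["hour"] or 0), int(parts["minute"] or 0),
            int(parts["second"] or 0), tzinfo=tzinfo,
        )
    except ValueError:  # e.g. month 13
        return None


print(parse_pdf_date("D:20210311093000+01'00'"))
```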

Crédit Agricole

https://www.credit-agricole.com/en/responsible-and-committed/our-csr-strategy-be-an-actor-of-a-sustainable-society/our-sector-policies

https://www.ca-cib.com/about-us/committed-and-responsible/our-sustainable-financing-policy

Standard HTML, but not very nicely structured. Nested lists and divs that include hidden navigation elements.

Main doctype: pdf

Ugly structure:

<div class="listing-bloc-actionnaire col-md-4 col-xs-12 pB-5 pL-0">
    <div class="pL-0 pB-15 pR-15 dividende-1k">
        <div class="border-4-all clearfix dividende-1k-content">
            <div class="CA-push-title text-center">
                <h2 class="bold padding-0"><span class="ezstring-field">Transportation</span></h2>
                <div class="pT-3 mT-10 same-height-height-dividende pR-10 pL-10 ezxmltext-field">
                    <ul>
                        <li><a href="/en/pdfPreview/173428" target="_blank" title="CSR - Aviation sector policies">Aviation</a></li>
                        <li><a href="/en/pdfPreview/173429" target="_blank" title="CSR - Maritime sector policies">Shipping</a></li>
                        <li><a href="/en/pdfPreview/173430" target="_blank" title="CSR - Automotive sector policies">Automotive</a></li>
                    </ul>
                </div>
            </div>
        </div>
    </div>
</div>

Links

Links on the main website don't include the .pdf file ending:

<a href="/en/pdfPreview/173428" target="_blank" title="CSR - Aviation sector policies">Aviation</a>
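Because these hrefs give no hint of the file type, a scraper cannot filter on ".pdf" here. A hedged sketch of an alternative: look at the response's Content-Type header and, failing that, at the first bytes of the body, since pdf files start with the marker %PDF-. Whether this server sets the header correctly has not been checked, and is_pdf_response is a name invented for this example.

```python
def is_pdf_response(content_type, first_bytes):
    """Decide whether a response is a pdf without relying on the URL.

    content_type: value of the Content-Type header, or None if absent.
    first_bytes: the first few bytes of the response body.
    """
    declared = (content_type or "").split(";")[0].strip().lower()
    if declared == "application/pdf":
        return True
    # pdf files begin with "%PDF-" followed by the version number;
    # leading whitespace is tolerated here.
    return first_bytes.lstrip().startswith(b"%PDF-")
```

Only the first handful of bytes is needed, so the check can run before the full download.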

On the ca-cib.com website, links look like this:

<li class="text-align-justify">Mining (<a href="/sites/default/files/2020-03/Politique_Sectorielle_Groupe_Mines_m%C3%A9taux_Mars_2020-EN.pdf" target="_blank">Metals and Mining</a>)</li>

and appear to link to the same, or outdated, sector policies as indexed on the main website, but with different document names.

Files

Pdf file links are named by integer, but the documents themselves have proper names. Those I opened were .doc or .docx documents exported to pdf, as visible in their metadata (and in the document footer).

China Construction Bank

No link to a proper site from BankTrack. Checking their website yields some possibly relevant links:

http://en.ccb.com/en/investorv3/greenbond/zzgg/list_1.html (Green, social and sustainability bond)

http://en.ccb.com/en/investorv3/greenbond/20210429_1619687957.html

http://en.ccb.com/en/investorv3/hottopic/list_1.html?ptId=10 - not yet in BankTrack, I believe.

It's all standard HTML that should be easy to scrape.

Links

<dd><a href ="/en/investor/20210429_1619687957/20210429171923851497.pdf" title="1.2020 CCB GSS Bond Annual Report ">1.2020 CCB GSS Bond Annual Report </a></dd>

Files

Standard pdf

China CITIC Bank

No proper CSR website. http://www.citicbank.com/about/introduction/socialresponsibility/2020/ links to CSR documents, but it's in Chinese so I'm not sure it's really relevant.

Links

http://www.citicbank.com/about/introduction/socialresponsibility/2020/ has a separate tab per year that needs to be clicked, so keeping this up to date would mean adding each new year when it becomes available. In general, the code here seems very 'custom' (and bad).

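The actual link to the pdf is inside inline JS (not visible here). The two link texts in the markup read "China CITIC Bank 2020 Sustainability Report", Chinese version (中文版) and English version (英文版) respectively: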

</style>
<ul class="gz_li">

    <a href="https://www.citicbank.com/about/introduction/socialresponsibility/202106/P020210611558203406261.pdf" target="_blank""><li><p class="gz_p1">2020-12-31</p><p class="gz_p2"> 中信银行股份2020年度可持续发展报告-中文版</p></li></a>

    <a href="https://www.citicbank.com/about/introduction/socialresponsibility/202106/P020210611557920944720.pdf" target="_blank""><li><p class="gz_p1">2020-12-31</p><p class="gz_p2"> 中信银行股份2020年度可持续发展报告-英文版</p></li></a>

</ul>
<style>
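With markup this hand-rolled, and with the link also living in inline JS, a tag-based parser may not be reliable. A rough fallback, sketched here, is to run a regex for absolute .pdf URLs over the raw page source, which works the same whether the URL sits in an href or in a script. PDF_URL and find_pdf_urls are names made up for this example, and relative URLs would not be caught.

```python
import re

# Absolute http(s) URLs ending in .pdf; stops at quotes, whitespace and brackets.
PDF_URL = re.compile(r"https?://[^\s\"'<>()]+?\.pdf\b", re.IGNORECASE)


def find_pdf_urls(page_source):
    """Return the distinct pdf URLs in page_source, in order of first appearance."""
    seen = []
    for url in PDF_URL.findall(page_source):
        if url not in seen:
            seen.append(url)
    return seen
```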

Files

https://www.citicbank.com/about/introduction/socialresponsibility/202106/P020210611558203406261.pdf

standard pdf, working metadata.

Commerzbank

https://www.commerzbank.de/en/nachhaltigkeit/nachhaltigkeitsstandards/positionen_und_richtlinien/positionen_und_richtlinien.html

https://www.commerzbank.de/en/nachhaltigkeit/index.html

Basic website that is scrapable.

Links

Links are descriptive and end in .pdf, and are followed by a fileinfo span that includes the file size. Links are embedded in the main body text.

<a href="/media/nachhaltigkeit/nfe/Commerzbank_NFR_2020.pdf" title="commerzbank.com/NFR2019" alt="Combined separate non-financial report 2020 (PDF, 249)" target="_blank" class="_blank">Combined separate non-financial report 2020</a><span class="fileinfo"> (<acronym xml:lang="en" title="Portable Document Format">PDF</acronym>, 249 <acronym xml:lang="en" title="Kilobyte">kB</acronym>)</span>
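The advertised file size could serve as a cheap hint that a document changed without downloading it. A small sketch, assuming the tags have already been stripped so we are left with text like "(PDF, 249 kB)", and assuming decimal units (1 kB = 1000 bytes), which the site may not follow exactly. advertised_size_bytes is a hypothetical helper.

```python
import re

_SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*(kB|MB|GB)", re.IGNORECASE)
_FACTOR = {"kb": 1_000, "mb": 1_000_000, "gb": 1_000_000_000}


def advertised_size_bytes(text):
    """Return the file size mentioned in text, in bytes, or None if there is none."""
    match = _SIZE.search(text)
    if match is None:
        return None
    number, unit = match.groups()
    return round(float(number) * _FACTOR[unit.lower()])
```

The same pattern also matches sizes written inside the link text itself, such as "(148KB)" or "(1.2 MB)".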

Files

Pdf exported from Word with adequate metadata.

Bank of Montreal

BankTrack linked page - https://our-impact.bmo.com/?ecid=corpresp

Reports page accessible through:

<a href="/reports/" title="Reports" class="tl-header-mega"><span>Reports</span></a>

Links

PDF links end with .pdf and are discoverable through href.

<a id="" class="link" alt="Download 2020 Climate Report" target="_blank" title="2020 Climate Report PDF" 
href="https://our-impact.bmo.com/wp-content/uploads/2021/03/BMO-2020-Climate-Report-2.pdf">Download Report (1.2 MB)</a>

Files

Good metadata - title, created at, modified at

Barclays

BankTrack linked page - https://home.barclays/society/esg-resource-hub/statements-and-policy-positions/

PDF report files are available directly on the linked page.

Links

PDF links end with .pdf and are discoverable through href.

<a title="Opens in a new window" 
href="/content/dam/home-barclays/documents/citizenship/our-reporting-and-policy-positions/Forestry-and-Agricultural-Commodities-Statement.pdf" 
target="_blank">Forestry and Agricultural Commodities (148KB)</a>

Found inside a pretty nested structure though.

Files

Created at and modified at metadata is available, but titles are not very informative, e.g. "Barclays Report".

BBVA

BankTrack policy link - https://shareholdersandinvestors.bbva.com/sustainability-and-responsible-banking/

Report file page is accessible through:

<a href="https://shareholdersandinvestors.bbva.com/sustainability-and-responsible-banking/presentaciones-e-informes/">Presentations and reports</a>

Once on that page, #2020 is appended to the URL.

Links

PDF links end with .pdf and are discoverable through href.

<a href="https://shareholdersandinvestors.bbva.com/wp-content/uploads/2021/06/TCFD-Report-Dec20_Eng.pdf" target="_blank" 
class="cta-primary blue">Download PDF</a>

Struggled to find other policies that BankTrack already stores, even when using the bank's own site search feature. Had to use Google site search instead.

Files

Created at and modified at metadata is available, but not title.

BNP Paribas

BankTrack linked page - https://group.bnpparibas/en/group/bnp-paribas-a-leader-in-sustainable-finance

A lot of policy descriptions on the page BankTrack links to.

The sector policy reports are available through

<a href="https://group.bnpparibas/en/financing-investment-policies" target="_blank" 
class="greenbox-link">Our sector policies</a>

Struggled to find other policies that BankTrack already stores without using Google site search.

Links

PDF links end with .pdf and are discoverable through href.

<a class="greenbox-link" 
href="https://group.bnpparibas/uploads/file/csr_sector_policy_defence_final_2017_en_v_3.pdf" 
target="_blank">Our Defence sector policy</a>

Files

Created at and modified at metadata is available, but not title. At least some are doc files converted to PDF.

Banco Santander

BankTrack linked page - https://www.santander.com/es/nuestro-compromiso/politicas

Original page is in Spanish. The English version is accessible at https://www.santander.com/en/nuestro-compromiso/politicas

A number of PDF report links are available on this page already.

Links

PDF links end with .pdf and are discoverable through href.

<a title="Abre en ventana nueva" 
href="/content/dam/santander-com/es/contenido-paginas/nuestro-compromiso/pol%C3%ADticas/do-Pol%C3%ADtica%20de%20derechos%20humanos-es.pdf"    
target="_blank">Política de derechos humanos</a>
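hrefs like this one are percent-encoded, so the raw last path segment is hard to read. A minimal sketch for deriving a readable file name from such a link; readable_filename is a hypothetical helper, and the decoded name may still need sanitising before it is used on disk.

```python
from posixpath import basename
from urllib.parse import unquote, urlsplit


def readable_filename(url):
    """Return the decoded last path segment of url."""
    path = urlsplit(url).path
    # Take the last segment first, then decode, so an encoded "/" cannot split it.
    return unquote(basename(path))


print(readable_filename(
    "/content/dam/santander-com/es/contenido-paginas/nuestro-compromiso/"
    "pol%C3%ADticas/do-Pol%C3%ADtica%20de%20derechos%20humanos-es.pdf"
))
```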

Files

Good metadata - title, created at, modified at. However, file names and titles are in Spanish.

General comments

  • Should we follow only internal links? Maybe check a regex to see if it's a related website?

  • Some banks have many links - we should probably set a threshold so that we don't scrape too many pages, which would inevitably produce many entries in the differ.

  • may need per-site regex to find documents

    • get main doctype and the way its links are structured
  • Files (and different versions thereof) can be linked to from many locations, and link URLs may be different - we may want to have additional checks to see if a document is supposed to be the same, or group/cluster documents based on metadata.

  • It seems that each bank has one page containing report files: either the page BankTrack links to, or a page that can be found by matching for “report”, “policy” or “statement”. However, these pages do not seem to have all the reports that BankTrack stores.
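To make a few of these points concrete, here is a hedged sketch of a page filter that combines them: follow only links on the bank's own domains, keep only those whose URL or link text mentions one of the keywords, and stop at a fixed cap. The names is_internal and select_pages, the keyword list and the default cap of 50 are all assumptions for illustration, not decisions we have made.

```python
import re
from urllib.parse import urlsplit

# English keywords only; a per-site regex would be needed for e.g. Spanish pages.
KEYWORDS = re.compile(r"report|policy|policies|statement", re.IGNORECASE)


def is_internal(url, allowed_domains):
    """True if url's host equals, or is a subdomain of, one of allowed_domains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def select_pages(candidates, allowed_domains, max_pages=50):
    """Pick the pages to crawl from (url, link_text) pairs.

    Keeps internal links whose URL or text mentions a keyword, drops repeats,
    and stops after max_pages so the differ is not flooded.
    """
    selected = []
    for url, text in candidates:
        if len(selected) >= max_pages:
            break
        if url in selected or not is_internal(url, allowed_domains):
            continue
        if KEYWORDS.search(url) or KEYWORDS.search(text):
            selected.append(url)
    return selected
```

For example, select_pages(links, ["bmo.com"]) would keep https://our-impact.bmo.com/reports/ and drop links to other domains.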