Manual scraping of Top 60 banks
- Citi
- Crédit Agricole
- China Construction Bank
- China CITIC Bank
- Commerzbank
- Bank of Montreal
- Links
- Files
- Barclays
- Links
- Files
- BBVA
- Links
- Files
- BNP Paribas
- Links
- Files
- Banco Santander
- Links
- Files
- General comments
Fairly standard HTML, no excessive JS. Definitely scrapable with bs4. Main document type: pdf
Under the sustainability page, the Policies/Evolution/Net-zero 'tabs' are not tabs at all, but load a new .htm view of the site.
Citi uses relative links.
Links are discoverable in HTML with standard tags:
<a href="/citi/sustainability/data/Environmental-and-Social-Policy-Framework.pdf">Environmental and Social Policy Framework</a>
Links are often suffixed with #page=XX, so we'll have to take care to remove those.
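A minimal sketch of how the relative links and the #page=XX suffix could be handled together, using only the standard library. The function name `normalize_link` is hypothetical, and the `#page=12` suffix in the example is made up for illustration.

```python
from urllib.parse import urljoin, urldefrag


def normalize_link(page_url, href):
    """Resolve a (possibly relative) href against its page and drop any #fragment."""
    absolute = urljoin(page_url, href)
    clean, _fragment = urldefrag(absolute)
    return clean


# Example with the Citi link from above plus an illustrative #page suffix.
print(normalize_link(
    "https://www.citigroup.com/citi/sustainability/",
    "/citi/sustainability/data/Environmental-and-Social-Policy-Framework.pdf#page=12",
))
```

Normalising before storing a URL also means the same pdf linked with different page anchors ends up as one entry.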
Outdated links on the main page, e.g. https://www.citigroup.com/citi/sustainability/ has a link that reads "Read our most recent ESG Report", but it points to the 2019 report, not the 2020 one (which is discoverable through https://www.citigroup.com/citi/investor/corporate_governance.html#Environmental-and-Social-Information)
Useful sites: https://www.citigroup.com/citi/investor/corporate_governance.html - Citi Policies
- Be careful when scraping for pdf files: there are e.g. CoCs in many languages on this site.
- HTML includes note on when documents were created/last updated/revised
Many pages here link to the same pdfs, so we should definitely check not to download the same document several times.
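One possible way to avoid repeat downloads is to track both the URLs already fetched and a hash of the downloaded bytes, so the same pdf reached through a different URL is also caught. This is only a sketch; `DownloadRegistry` and its methods are hypothetical names, not existing project code.

```python
import hashlib


class DownloadRegistry:
    """Tracks fetched URLs and content hashes to skip repeat downloads."""

    def __init__(self):
        self.seen_urls = set()
        self.first_url_by_hash = {}

    def should_fetch(self, url):
        """True if this exact URL has not been registered yet."""
        return url not in self.seen_urls

    def register(self, url, data):
        """Record a download; return True only if the content is new."""
        self.seen_urls.add(url)
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.first_url_by_hash:
            return False  # same bytes already stored under another URL
        self.first_url_by_hash[digest] = url
        return True
```

A byte-level hash only catches identical files; re-exported versions of the same document would still need a metadata comparison.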
The ESP framework pdf has good pdf metadata we can use, specifically with created/modified at tags.
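PDF creation/modification dates are usually stored as strings like `D:YYYYMMDDHHmmSS+HH'mm'`. Assuming we read the raw string with some pdf library (not shown here), a small parser along these lines could turn it into a `datetime`. The function name and the sample values in the comments are illustrative, not taken from the Citi file.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm' offset.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(raw):
    """Parse a PDF date string; return None if it does not fit the pattern."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        return None
    # Missing components fall back to the start of the period.
    year, month, day, hour, minute, second = (
        int(value) if value is not None else default
        for value, default in zip(match.groups()[:6], (0, 1, 1, 0, 0, 0))
    )
    utc_flag, sign, off_h, off_m = match.group(7, 8, 9, 10)
    tz = None
    if utc_flag:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        tz = timezone(-offset if sign == "-" else offset)
    try:
        return datetime(year, month, day, hour, minute, second, tzinfo=tz)
    except ValueError:
        return None  # e.g. month 13
```

Dates without an offset come back as naive datetimes, so comparisons between documents should handle both cases.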
https://www.credit-agricole.com/en/responsible-and-committed/our-csr-strategy-be-an-actor-of-a-sustainable-society/our-sector-policies https://www.ca-cib.com/about-us/committed-and-responsible/our-sustainable-financing-policy
Standard HTML, but not very nicely structured. Nested lists and divs that include hidden navigation elements.
Main doctype: pdf
Ugly structure:
<div class="listing-bloc-actionnaire col-md-4 col-xs-12 pB-5 pL-0"> <div class="pL-0 pB-15 pR-15 dividende-1k">
<div class="border-4-all clearfix dividende-1k-content">
<div class="CA-push-title text-center">
<h2 class="bold padding-0"><span class="ezstring-field">Transportation</span></h2>
<div class="pT-3 mT-10 same-height-height-dividende pR-10 pL-10 ezxmltext-field">
<ul><li><a href="/en/pdfPreview/173428" target="_blank" title="CSR - Aviation sector policies">Aviation</a></li><li><a href="/en/pdfPreview/173429" target="_blank" title="CSR - Maritime sector policies">Shipping</a></li><li><a href="/en/pdfPreview/173430" target="_blank" title="CSR - Automotive sector policies">Automotive</a></li></ul>
</div>
</div>
</div>
</div>
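Despite the nesting, the anchors themselves are plain, so collecting every `<a>` regardless of depth should be enough. The sketch below uses the standard library's `html.parser` instead of bs4 to stay dependency-free; `LinkCollector` and `collect_links` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href, title and link text of every <a> tag, ignoring nesting."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr = dict(attrs)
            if attr.get("href"):
                self._current = {"href": attr["href"], "title": attr.get("title"), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<div class="CA-push-title text-center"><ul>'
    '<li><a href="/en/pdfPreview/173428" target="_blank" '
    'title="CSR - Aviation sector policies">Aviation</a></li>'
    '<li><a href="/en/pdfPreview/173429" target="_blank" '
    'title="CSR - Maritime sector policies">Shipping</a></li>'
    '</ul></div>'
)
for link in collect_links(sample):
    print(link["href"], "|", link["title"])
```

The `title` attribute is kept because here it is more descriptive than the link text.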
Links don't include pdf file ending on the main website:
<a href="/en/pdfPreview/173428" target="_blank" title="CSR - Aviation sector policies">Aviation</a>
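Since these URLs carry no .pdf ending, the file type has to be decided after the request. A rough sketch, assuming we have the response body and (optionally) its Content-Type header at hand; whether /en/pdfPreview/ returns the pdf directly or a viewer page has not been checked here, and `looks_like_pdf` is a hypothetical helper.

```python
from typing import Optional


def looks_like_pdf(body: bytes, content_type: Optional[str] = None) -> bool:
    """Guess whether a response is a PDF from its header or its first bytes."""
    if content_type and content_type.split(";")[0].strip().lower() == "application/pdf":
        return True
    # PDF files start with a "%PDF-" marker; allow a little leading junk.
    return b"%PDF-" in body[:1024]
```

Checking the leading bytes as well covers servers that send a generic Content-Type such as application/octet-stream.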
On the ca-cib.com website, links look like this:
<li class="text-align-justify">Mining (<a href="/sites/default/files/2020-03/Politique_Sectorielle_Groupe_Mines_m%C3%A9taux_Mars_2020-EN.pdf" target="_blank">Metals and Mining</a>)</li>
and appear to link to the same, or outdated, sector policies as indexed on the main website, but with different document names.
Pdf file links named by integer, but documents have proper name. Those I opened were .doc or .docx documents exported to pdf, as visible in their metadata (and in the document footer).
No link to proper site from BankTrack. Checking their website yields some possibly relevant links: http://en.ccb.com/en/investorv3/greenbond/zzgg/list_1.html (Green, social and sustainability bond) http://en.ccb.com/en/investorv3/greenbond/20210429_1619687957.html
http://en.ccb.com/en/investorv3/hottopic/list_1.html?ptId=10 - not yet in BT I believe.
It's all standard HTML that should be easy to scrape.
<dd><a href ="/en/investor/20210429_1619687957/20210429171923851497.pdf" title="1.2020 CCB GSS Bond Annual Report ">1.2020 CCB GSS Bond Annual Report </a></dd>
Standard pdf
No proper CSR website. http://www.citicbank.com/about/introduction/socialresponsibility/2020/ links to CSR documents, but it's in Chinese so I'm not sure it's really relevant.
http://www.citicbank.com/about/introduction/socialresponsibility/2020/ has a new tab that needs to be clicked per year, so updating this would mean adding the new year when available. In general, the code here seems very 'custom' (and bad).
The actual link to the pdf is inside of inline JS (not visible here):
</style>
<ul class="gz_li">
<a href="https://www.citicbank.com/about/introduction/socialresponsibility/202106/P020210611558203406261.pdf" target="_blank""><li><p class="gz_p1">2020-12-31</p><p class="gz_p2"> 中信银行股份2020年度可持续发展报告-中文版</p></li></a>
<a href="https://www.citicbank.com/about/introduction/socialresponsibility/202106/P020210611557920944720.pdf" target="_blank""><li><p class="gz_p1">2020-12-31</p><p class="gz_p2"> 中信银行股份2020年度可持续发展报告-英文版</p></li></a>
</ul>
<style>
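When links sit in oddly placed markup or inline scripts like this, a plain regex over the raw page source may be more robust than walking the DOM. A sketch under the assumption that the pdf URLs appear as absolute http(s) URLs in the source; the pattern and the name `find_pdf_urls` are ours.

```python
import re

# Absolute URLs ending in .pdf, stopping at quotes, whitespace or angle brackets.
PDF_URL = re.compile(r"""https?://[^\s"'<>]+?\.pdf\b""", re.IGNORECASE)


def find_pdf_urls(raw_html):
    """Return unique pdf URLs in order of first appearance."""
    return list(dict.fromkeys(PDF_URL.findall(raw_html)))


snippet = (
    '<a href="https://www.citicbank.com/about/introduction/socialresponsibility/'
    '202106/P020210611558203406261.pdf" target="_blank"">'
    '<li><p class="gz_p1">2020-12-31</p></li></a>'
)
print(find_pdf_urls(snippet))
```

This ignores relative links, so it is a fallback for sites like this one rather than a general replacement for href parsing.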
https://www.citicbank.com/about/introduction/socialresponsibility/202106/P020210611558203406261.pdf
standard pdf, working metadata.
https://www.commerzbank.de/en/nachhaltigkeit/nachhaltigkeitsstandards/positionen_und_richtlinien/positionen_und_richtlinien.html https://www.commerzbank.de/en/nachhaltigkeit/index.html
Basic website that is scrapable.
Detailed link with .pdf, alongside a fileinfo span that includes filesize. Link inlaid in main body text.
<a href="/media/nachhaltigkeit/nfe/Commerzbank_NFR_2020.pdf" title="commerzbank.com/NFR2019" alt="Combined separate non-financial report 2020 (PDF, 249)" target="_blank" class="_blank">Combined separate non-financial report 2020</a><span class="fileinfo"> (<acronym xml:lang="en" title="Portable Document Format">PDF</acronym>, 249 <acronym xml:lang="en" title="Kilobyte">kB</acronym>)</span>
Pdf exported from Word with adequate metadata.
Banktrack linked page - https://our-impact.bmo.com/?ecid=corpresp
Reports page accessible through:
<a href="/reports/" title="Reports" class="tl-header-mega"><span>Reports</span></a>
PDF links end with .pdf and discoverable through href.
<a id="" class="link" alt="Download 2020 Climate Report" target="_blank" title="2020 Climate Report PDF"
href="https://our-impact.bmo.com/wp-content/uploads/2021/03/BMO-2020-Climate-Report-2.pdf">Download Report (1.2 MB)</a>
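For sites like this one, where pdf links really do end in .pdf, a filter on the URL path is probably enough. The sketch below checks the path only, so query strings or #page fragments do not hide the extension; `is_pdf_href` is a hypothetical name.

```python
from urllib.parse import urlparse


def is_pdf_href(href):
    """True if the path component of the link ends in .pdf (case-insensitive)."""
    return urlparse(href).path.lower().endswith(".pdf")


print(is_pdf_href("https://our-impact.bmo.com/wp-content/uploads/2021/03/BMO-2020-Climate-Report-2.pdf"))
print(is_pdf_href("/reports/"))
```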
Good metadata - title, created at, modified at
Banktrack linked page - https://home.barclays/society/esg-resource-hub/statements-and-policy-positions/
PDF report files are available directly on the linked page.
PDF links end with .pdf and discoverable through href.
<a title="Opens in a new window"
href="/content/dam/home-barclays/documents/citizenship/our-reporting-and-policy-positions/Forestry-and-Agricultural-Commodities-Statement.pdf"
target="_blank">Forestry and Agricultural Commodities (148KB)</a>
Found inside a pretty nested structure though.
Created at and modified at metadata is available, but titles are not very informative e.g. "Barclays Report"
Banktrack policy link - https://shareholdersandinvestors.bbva.com/sustainability-and-responsible-banking/
Report file page is accessible through:
<a href="https://shareholdersandinvestors.bbva.com/sustainability-and-responsible-banking/presentaciones-e-informes/">Presentations and reports</a>
#2020 is appended once on the page.
PDF links end with .pdf and discoverable through href.
<a href="https://shareholdersandinvestors.bbva.com/wp-content/uploads/2021/06/TCFD-Report-Dec20_Eng.pdf" target="_blank"
class="cta-primary blue">Download PDF</a>
Struggled to find other policies that Banktrack already stores, even with the bank's own site search. Had to use a Google site search instead.
Created at and modified at metadata is available, but not title.
Banktrack linked page - https://group.bnpparibas/en/group/bnp-paribas-a-leader-in-sustainable-finance
A lot of policy descriptions on the banktrack link page.
The sector policy reports are available through
<a href="https://group.bnpparibas/en/financing-investment-policies" target="_blank"
class="greenbox-link">Our sector policies</a>
Struggled to find other policies that Banktrack already stores without using google site search.
PDF links end with .pdf and discoverable through href.
<a class="greenbox-link"
href="https://group.bnpparibas/uploads/file/csr_sector_policy_defence_final_2017_en_v_3.pdf"
target="_blank">Our Defence sector policy</a>
Created at and modified at metadata is available, but not title. At least some are doc files converted to PDF.
Banktrack linked page - https://www.santander.com/es/nuestro-compromiso/politicas
Original page is in Spanish. The English version is accessible through - https://www.santander.com/en/nuestro-compromiso/politicas
A number of PDF report links are available on this page already.
PDF links end with .pdf and discoverable through href.
<a title="Abre en ventana nueva"
href="/content/dam/santander-com/es/contenido-paginas/nuestro-compromiso/pol%C3%ADticas/do-Pol%C3%ADtica%20de%20derechos%20humanos-es.pdf"
target="_blank">Política de derechos humanos</a>
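The href above is percent-encoded, so a readable file name needs decoding before it is stored or compared. A small sketch using the standard library; `readable_filename` is a hypothetical helper.

```python
from posixpath import basename
from urllib.parse import unquote, urlparse


def readable_filename(href):
    """Take the last path segment of a link and decode its percent-escapes."""
    return unquote(basename(urlparse(href).path))


href = (
    "/content/dam/santander-com/es/contenido-paginas/nuestro-compromiso/"
    "pol%C3%ADticas/do-Pol%C3%ADtica%20de%20derechos%20humanos-es.pdf"
)
print(readable_filename(href))  # do-Política de derechos humanos-es.pdf
```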
Good metadata - title, created at, modified at, though file names and titles are in Spanish.
- Should we follow only internal links? Maybe check a regex to see if it's a related website?
- Some banks have many links - we should probably set some threshold not to scrape too many sites, which will inevitably bring up many points in the differ.
- May need per-site regex to find documents
  - get main doctype and the way its links are structured
- Files (and different versions thereof) can be linked to from many locations, and link URLs may be different - we may want to have additional checks to see if a document is supposed to be the same, or group/cluster documents based on metadata.
- Seems like there is one page containing report files, either the page banktrack links or a page that can be found by matching for “report”, “policy”, “statement”. However these pages do not seem to have all reports that banktrack stores.
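On the internal-links question, one option that avoids a hand-written regex is a per-bank allow-list of domains with a suffix match on the hostname. This is a sketch of that idea, not a decision; `is_related` and the example allow-list are assumptions (the two Crédit Agricole domains are the ones noted above).

```python
from urllib.parse import urlparse


def is_related(url, allowed_domains):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Hypothetical allow-list for one bank.
credit_agricole = {"credit-agricole.com", "ca-cib.com"}

print(is_related("https://www.ca-cib.com/about-us/committed-and-responsible/our-sustainable-financing-policy", credit_agricole))
print(is_related("https://example.org/some-page", credit_agricole))
```

Relative links have no hostname and would be rejected here, so they should be resolved against the page URL before this check.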