Web scraping project that collects all speeches by Brazilian presidents from Biblioteca da presidência (former presidents) and Discursos do Planalto (current president) for further data analysis.
- Node.js version 12 or later
- NPM version 6 or later
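The Node.js requirement above can be checked from a script before running the scrapers. This is a minimal sketch, not part of the project: `meetsMinimumMajor` and `MIN_NODE_MAJOR` are hypothetical names introduced here for illustration, and only the Node.js version is checked (not NPM).

```javascript
// Minimum Node.js major version, taken from the requirements list above.
const MIN_NODE_MAJOR = 12;

// Hypothetical helper (not part of this project): returns true when a version
// string such as "12.22.1" has a major version of at least minMajor.
function meetsMinimumMajor(version, minMajor) {
  const major = Number.parseInt(version.split(".")[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

// process.versions.node holds the running Node.js version without a leading "v".
if (!meetsMinimumMajor(process.versions.node, MIN_NODE_MAJOR)) {
  console.warn(
    `Node.js ${MIN_NODE_MAJOR}+ is required, found ${process.versions.node}`
  );
}
```

The check only warns instead of exiting, so it can be pasted at the top of a script without changing its behavior.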
- `git clone` the project
- Install all dependencies using `npm install`
- Run the main project using `node index.js` for past presidents (before Bolsonaro)
  - The downloaded files follow this folder pattern: `pdfs/fernandocollor/1990/01.pdf`
- Run the `bolsonaro.js` script (`node bolsonaro.js`) to collect all Bolsonaro speeches
- About 80% of the data collected by `index.js` and `bolsonaro.js` is in PDF format. Use `pdf-to-txt.js` to extract the text from the PDF files.
- Use `rename-files.js` to rename the files to a consistent pattern (such as `cafeFilho10.txt`, which is the 11th Café Filho speech because numbering starts at 0).
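The two naming conventions mentioned in the steps above can be illustrated with a short sketch. It is not the code in `rename-files.js`: the helpers `parseSpeechPath`, `renamedFileName`, and `planRenames` are hypothetical, the path is assumed to have exactly the four segments shown in the example, and how a folder name such as `fernandocollor` maps to a key such as `cafeFilho` is not described here, so the key is passed in by the caller.

```javascript
const path = require("path");

// Hypothetical helper: splits a path such as "pdfs/fernandocollor/1990/01.pdf"
// into its parts. Assumes the layout root/president/year/file.
function parseSpeechPath(filePath) {
  const [root, president, year, file] = filePath.split("/");
  return {
    root,
    president,
    year,
    // File name without its extension, e.g. "01" for "01.pdf".
    name: path.basename(file, path.extname(file)),
  };
}

// Hypothetical helper: builds the renamed file name. The index is zero-based,
// so index 10 is the 11th speech.
function renamedFileName(presidentKey, index) {
  return `${presidentKey}${index}.txt`;
}

// Hypothetical helper: pairs each file in an ordered list with its new name.
// It only plans the renames; nothing on disk is touched.
function planRenames(presidentKey, files) {
  return files.map((file, index) => ({
    from: file,
    to: renamedFileName(presidentKey, index),
  }));
}
```

For example, `renamedFileName("cafeFilho", 10)` returns `cafeFilho10.txt`. Planning the renames separately from applying them makes it possible to inspect the mapping before any file is changed.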