Version: 2.1, 20140927
Author: David Prager Branner
Note: 20160712 was the last day this code worked. It appears Yahoo changed the structure of its web pages after that date, so the scraping tools will have to be revised.
To explore web scraping using the headline news service on http://finance.yahoo.com.
See the PDF file in the output directory for a sample.
All code is tested with Python v. 3.4 only.
-
headline_to_db.py
: Newly revised with this version. Preliminary version ofstockscrape.py
modified to save unique headlines and associated information to a SQLite database. There will be a separatedb_to_latex.py
program to produce LaTeX output from the database, but it has not been implemented yet. -
stockscrape.py
: This is the principal scraping program. It has not been updated since v. 1.2. -
headline_length.py
: Trimmed version ofstockscrape.py
for determining the longest attested headline, link, and news-source, in preparation for creating database fields. Default list of stock tickers is the same as forstockscrape.py
; several others are found inDATA/
.
OUTPUT/
: where output file, stock_report.tex
is saved. Note that this file needs to be compiled with LaTeX in order to be usable. The LaTeX output file stock_report.pdf
will normally be found here as well; stock_report_sample.tex
and stock_report_sample.pdf
are samples, but the actual stock_report.tex
and stock_report.pdf
included in this repository are encrypted.
CODE/
: contains .tex
templates for beginning and end of output file.
DATA/
: contains .txt
file of stock tickers to be looked up, one per line. Rename stock_list_sample.txt
to stock_list.txt
to run the code with an actual example. (The file already entitled stock_list.txt
and included in this repository is encrypted.)
Scripts for use with headline_to_db.py
(to be run in the following order):
create_table.sqlscript
: Creates the three tables currently in the database.insert_all_tickers.sqlscript
: Populates the database with stock and fund tickers.
Instructions for running these scripts are found in their headers.
Note that before running headline_to_db.py
, it is best to run sqlite3
and empty the headlines
table, otherwise the LaTeX output will not populate correctly. Use the following two commands at the sqlite3
prompt:
DELETE FROM headlines; SELECT ticker,headline FROM headlines;
This will ensure that the table is empty.
Output of stockscrape.py
contains first a table of stock prices and related data, from the Yahoo API, followed by a list of recent headlines for each stock ticker. The output is a .tex
file, while must be compiled to produce human-readlable output. The LaTeX package longtable
is used, to allow breaking of tables across pages if they exceed the amount of available space on the first page.
This version scrapes the Yahoo financial news site using Beautiful Soup 4.
- Converted to class structure.
- Enabled
-v
flag for verbose output, if run from commend line. - Headlines now have clickable URLs attached to them.
- Output to STDOUT and PDF both somewhat compressed now.
- Numerous other small changes.
- Actual data files encrypted.
- Phase 1: ( Finished) Basic scraper and LaTeX output of API-data and scraped headlines.
- Phase 2: Now underway: Use SQL to store and manipulate data.
- Phase 3: Use a web framework to create a GUI for easy management of the application.
- Separate download of data from storage to database. Database was originally not committed in repository, but is being committed as part of an interim experiment. Instead, downloaded data should be saved to a text file; the database should be recoverable from these text files.
Past versions:
- 2.0, 20130405
- 1.5, 20130313
- 1.42, 20130304
- 1.41, 20130228
- 1.4, 20130227
- 1.3, 20130225
- 1.2, 20130222
- 1.1, 20130220
- 1.0, 20130219
- 0.4, 20130214
- 0.3, 20130213
- 0.2, 20130213
- 0.1, 20130212
[end]