diff --git a/_freeze/python/accessing-and-managing-financial-data/execute-results/html.json b/_freeze/python/accessing-and-managing-financial-data/execute-results/html.json index 1e822669..4978b5db 100644 --- a/_freeze/python/accessing-and-managing-financial-data/execute-results/html.json +++ b/_freeze/python/accessing-and-managing-financial-data/execute-results/html.json @@ -1,15 +1,15 @@ { - "hash": "9fd328b50bbd06951670e009c36376c7", + "hash": "7f6920c8504bdf5ab2869b676468fe9a", "result": { "engine": "jupyter", - "markdown": "---\ntitle: Accessing and Managing Financial Data\nmetadata:\n pagetitle: Accessing and Managing Financial Data with Python\n description-meta: Download and organize open-source financial data using the programming language Python. \n---\n\n\n\n::: callout-note\nYou are reading **Tidy Finance with Python**. You can find the equivalent chapter for the sibling **Tidy Finance with R** [here](../r/accessing-and-managing-financial-data.qmd).\n:::\n\nIn this chapter, we suggest a way to organize your financial data. Everybody who has experience with data is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome when using different data formats and across different projects. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages. \n\nThis chapter shows how to import different open-source datasets. Specifically, our data comes from the application programming interface (API) of Yahoo Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series.\\index{API} We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the Python packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n::: {#52142987 .cell execution_count=2}\n``` {.python .cell-code}\nimport pandas as pd\nimport numpy as np\nimport tidyfinance as tf\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n::: {#dd94c04a .cell execution_count=3}\n``` {.python .cell-code}\nstart_date = \"1960-01-01\"\nend_date = \"2024-12-31\"\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. Fortunately, the `pandas-datareader` package provides a simple interface to read data from Kenneth French's Data Library.\\index{Data!Fama-French factors}\\index{Kenneth French homepage}\n\n::: {#4b8d1ab2 .cell execution_count=4}\n``` {.python .cell-code}\nimport pandas_datareader as pdr\n```\n:::\n\n\nWe can use the `pdr.DataReader()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market (`mkt_excess`), size (`smb`), and value (`hml`) factors alongside the risk-free rates (`rf`). 
Note that we have to do some manual work to parse all the columns correctly and scale them appropriately, as the raw Fama-French data comes in a unique data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to`pandas_datareader`.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n::: {#dbee5fef .cell execution_count=5}\n``` {.python .cell-code}\nfactors_ff3_monthly_raw = pdr.DataReader(\n name=\"F-F_Research_Data_Factors\",\n data_source=\"famafrench\", \n start=start_date, \n end=end_date)[0]\n\nfactors_ff3_monthly = (factors_ff3_monthly_raw\n .divide(100)\n .reset_index(names=\"date\")\n .assign(date=lambda x: pd.to_datetime(x[\"date\"].astype(str)))\n .rename(str.lower, axis=\"columns\")\n .rename(columns={\"mkt-rf\": \"mkt_excess\"})\n)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability (`rmw`) and investment (`cma`) factors. We demonstrate how the monthly factors are constructed in [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n\n::: {#9e9e1781 .cell execution_count=6}\n``` {.python .cell-code}\nfactors_ff5_monthly_raw = pdr.DataReader(\n name=\"F-F_Research_Data_5_Factors_2x3\",\n data_source=\"famafrench\", \n start=start_date, \n end=end_date)[0]\n\nfactors_ff5_monthly = (factors_ff5_monthly_raw\n .divide(100)\n .reset_index(names=\"date\")\n .assign(date=lambda x: pd.to_datetime(x[\"date\"].astype(str)))\n .rename(str.lower, axis=\"columns\")\n .rename(columns={\"mkt-rf\": \"mkt_excess\"})\n)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function. \n\n::: {#f0848293 .cell execution_count=7}\n``` {.python .cell-code}\nfactors_ff3_daily_raw = pdr.DataReader(\n name=\"F-F_Research_Data_Factors_daily\",\n data_source=\"famafrench\", \n start=start_date, \n end=end_date)[0]\n\nfactors_ff3_daily = (factors_ff3_daily_raw\n .divide(100)\n .reset_index(names=\"date\")\n .rename(str.lower, axis=\"columns\")\n .rename(columns={\"mkt-rf\": \"mkt_excess\"})\n)\n```\n:::\n\n\nIn a subsequent chapter, we also use the monthly returns from ten industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n::: {#cbe8cb33 .cell execution_count=8}\n``` {.python .cell-code}\nindustries_ff_monthly_raw = pdr.DataReader(\n name=\"10_Industry_Portfolios\",\n data_source=\"famafrench\", \n start=start_date, \n end=end_date)[0]\n\nindustries_ff_monthly = (industries_ff_monthly_raw\n .divide(100)\n .reset_index(names=\"date\")\n .assign(date=lambda x: pd.to_datetime(x[\"date\"].astype(str)))\n .rename(str.lower, axis=\"columns\")\n)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. 
You should check out the other sets by calling `pdr.famafrench.get_available_datasets()`.\n\nTo automatically download and process Fama-French data, you can also use the `tidyfinance` package with `domain=\"factors_ff\"` and the corresponding dataset, e.g.:\n\n::: {#557adcfb .cell execution_count=9}\n``` {.python .cell-code}\ntf.download_data(\n domain=\"factors_ff\",\n dataset=\"F-F_Research_Data_Factors\", \n start_date=start_date, \n end_date=end_date\n)\n```\n:::\n\n\nThe `tidyfinance` package implements the processing steps as above and returns the same cleaned data frame. \n\n## q-Factors\n\nIn recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further information. The *q*-factors can be downloaded directly from the authors' homepage from within `pd.read_csv()`. \\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we rename the columns with the \"R_\"-prescript using regular expressions and write all column names in lowercase. We then query the data to select observations between the start and end dates. Finally, we use the double asterisk (`**`) notation in the `assign` function to apply the same transform of dividing by 100 to all four factors by iterating through them. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide} note that we temporarily adjust the SSL certificate handling behavior in Python’s \n`ssl` module when retrieving the $q$-factors directly from the web, as demonstrated in [Working with Stock Returns](working-with-stock-returns.qmd). This method should be used with caution, which is why we restore the default settings immediately after successfully downloading the data.\n\n::: {#1bcb6c34 .cell execution_count=10}\n``` {.python .cell-code}\nimport ssl\nssl._create_default_https_context = ssl._create_unverified_context\n\nfactors_q_monthly_link = (\n \"https://global-q.org/uploads/1/2/2/6/122679606/\"\n \"q5_factors_monthly_2024.csv\"\n)\n\nfactors_q_monthly = (pd.read_csv(factors_q_monthly_link)\n .assign(\n date=lambda x: (\n pd.to_datetime(x[\"year\"].astype(str) + \"-\" +\n x[\"month\"].astype(str) + \"-01\"))\n )\n .drop(columns=[\"R_F\", \"R_MKT\", \"year\"])\n .rename(columns=lambda x: x.replace(\"R_\", \"\").lower())\n .query(f\"date >= '{start_date}' and date <= '{end_date}'\")\n .assign(\n **{col: lambda x: x[col]/100 for col in [\"me\", \"ia\", \"roe\", \"eg\"]}\n )\n)\n\nssl._create_default_https_context = ssl.create_default_context\n```\n:::\n\n\nAgain, you can use the `tidyfinance` package for a shortcut:\n\n::: {#5cc217ab .cell execution_count=11}\n``` {.python .cell-code}\ntf.download_data(\n domain=\"factors_q\",\n dataset=\"q5_factors_monthly\", \n start_date=start_date, \n end_date=end_date\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
|     | date       | risk_free | mkt_excess | me        | ia        | roe       | eg        |
|-----|------------|-----------|------------|-----------|-----------|-----------|-----------|
| 0   | 1967-01-01 | 0.003927  | 0.081852   | 0.068122  | -0.029263 | 0.018813  | -0.025511 |
| 1   | 1967-02-01 | 0.003743  | 0.007557   | 0.016235  | -0.002915 | 0.035399  | 0.021792  |
| 2   | 1967-03-01 | 0.003693  | 0.040169   | 0.019836  | -0.016772 | 0.018417  | -0.011192 |
| 3   | 1967-04-01 | 0.003344  | 0.038786   | -0.006700 | -0.028972 | 0.010253  | -0.016371 |
| 4   | 1967-05-01 | 0.003126  | -0.042807  | 0.027457  | 0.021864  | 0.005901  | 0.001191  |
| ... | ...        | ...       | ...        | ...       | ...       | ...       | ...       |
| 691 | 2024-08-01 | 0.004419  | 0.016518   | -0.040817 | 0.004687  | 0.018369  | 0.008116  |
| 692 | 2024-09-01 | 0.004619  | 0.016806   | -0.011967 | -0.000010 | 0.007408  | -0.032810 |
| 693 | 2024-10-01 | 0.003907  | -0.009701  | -0.011261 | -0.011676 | -0.002314 | -0.008335 |
| 694 | 2024-11-01 | 0.003955  | 0.065002   | 0.043985  | -0.049491 | -0.015370 | -0.021420 |
| 695 | 2024-12-01 | 0.003663  | -0.031637  | -0.051564 | -0.003684 | -0.021442 | 0.049624  |

696 rows × 7 columns
\n```\n:::\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) Since the data is an XLSX-file stored on a public Google Drive location, we need additional packages to access the data directly from our Python session. Usually, you need to authenticate if you interact with Google drive directly in Python. Since the data is stored via a public link, we can proceed without any authentication.\\index{Google Drive}\n\n::: {#7bd33a2b .cell execution_count=12}\n``` {.python .cell-code}\nsheet_id = \"1bM7vCWd3WOt95Sf9qjLPZjoiafgF_8EG\"\nsheet_name = \"macro_predictors.xlsx\"\nmacro_predictors_link = (\n f\"https://docs.google.com/spreadsheets/d/{sheet_id}\" \n f\"/gviz/tq?tqx=out:csv&sheet={sheet_name}\"\n)\n```\n:::\n\n\nNext, we read in the new data and transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006]. \n1. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978]. \n1. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988]. \n1. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998]. \n1. Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n1. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997].\n1. Net equity expansion (`ntis`), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n1. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis [@Campbell1987].\n1. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Long-term rate of returns (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n1. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989]. \n1. 
Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\t\t\t\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website.](https://sites.google.com/view/agoyal145)\n\n::: {#3fc6df88 .cell execution_count=13}\n``` {.python .cell-code}\nssl._create_default_https_context = ssl._create_unverified_context\n\nmacro_predictors = (\n pd.read_csv(macro_predictors_link, thousands=\",\")\n .assign(\n date=lambda x: pd.to_datetime(x[\"yyyymm\"], format=\"%Y%m\"),\n dp=lambda x: np.log(x[\"D12\"])-np.log(x[\"Index\"]),\n dy=lambda x: np.log(x[\"D12\"])-np.log(x[\"Index\"].shift(1)),\n ep=lambda x: np.log(x[\"E12\"])-np.log(x[\"Index\"]),\n de=lambda x: np.log(x[\"D12\"])-np.log(x[\"E12\"]),\n tms=lambda x: x[\"lty\"]-x[\"tbl\"],\n dfy=lambda x: x[\"BAA\"]-x[\"AAA\"]\n )\n .rename(columns={\"b/m\": \"bm\"})\n .get([\"date\", \"dp\", \"dy\", \"ep\", \"de\", \"svar\", \"bm\", \n \"ntis\", \"tbl\", \"lty\", \"ltr\", \"tms\", \"dfy\", \"infl\"])\n .query(\"date >= @start_date and date <= @end_date\")\n .dropna()\n)\n\nssl._create_default_https_context = ssl.create_default_context\n```\n:::\n\n\nTo get the equivalent data through `tidyfinance`, you can call:\n\n::: {#5f267096 .cell execution_count=14}\n``` {.python .cell-code}\ntf.download_data(\n domain=\"macro_predictors\",\n dataset=\"monthly\",\n start_date=start_date, \n end_date=end_date\n)\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. As an illustration, we use the already familiar `pandas-datareader` package to fetch consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS) key.\\index{Data!FRED}\\index{Data!CPI}\n\n::: {#b0515f3b .cell execution_count=15}\n``` {.python .cell-code}\ncpi_monthly = (pdr.DataReader(\n name=\"CPIAUCNS\", \n data_source=\"fred\", \n start=start_date, \n end=end_date\n )\n .reset_index(names=\"date\")\n .rename(columns={\"CPIAUCNS\": \"cpi\"})\n .assign(cpi=lambda x: x[\"cpi\"] / x[\"cpi\"].iloc[-1])\n)\n```\n:::\n\n\nNote that we use the `assign()` in the last line to set the current (latest) price level as the reference inflation level. To download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key.\n\nThe `tidyfinance` package can, of course, also fetch the same daily data and many more data series:\n\n::: {#ceef174a .cell execution_count=16}\n``` {.python .cell-code}\ntf.download_data(\n domain=\"fred\",\n series=\"CPIAUCNS\", \n start_date=start_date, \n end_date=end_date\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFailed to retrieve data for series CPIAUCNS: Failed to perform, curl: (6) Could not resolve host: https. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.\nFailed to retrieve data for series CPIAUCNS: 'date'\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=16}\n```{=html}\n
| date | value | series |
|------|-------|--------|
\n```\n:::\n:::\n\n\nTo download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. If your desired time series is not supported through tidyfinance, we recommend working with the `fredapi` package. Note that you need to get an API key to use its functionality. We refer to the package documentation for details.\n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our Python session, let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code. \n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://SQLite.org/)-database, which is the C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases.\\index{Database!SQLite}\n\n::: {#a10081e3 .cell execution_count=17}\n``` {.python .cell-code}\nimport sqlite3\n```\n:::\n\n\nAn SQLite-database is easily created - the code below is really all there is. You do not need any external software. Otherwise, date columns are stored and retrieved as integers.\\index{Database!Creation} We will use the file `tidy_finance_r.sqlite`, located in the data subfolder, to retrieve data for all subsequent chapters. The initial part of the code ensures that the directory is created if it does not already exist.\n\n::: {#4b49f781 .cell execution_count=18}\n``` {.python .cell-code}\nimport os\n\nif not os.path.exists(\"data\"):\n os.makedirs(\"data\")\n \ntidy_finance = sqlite3.connect(database=\"data/tidy_finance_python.sqlite\")\n```\n:::\n\n\nNext, we create a remote table with the monthly Fama-French factor data. We do so with the `pandas` function `to_sql()`, which copies the data to our SQLite-database.\n\n::: {#c2800478 .cell execution_count=19}\n``` {.python .cell-code}\n(factors_ff3_monthly\n .to_sql(name=\"factors_ff3_monthly\", \n con=tidy_finance, \n if_exists=\"replace\",\n index=False)\n)\n```\n:::\n\n\nNow, if we want to have the whole table in memory, we need to call `pd.read_sql_query()` with the corresponding query. You will see that we regularly load the data into the memory in the next chapters.\\index{Database!Read}\n\n::: {#dbe240b7 .cell execution_count=20}\n``` {.python .cell-code}\npd.read_sql_query(\n sql=\"SELECT date, rf FROM factors_ff3_monthly\",\n con=tidy_finance,\n parse_dates={\"date\"}\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=20}\n```{=html}\n
|     | date       | rf     |
|-----|------------|--------|
| 0   | 1960-01-01 | 0.0033 |
| 1   | 1960-02-01 | 0.0029 |
| 2   | 1960-03-01 | 0.0035 |
| 3   | 1960-04-01 | 0.0019 |
| 4   | 1960-05-01 | 0.0027 |
| ... | ...        | ...    |
| 775 | 2024-08-01 | 0.0048 |
| 776 | 2024-09-01 | 0.0040 |
| 777 | 2024-10-01 | 0.0039 |
| 778 | 2024-11-01 | 0.0040 |
| 779 | 2024-12-01 | 0.0037 |

780 rows × 2 columns
\n```\n:::\n:::\n\n\nThe last couple of code chunks are really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages. \n\nBefore we move on to the next data source, let us also store the other six tables in our new SQLite database. \n\n::: {#4a6705c7 .cell execution_count=21}\n``` {.python .cell-code}\ndata_dict = {\n \"factors_ff5_monthly\": factors_ff5_monthly,\n \"factors_ff3_daily\": factors_ff3_daily,\n \"industries_ff_monthly\": industries_ff_monthly, \n \"factors_q_monthly\": factors_q_monthly,\n \"macro_predictors\": macro_predictors,\n \"cpi_monthly\": cpi_monthly\n}\n\nfor key, value in data_dict.items():\n value.to_sql(name=key,\n con=tidy_finance, \n if_exists=\"replace\",\n index=False)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow two steps: (i) Establish the connection to the SQLite-database and (ii) execute the query to fetch the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n::: {#045cddbc .cell message='false' results='false' execution_count=22}\n``` {.python .cell-code}\nimport pandas as pd\nimport sqlite3\n\ntidy_finance = sqlite3.connect(database=\"data/tidy_finance_python.sqlite\")\n\nfactors_q_monthly = pd.read_sql_query(\n sql=\"SELECT * FROM factors_q_monthly\",\n con=tidy_finance,\n parse_dates={\"date\"}\n)\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command in the database, which rebuilds the database and frees up unused space. You can execute the command in the database using the `execute()` function. \n\n::: {#b530109e .cell execution_count=23}\n``` {.python .cell-code}\ntidy_finance.execute(\"VACUUM\")\n```\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read about in [this tutorial.](https://SQLite.org/docs/sql/statements/vacuum.html) \\index{Database!Cleaning}\n\n## Key Takeaways\n\n- Importing Fama-French factors, q-factors, macroeconomic indicators, and CPI data is simplified through API calls, CSV parsing, and web scraping techniques.\n- The `tidyfinance` Python package offers pre-processed access to financial datasets, reducing manual data cleaning and saving valuable time.\n- Creating a centralized SQLite database helps manage and organize data efficiently across projects, while maintaining reproducibility.\n- Structured database storage supports scalable data access, which is essential for long-term academic projects and collaborative work in finance.\n\n## Exercises\n\n1. Download the monthly Fama-French factors manually from [Kenneth French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `pd.read_csv()`. Validate that you get the same data as via the `pandas-datareader` package. \n1. Download the daily Fama-French 5 factors using the `pdr.DataReader()` package. 
After the successful download and conversion to the column format that we used above, compare the `rf`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find. \n\n", + "markdown": "---\ntitle: Accessing and Managing Financial Data\nmetadata:\n pagetitle: Accessing and Managing Financial Data with Python\n description-meta: Download and organize open-source financial data using the programming language Python. \n---\n\n\n\n::: callout-note\nYou are reading **Tidy Finance with Python**. You can find the equivalent chapter for the sibling **Tidy Finance with R** [here](../r/accessing-and-managing-financial-data.qmd).\n:::\n\nIn this chapter, we suggest a way to organize your financial data. Everybody who has experience with data is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome when using different data formats and across different projects. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages. \n\nThis chapter shows how to import different open-source datasets. Specifically, our data comes from the application programming interface (API) of Yahoo Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series.\\index{API} We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the Python packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n::: {#2068e538 .cell execution_count=2}\n``` {.python .cell-code}\nimport pandas as pd\nimport numpy as np\nimport io\nimport re\nimport zipfile\nfrom curl_cffi import requests\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n::: {#c7108e44 .cell execution_count=3}\n``` {.python .cell-code}\nstart_date = \"1960-01-01\"\nend_date = \"2024-12-31\"\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. The data are freely available from Kenneth French’s Data Library, but the raw files come in a rather idiosyncratic format. If you access the data via the website, the manual *raw* workflow looks like this:\n\n1. Go to the website\n1. Find the right dataset\n1. Download a ZIP file\n1. Extract the CSV inside\n1. Select the right data table from the file and import the table into Python\n1. Clean the dates, scale the returns, fix column names, handle missing values, etc.\n\nDoing this once is fine; doing it repeatedly across projects is exactly the type of boilerplate that’s easy to mess up and annoying to maintain. It is therefore natural to automate these steps in Python.\n\n### From Manual Steps to a Download Script\n\nA minimal download script mirrors the manual steps one by one. 
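The script relies on the `requests` interface of the `curl_cffi` package imported above, presumably because it can impersonate a real browser, which helps with servers that reject plain Python HTTP clients; the standard `requests` package exposes the same `get()` call if you prefer it. 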
For example, to fetch a Fama–French dataset you first construct the URL:\n\n::: {#9ab8e5db .cell execution_count=4}\n``` {.python .cell-code}\ndataset = \"F-F_Research_Data_Factors\"\nbase_url = \"http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/\"\nurl = f\"{base_url}{dataset}_CSV.zip\"\n```\n:::\n\n\nNext, you replace the browser download with an HTTP request and extract the ZIP in memory:\n\n::: {#aafb772e .cell execution_count=5}\n``` {.python .cell-code}\nresp = requests.get(url)\nresp.raise_for_status()\n\nwith zipfile.ZipFile(io.BytesIO(resp.content)) as zf:\n file_name = zf.namelist()[0] # Ken French ZIPs contain one file\n raw_text = zf.read(file_name).decode(\"latin1\")\n```\n:::\n\n\nThe most important part of this chunk is the `requests.get()` call. This is the moment where we replace all the manual browser work (open the website, click download, save the file) with a single, reproducible line of code. Then, calling `raise_for_status()` ensures we stop immediately if the server returns an error (e.g. HTTP 404 or 500) instead of quietly handling a broken file. Once this succeeds, `resp.content` is guaranteed to contain valid ZIP bytes that we can open in memory.\n\nThe raw file contains documentation text followed by the actual data table(s). To emulate *scrolling down until the numbers start*, you can split the file into blocks and keep the long one that contains the table:\n\n::: {#61d53ef8 .cell execution_count=6}\n``` {.python .cell-code}\nchunks = raw_text.split(\"\\r\\n\\r\\n\")\ntable_text = max(chunks, key=len) \n```\n:::\n\n\nWithin this block, the first CSV header line starts at the first line beginning with a comma. We add a “Date” label for the index and pass everything to `read_csv`:\n\n::: {#d72929a3 .cell execution_count=7}\n``` {.python .cell-code}\nmatch = re.search(r\"^\\s*,\", table_text, flags=re.M)\nstart = match.start()\ncsv_text = \"Date\" + table_text[start:]\n\nfactors_ff_raw = pd.read_csv(io.StringIO(csv_text), index_col=0)\n```\n:::\n\n\nAt this point, the index still consists of integer date codes with different lengths depending on the frequency. 
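A quick way to check which format you are dealing with is to peek at the first few index values; the output shown in the comments is illustrative, not something the authors print:\n\n``` {.python .cell-code}\n# Inspect the raw index parsed from the CSV; monthly files yield\n# six-digit codes, daily files eight digits, annual files four\nprint(factors_ff_raw.index[:3])\n# e.g., Index([196001, 196002, 196003], dtype='int64', name='Date')\n```\n\n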
We need a bit of logic to convert them into a proper `DatetimeIndex`:\n\n::: {#7a97a04a .cell execution_count=8}\n``` {.python .cell-code}\ns = factors_ff_raw.index.astype(str)\n\nif (s.str.len() == 8).all(): # daily: YYYYMMDD\n dt = pd.to_datetime(s, format=\"%Y%m%d\")\nelif (s.str.len() == 6).all(): # monthly: YYYYMM\n dt = pd.to_datetime(s + \"01\", format=\"%Y%m%d\")\nelif (s.str.len() == 4).all(): # annual: YYYY\n dt = pd.to_datetime(s + \"0101\", format=\"%Y%m%d\")\n # pd.to_datetime() returns a DatetimeIndex here, which has no .dt\n # accessor, so we call .to_period() and .to_timestamp() directly\n dt = dt.to_period(\"A-DEC\").to_timestamp(how=\"end\")\nelse:\n raise ValueError(\"Unknown date format in Fama–French index.\")\n\nfactors_ff_raw = factors_ff_raw.set_index(dt)\nfactors_ff_raw.index.name = \"date\"\n```\n:::\n\n\nFinally, we still have to clean the data:\n\n- Replace the special missing-value codes (-99.99 and -999) with actual missing values.\n- Convert returns from percent to decimal.\n- Standardize column names (e.g., all lowercase, `Mkt-RF` to `mkt_excess`, and `RF` to `risk_free`).\n- Filter the data to the desired start and end dates.\n\nPutting it all together could look like this:\n\n::: {#77d01e70 .cell execution_count=9}\n``` {.python .cell-code}\n# Restrict the sample to the requested date range\nif start_date:\n factors_ff_raw = factors_ff_raw[factors_ff_raw.index >= pd.to_datetime(start_date)]\nif end_date:\n factors_ff_raw = factors_ff_raw[factors_ff_raw.index <= pd.to_datetime(end_date)]\n\nfactors_ff3_monthly = (factors_ff_raw\n # Replace missing-value codes before scaling; otherwise -99.99\n # would already have been turned into -0.9999 by the division\n .replace([-99.99, -999], np.nan)\n .div(100)\n .reset_index(names=\"date\")\n .rename(columns=str.lower)\n .rename(columns={\"mkt-rf\": \"mkt_excess\", \"rf\": \"risk_free\"})\n)\nfactors_ff3_monthly\n```\n\n::: {.cell-output .cell-output-display execution_count=37}\n```{=html}\n
|     | date       | mkt_excess | smb     | hml     | risk_free |
|-----|------------|------------|---------|---------|-----------|
| 0   | 1960-01-01 | -0.0698    | 0.0212  | 0.0265  | 0.0033    |
| 1   | 1960-02-01 | 0.0116     | 0.0060  | -0.0197 | 0.0029    |
| 2   | 1960-03-01 | -0.0163    | -0.0055 | -0.0275 | 0.0035    |
| 3   | 1960-04-01 | -0.0171    | 0.0022  | -0.0214 | 0.0019    |
| 4   | 1960-05-01 | 0.0312     | 0.0129  | -0.0373 | 0.0027    |
| ... | ...        | ...        | ...     | ...     | ...       |
| 775 | 2024-08-01 | 0.0160     | -0.0349 | -0.0110 | 0.0048    |
| 776 | 2024-09-01 | 0.0172     | -0.0013 | -0.0277 | 0.0040    |
| 777 | 2024-10-01 | -0.0100    | -0.0099 | 0.0086  | 0.0039    |
| 778 | 2024-11-01 | 0.0649     | 0.0446  | 0.0015  | 0.0040    |
| 779 | 2024-12-01 | -0.0317    | -0.0271 | -0.0300 | 0.0037    |

780 rows × 5 columns
\n```\n:::\n:::\n\n\nAll of these steps are doable, but none of them are really about finance - they are just the technical scaffolding required before you can work with the actual factor returns. That’s where a dedicated helper or package becomes invaluable. The `tidyfinance` package performs this entire workflow under the hood: you request a Fama–French dataset and receive a clean, consistently formatted data table from Kenneth French's Data Library.\\index{Data!Fama-French factors}\\index{Kenneth French homepage} This avoids repetitive boilerplate, reduces errors, and lets you focus on modeling and analysis rather than on data plumbing.\n\n### Using `tidyfinance` Instead of Reimplementing the Plumbing\n\n::: {#18f92154 .cell execution_count=10}\n``` {.python .cell-code}\nimport tidyfinance as tf\n```\n:::\n\n\nFor example, we can use the `tf.download_data()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market (`mkt_excess`), size (`smb`), and value (`hml`) factors alongside the risk-free rates (`risk_free`). Note that the `tf.download_data()` function parses all the columns correctly and already scales them appropriately, as the raw Fama-French data comes in a rather idiosyncratic format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to the `tidyfinance` package.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n::: {#fd60bf92 .cell execution_count=11}\n``` {.python .cell-code}\nfactors_ff3_monthly = tf.download_data(\n domain=\"famafrench\",\n dataset=\"F-F_Research_Data_Factors\",\n start_date=start_date,\n end_date=end_date,\n)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability (`rmw`) and investment (`cma`) factors. We demonstrate how the monthly factors are constructed in [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n\n::: {#bf347fbd .cell execution_count=12}\n``` {.python .cell-code}\nfactors_ff5_monthly = tf.download_data(\n domain=\"famafrench\",\n dataset=\"F-F_Research_Data_5_Factors_2x3\",\n start_date=start_date,\n end_date=end_date,\n)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function. \n\n::: {#7708ed92 .cell execution_count=13}\n``` {.python .cell-code}\nfactors_ff3_daily = tf.download_data(\n domain=\"famafrench\",\n dataset=\"F-F_Research_Data_Factors_daily\",\n start_date=start_date,\n end_date=end_date,\n)\n```\n:::\n\n\nIn a subsequent chapter, we also use the monthly returns from ten industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n::: {#6a04aed2 .cell execution_count=14}\n``` {.python .cell-code}\nindustries_ff_monthly = tf.download_data(\n domain=\"famafrench\",\n dataset=\"10_Industry_Portfolios\",\n start_date=start_date,\n end_date=end_date,\n)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. You should check out the other sets by calling `tf.get_available_famafrench_datasets()`.\n\n## q-Factors\n\nIn recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. 
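Its q5 specification prices returns with five factors: the market excess return, a size factor (`me`), an investment factor (`ia`), a return-on-equity factor (`roe`), and an expected growth factor (`eg`); these correspond to the columns of the data frame we download below. 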
We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further details. The *q*-factors can be downloaded directly from the authors' homepage from within `pd.read_csv()`.\\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we strip the \"R_\" prefix from the remaining column names and write all column names in lowercase. We then query the data to select observations between the start and end dates. Finally, we use the double asterisk (`**`) notation in the `assign` function to apply the same transform of dividing by 100 to all four factors by iterating through them. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide} Note that we temporarily adjust the SSL certificate handling behavior in Python’s `ssl` module when retrieving the $q$-factors directly from the web, as demonstrated in [Working with Stock Returns](working-with-stock-returns.qmd). This method should be used with caution, which is why we restore the default settings immediately after successfully downloading the data.\n\n::: {#4bf4afe7 .cell execution_count=15}\n``` {.python .cell-code}\nimport ssl\nssl._create_default_https_context = ssl._create_unverified_context\n\nfactors_q_monthly_link = (\n \"https://global-q.org/uploads/1/2/2/6/122679606/\"\n \"q5_factors_monthly_2024.csv\"\n)\n\nfactors_q_monthly = (pd.read_csv(factors_q_monthly_link)\n .assign(\n date=lambda x: (\n pd.to_datetime(x[\"year\"].astype(str) + \"-\" +\n x[\"month\"].astype(str) + \"-01\"))\n )\n .drop(columns=[\"R_F\", \"R_MKT\", \"year\", \"month\"])\n .rename(columns=lambda x: x.replace(\"R_\", \"\").lower())\n .query(f\"date >= '{start_date}' and date <= '{end_date}'\")\n .assign(\n # Bind col as a default argument; a plain lambda would otherwise\n # evaluate all four columns with the last value of col\n **{col: (lambda x, col=col: x[col] / 100)\n for col in [\"me\", \"ia\", \"roe\", \"eg\"]}\n )\n)\n\nssl._create_default_https_context = ssl.create_default_context\n```\n:::\n\n\nAgain, you can use the `tidyfinance` package for a shortcut:\n\n::: {#87c32f1a .cell execution_count=16}\n``` {.python .cell-code}\ntf.download_data(\n domain=\"factors_q\",\n dataset=\"q5_factors_monthly\", \n start_date=start_date, \n end_date=end_date\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=44}\n```{=html}\n
|     | date       | risk_free | mkt_excess | me        | ia        | roe       | eg        |
|-----|------------|-----------|------------|-----------|-----------|-----------|-----------|
| 0   | 1967-01-01 | 0.003927  | 0.081852   | 0.068122  | -0.029263 | 0.018813  | -0.025511 |
| 1   | 1967-02-01 | 0.003743  | 0.007557   | 0.016235  | -0.002915 | 0.035399  | 0.021792  |
| 2   | 1967-03-01 | 0.003693  | 0.040169   | 0.019836  | -0.016772 | 0.018417  | -0.011192 |
| 3   | 1967-04-01 | 0.003344  | 0.038786   | -0.006700 | -0.028972 | 0.010253  | -0.016371 |
| 4   | 1967-05-01 | 0.003126  | -0.042807  | 0.027457  | 0.021864  | 0.005901  | 0.001191  |
| ... | ...        | ...       | ...        | ...       | ...       | ...       | ...       |
| 691 | 2024-08-01 | 0.004419  | 0.016518   | -0.040817 | 0.004687  | 0.018369  | 0.008116  |
| 692 | 2024-09-01 | 0.004619  | 0.016806   | -0.011967 | -0.000010 | 0.007408  | -0.032810 |
| 693 | 2024-10-01 | 0.003907  | -0.009701  | -0.011261 | -0.011676 | -0.002314 | -0.008335 |
| 694 | 2024-11-01 | 0.003955  | 0.065002   | 0.043985  | -0.049491 | -0.015370 | -0.021420 |
| 695 | 2024-12-01 | 0.003663  | -0.031637  | -0.051564 | -0.003684 | -0.021442 | 0.049624  |

696 rows × 7 columns
\n```\n:::\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) Since the data is an XLSX-file stored on a public Google Drive location, we need additional packages to access the data directly from our Python session. Usually, you need to authenticate if you interact with Google drive directly in Python. Since the data is stored via a public link, we can proceed without any authentication.\\index{Google Drive}\n\n::: {#6ed1395b .cell execution_count=17}\n``` {.python .cell-code}\nsheet_id = \"1bM7vCWd3WOt95Sf9qjLPZjoiafgF_8EG\"\nsheet_name = \"macro_predictors.xlsx\"\nmacro_predictors_link = (\n f\"https://docs.google.com/spreadsheets/d/{sheet_id}\" \n f\"/gviz/tq?tqx=out:csv&sheet={sheet_name}\"\n)\n```\n:::\n\n\nNext, we read in the new data and transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006]. \n1. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978]. \n1. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988]. \n1. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998]. \n1. Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n1. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997].\n1. Net equity expansion (`ntis`), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n1. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis [@Campbell1987].\n1. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Long-term rate of returns (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n1. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989]. \n1. 
Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\t\t\t\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website.](https://sites.google.com/view/agoyal145)\n\n::: {#af3f3685 .cell execution_count=18}\n``` {.python .cell-code}\nssl._create_default_https_context = ssl._create_unverified_context\n\nmacro_predictors = (\n pd.read_csv(macro_predictors_link, thousands=\",\")\n .assign(\n date=lambda x: pd.to_datetime(x[\"yyyymm\"], format=\"%Y%m\"),\n dp=lambda x: np.log(x[\"D12\"])-np.log(x[\"Index\"]),\n dy=lambda x: np.log(x[\"D12\"])-np.log(x[\"Index\"].shift(1)),\n ep=lambda x: np.log(x[\"E12\"])-np.log(x[\"Index\"]),\n de=lambda x: np.log(x[\"D12\"])-np.log(x[\"E12\"]),\n tms=lambda x: x[\"lty\"]-x[\"tbl\"],\n dfy=lambda x: x[\"BAA\"]-x[\"AAA\"]\n )\n .rename(columns={\"b/m\": \"bm\"})\n .get([\"date\", \"dp\", \"dy\", \"ep\", \"de\", \"svar\", \"bm\", \n \"ntis\", \"tbl\", \"lty\", \"ltr\", \"tms\", \"dfy\", \"infl\"])\n .query(\"date >= @start_date and date <= @end_date\")\n .dropna()\n)\n\nssl._create_default_https_context = ssl.create_default_context\n```\n:::\n\n\nTo get the equivalent data through `tidyfinance`, you can call:\n\n::: {#fa0b3e29 .cell execution_count=19}\n``` {.python .cell-code}\ntf.download_data(\n domain=\"macro_predictors\",\n dataset=\"monthly\",\n start_date=start_date, \n end_date=end_date\n)\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. As an illustration, we use the `tidyfinance` package to fetch consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS) key.\\index{Data!FRED}\\index{Data!CPI}\n\n::: {#3d801a71 .cell execution_count=20}\n``` {.python .cell-code}\nseries = \"CPIAUCNS\"\nurl = f\"https://fred.stlouisfed.org/graph/fredgraph.csv?id={series}\"\n```\n:::\n\n\nWe can then use the `requests` module to request the CSV, extract the data from the response body, and convert the columns to a tidy format:\n\n::: {#80fe0fbe .cell execution_count=21}\n``` {.python .cell-code}\nresp = requests.get(url)\nresp_csv = pd.io.common.StringIO(resp.text)\n\ncpi_monthly = (pd.read_csv(resp_csv)\n .assign(\n date=lambda x: pd.to_datetime(x[\"observation_date\"]),\n value=lambda x: pd.to_numeric(\n x[series], errors=\"coerce\"\n ),\n series=series,\n )\n .get([\"date\", \"series\", \"value\"])\n .query(\"date >= @start_date & date <= @end_date\")\n .assign(cpi=lambda x: x[\"value\"] / x[\"value\"].iloc[-1])\n)\n```\n:::\n\n\nThe last line sets the current (latest) price level as the reference price level.\n\nThe `tidyfinance` package can, of course, also fetch the same index data and many more data series:\n\n::: {#b94f5bd0 .cell execution_count=22}\n``` {.python .cell-code}\ntf.download_data(\n domain=\"fred\",\n series = \"CPIAUCNS\",\n start_date = start_date,\n end_date = end_date\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=50}\n```{=html}\n
|     | date       | series   | value   |
|-----|------------|----------|---------|
| 0   | 1960-01-01 | CPIAUCNS | 29.300  |
| 1   | 1960-02-01 | CPIAUCNS | 29.400  |
| 2   | 1960-03-01 | CPIAUCNS | 29.400  |
| 3   | 1960-04-01 | CPIAUCNS | 29.500  |
| 4   | 1960-05-01 | CPIAUCNS | 29.500  |
| ... | ...        | ...      | ...     |
| 775 | 2024-08-01 | CPIAUCNS | 314.796 |
| 776 | 2024-09-01 | CPIAUCNS | 315.301 |
| 777 | 2024-10-01 | CPIAUCNS | 315.664 |
| 778 | 2024-11-01 | CPIAUCNS | 315.493 |
| 779 | 2024-12-01 | CPIAUCNS | 315.605 |

780 rows × 3 columns
\n```\n:::\n:::\n\n\nTo download other time series, we just have to look them up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. If your desired time series is not supported through `tidyfinance`, we recommend working with the `fredapi` package. Note that you need to get an API key to use its functionality. We refer to the package documentation for details.\n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our Python session, let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code. \n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://SQLite.org/) database, which is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases.\\index{Database!SQLite}\n\n::: {#b825745c .cell execution_count=23}\n``` {.python .cell-code}\nimport sqlite3\n```\n:::\n\n\nAn SQLite database is easily created - the code below is really all there is. You do not need any external software. Note that SQLite has no native date type, so date columns are stored as text or integers, which is why we parse them explicitly (via `parse_dates`) when reading tables back into pandas.\\index{Database!Creation} We will use the file `tidy_finance_python.sqlite`, located in the data subfolder, to retrieve data for all subsequent chapters. The initial part of the code ensures that the directory is created if it does not already exist.\n\n::: {#ac03dbae .cell execution_count=24}\n``` {.python .cell-code}\nimport os\n\nif not os.path.exists(\"data\"):\n os.makedirs(\"data\")\n \ntidy_finance = sqlite3.connect(database=\"data/tidy_finance_python.sqlite\")\n```\n:::\n\n\nNext, we create a database table with the monthly Fama-French factor data. We do so with the `pandas` function `to_sql()`, which copies the data to our SQLite database.\n\n::: {#244fccf8 .cell execution_count=25}\n``` {.python .cell-code}\n(factors_ff3_monthly\n .to_sql(name=\"factors_ff3_monthly\", \n con=tidy_finance, \n if_exists=\"replace\",\n index=False)\n)\n```\n:::\n\n\nNow, if we want to have the whole table in memory, we need to call `pd.read_sql_query()` with the corresponding query. You will see that we regularly load the data into the memory in the next chapters.\\index{Database!Read}\n\n::: {#dcab3728 .cell execution_count=26}\n``` {.python .cell-code}\npd.read_sql_query(\n sql=\"SELECT date, risk_free FROM factors_ff3_monthly\",\n con=tidy_finance,\n parse_dates={\"date\"}\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=54}\n```{=html}\n
|     | date       | risk_free |
|-----|------------|-----------|
| 0   | 1960-01-01 | 0.0033    |
| 1   | 1960-02-01 | 0.0029    |
| 2   | 1960-03-01 | 0.0035    |
| 3   | 1960-04-01 | 0.0019    |
| 4   | 1960-05-01 | 0.0027    |
| ... | ...        | ...       |
| 775 | 2024-08-01 | 0.0048    |
| 776 | 2024-09-01 | 0.0040    |
| 777 | 2024-10-01 | 0.0039    |
| 778 | 2024-11-01 | 0.0040    |
| 779 | 2024-12-01 | 0.0037    |

780 rows × 2 columns
\n```\n:::\n:::\n\n\nThe last couple of code chunks are really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages. \n\nBefore we move on to the next data source, let us also store the other six tables in our new SQLite database. \n\n::: {#adf2106d .cell execution_count=27}\n``` {.python .cell-code}\ndata_dict = {\n \"factors_ff5_monthly\": factors_ff5_monthly,\n \"factors_ff3_daily\": factors_ff3_daily,\n \"industries_ff_monthly\": industries_ff_monthly, \n \"factors_q_monthly\": factors_q_monthly,\n \"macro_predictors\": macro_predictors,\n \"cpi_monthly\": cpi_monthly\n}\n\nfor key, value in data_dict.items():\n value.to_sql(name=key,\n con=tidy_finance, \n if_exists=\"replace\",\n index=False)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow two steps: (i) Establish the connection to the SQLite-database and (ii) execute the query to fetch the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n::: {#6487b384 .cell message='false' results='false' execution_count=28}\n``` {.python .cell-code}\nimport pandas as pd\nimport sqlite3\n\ntidy_finance = sqlite3.connect(database=\"data/tidy_finance_python.sqlite\")\n\nfactors_q_monthly = pd.read_sql_query(\n sql=\"SELECT * FROM factors_q_monthly\",\n con=tidy_finance,\n parse_dates={\"date\"}\n)\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command in the database, which rebuilds the database and frees up unused space. You can execute the command in the database using the `execute()` function. \n\n::: {#28341992 .cell execution_count=29}\n``` {.python .cell-code}\ntidy_finance.execute(\"VACUUM\")\n```\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read about in [this tutorial.](https://SQLite.org/docs/sql/statements/vacuum.html) \\index{Database!Cleaning}\n\n## Key Takeaways\n\n- Importing Fama-French factors, q-factors, macroeconomic indicators, and CPI data is simplified through API calls, CSV parsing, and web scraping techniques.\n- The `tidyfinance` Python package offers pre-processed access to financial datasets, reducing manual data cleaning and saving valuable time.\n- Creating a centralized SQLite database helps manage and organize data efficiently across projects, while maintaining reproducibility.\n- Structured database storage supports scalable data access, which is essential for long-term academic projects and collaborative work in finance.\n\n## Exercises\n\n1. Download the monthly Fama-French factors manually from [Kenneth French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `pd.read_csv()`. Validate that you get the same data as via the `tf.download_data()` package. \n1. Download the daily Fama-French 5 factors using the `tf.download_data()` function. 
After the successful download and conversion to the column format that we used above, compare the `risk_free`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find. \n\n", "supporting": [ "accessing-and-managing-financial-data_files" ], "filters": [], "includes": { "include-in-header": [ - "\n\n\n" + "\n\n\n" ] } } diff --git a/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json b/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json index 6a81122a..646eed21 100644 --- a/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json +++ b/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "58cf68fe24e6b0c8b028f74da0c95bf6", + "hash": "20e6e3325026e6ce999b7fe854cd3f10", "result": { "engine": "knitr", - "markdown": "---\ntitle: Accessing and Managing Financial Data\naliases: \n - ../accessing-and-managing-financial-data.html\nmetadata:\n pagetitle: Accessing and Managing Financial Data with R\n description-meta: Download and organize open-source financial data using the programming language R. \n---\n\n::: callout-note\nYou are reading **Tidy Finance with R**. You can find the equivalent chapter for the sibling **Tidy Finance with Python** [here](../python/accessing-and-managing-financial-data.qmd).\n:::\n\nIn this chapter, we suggest a way to organize your financial data. Everybody who has experience with data is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome in the case of using different data formats, both across different projects and across different programming languages. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages.\n\nThis chapter shows how to import different open source data sets. Specifically, our data comes from the application programming interface (API) of Yahoo Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series that can be scraped directly from a website.\\index{API}\\index{Web scraping} We show how to process these raw data, as well as how to take a shortcut using the `tidyfinance` package, which provides a consistent interface to tidy financial data. We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the global R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(tidyfinance)\nlibrary(scales)\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. 
Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2024-12-31\")\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. Fortunately, there is a neat package by [Nelson Areal](https://github.com/nareal/frenchdata/) that allows us to access the data easily: the `frenchdata` package provides functions to download and read data sets from [Prof. Kenneth French finance data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) [@frenchdata].\\index{Data!Fama-French factors} \\index{Kenneth French homepage}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(frenchdata)\n```\n:::\n\n\nWe can use the `download_french_data()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market `mkt_excess`, size `smb` and value `hml` alongside the risk-free rates `rf`. Note that we have to do some manual work to correctly parse all the columns and scale them appropriately, as the raw Fama-French data comes in a very unpractical data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to `frenchdata`.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_raw <- download_french_data(\"Fama/French 3 Factors\")\nfactors_ff3_monthly <- factors_ff3_monthly_raw$subsets$data[[1]] |>\n mutate(\n date = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability `rmw` and investment `cma` factors. We demonstrate how the monthly factors are constructed in the chapter [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff5_monthly_raw <- download_french_data(\"Fama/French 5 Factors (2x3)\")\n\nfactors_ff5_monthly <- factors_ff5_monthly_raw$subsets$data[[1]] |>\n mutate(\n date = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML, RMW, CMA), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily_raw <- download_french_data(\"Fama/French 3 Factors [Daily]\")\n\nfactors_ff3_daily <- factors_ff3_daily_raw$subsets$data[[1]] |>\n mutate(\n date = ymd(date),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) 
/ 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIn a subsequent chapter, we also use the 10 monthly industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindustries_ff_monthly_raw <- download_french_data(\"10 Industry Portfolios\")\n\nindustries_ff_monthly <- industries_ff_monthly_raw$subsets$data[[1]] |>\n mutate(date = floor_date(ymd(str_c(date, \"01\")), \"month\")) |>\n mutate(across(where(is.numeric), ~ . / 100)) |>\n select(date, everything()) |>\n filter(date >= start_date & date <= end_date) |> \n rename_with(str_to_lower)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. You should check out the other sets by calling `get_french_data_list()`. \n\nTo automatically download and process Fama-French data, you can also use the `tidyfinance` package with `type = \"factors_ff_3_monthly\"` or similar, e.g.:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"factors_ff_3_monthly\", \n start_date = start_date, \n end_date = end_date\n)\n```\n:::\n\n\nThe `tidyfinance` package implements the processing steps as above and returns the same cleaned data frame. The list of supported Fama-French data types can be called as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist_supported_types(domain = \"Fama-French\")\n```\n:::\n\n\n## q-Factors\n\nIn recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further information. The *q* factors can be downloaded directly from the authors' homepage from within `read_csv()`.\\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we rename the columns with the \"R\\_\"-prescript using regular expressions and write all column names in lowercase. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_q_monthly_link <-\n \"https://global-q.org/uploads/1/2/2/6/122679606/q5_factors_monthly_2023.csv\"\n\nfactors_q_monthly <- read_csv(factors_q_monthly_link) |>\n mutate(date = ymd(str_c(year, month, \"01\", sep = \"-\"))) |>\n rename_with(~str_remove(., \"R_\")) |>\n rename_with(str_to_lower) |>\n mutate(across(-date, ~. / 100)) |>\n select(date, risk_free = f, mkt_excess = mkt, everything()) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nAgain, you can use the `tidyfinance` package for a shortcut:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"factors_q5_monthly\", \n start_date = start_date, \n end_date = end_date\n)\n```\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. 
The authors host the data updated to 2022 on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) The data is an XLSX-file stored on a public Google drive location and we directly export a CSV file.\\index{Data!Macro predictors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsheet_id <- \"1bM7vCWd3WOt95Sf9qjLPZjoiafgF_8EG\"\nsheet_name <- \"Monthly\"\nmacro_predictors_url <- paste0(\n \"https://docs.google.com/spreadsheets/d/\", sheet_id,\n \"/gviz/tq?tqx=out:csv&sheet=\", sheet_name\n)\nmacro_predictors_raw <- read_csv(macro_predictors_url)\n```\n:::\n\n\nNext, we transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006].\n2. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978].\n3. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988].\n4. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998].\n5. Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n6. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997].\n7. Net equity expansion (`ntis`), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n8. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis [@Campbell1987].\n9. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n10. Long-term rate of returns (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n11. Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n12. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989].\n13. 
Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website](https://sites.google.com/view/agoyal145).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors <- macro_predictors_raw |>\n mutate(date = ym(yyyymm)) |>\n mutate(across(where(is.character), as.numeric)) |>\n mutate(\n IndexDiv = Index + D12,\n logret = log(IndexDiv) - log(lag(IndexDiv)),\n Rfree = log(Rfree + 1),\n rp_div = lead(logret - Rfree, 1), # Future excess market return\n dp = log(D12) - log(Index), # Dividend Price ratio\n dy = log(D12) - log(lag(Index)), # Dividend yield\n ep = log(E12) - log(Index), # Earnings price ratio\n de = log(D12) - log(E12), # Dividend payout ratio\n tms = lty - tbl, # Term spread\n dfy = BAA - AAA # Default yield spread\n ) |>\n select(\n date, rp_div, dp, dy, ep, de, svar,\n bm = `b/m`, ntis, tbl, lty, ltr,\n tms, dfy, infl\n ) |>\n filter(date >= start_date & date <= end_date) |>\n drop_na()\n```\n:::\n\n\nTo get the equivalent data through `tidyfinance`, you can call:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"macro_predictors_monthly\",\n start_date = start_date,\n end_date = end_date\n)\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. The data can be downloaded directly from FRED by constructing the appropriate URL. For instance, let us consider the consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS):\\index{Data!FRED}\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseries <- \"CPIAUCNS\"\ncpi_url <- paste0(\n \"https://fred.stlouisfed.org/graph/fredgraph.csv?id=\", series\n)\n```\n:::\n\n\nWe can then use the `httr2` [@httr2] package to request the CSV, extract the data from the response body, and convert the columns to a tidy format:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(httr2)\n\ncpi_daily <- request(cpi_url) |>\n req_perform() |>\n resp_body_string() |>\n read_csv() |>\n mutate(\n date = as.Date(observation_date),\n value = as.numeric(.data[[series]]),\n series = series,\n .keep = \"none\"\n )\n```\n:::\n\n\nWe convert the daily CPI data to monthly because we use the latter in later chapters. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ncpi_monthly <- cpi_daily |>\n mutate(\n date = floor_date(date, \"month\"),\n cpi = value / value[date == max(date)],\n .keep = \"none\"\n )\n```\n:::\n\n\nThe `tidyfinance` package can, of course, also fetch the same daily data and many more data series:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"fred\",\n series = \"CPIAUCNS\",\n start_date = start_date,\n end_date = end_date\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 0 × 3\n# ℹ 3 variables: date , value , series \n```\n\n\n:::\n:::\n\n\nTo download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. If your desired time series is not supported through `tidyfinance`, we recommend working with the `fredr` package [@fredr]. 
Note that you need to get an API key to use its functionality. We refer to the package documentation for details.\n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our R session let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code.\n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://www.sqlite.org/index.html) database, which is the C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases and heavily inspired the `dplyr` functions. We refer to [this tutorial](https://www.w3schools.com/sql/sql_intro.asp) for more information on SQL.\\index{Database!SQLite}\n\nThere are two packages that make working with SQLite in R very simple: `RSQLite` [@RSQLite] embeds the SQLite database engine in R, and `dbplyr` [@dbplyr] is the database back-end for `dplyr`. These packages allow to set up a database to remotely store tables and use these remote database tables as if they are in-memory data frames by automatically converting `dplyr` into SQL. Check out the [`RSQLite`](https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html) and [`dbplyr`](https://db.rstudio.com/databases/sqlite/) vignettes for more information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nAn SQLite database is easily created - the code below is really all there is. You do not need any external software. Note that we use the `extended_types = TRUE` option to enable date types when storing and fetching data. Otherwise, date columns are stored and retrieved as integers.\\index{Database!Creation} We will use the file `tidy_finance_r.sqlite`, located in the data subfolder, to retrieve data for all subsequent chapters. The initial part of the code ensures that the directory is created if it does not already exist.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (!dir.exists(\"data\")) {\n dir.create(\"data\")\n}\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\nNext, we create a remote table with the monthly Fama-French factor data. We do so with the function `dbWriteTable()`, which copies the data to our SQLite-database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n tidy_finance,\n \"factors_ff3_monthly\",\n value = factors_ff3_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nWe can use the remote table as an in-memory data frame by building a connection via `tbl()`.\\index{Database!Remote connection}\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db <- tbl(tidy_finance, \"factors_ff3_monthly\")\n```\n:::\n\n\nAll `dplyr` calls are evaluated lazily, i.e., the data is not in our R session's memory, and the database does most of the work. You can see that by noticing that the output below does not show the number of rows. 
In fact, the following code chunk only fetches the top 10 rows from the database for printing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(date, rf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Source: SQL [?? x 2]\n# Database: sqlite 3.47.1 [data/tidy_finance_r.sqlite]\n date rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ more rows\n```\n\n\n:::\n:::\n\n\nIf we want to have the whole table in memory, we need to `collect()` it. You will see that we regularly load the data into the memory in the next chapters.\\index{Database!Fetch}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(date, rf) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 780 × 2\n date rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ 775 more rows\n```\n\n\n:::\n:::\n\n\nThe last couple of code chunks is really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages.\n\nBefore we move on to the next data source, let us also store the other five tables in our new SQLite database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n tidy_finance,\n \"factors_ff5_monthly\",\n value = factors_ff5_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"factors_ff3_daily\",\n value = factors_ff3_daily,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"industries_ff_monthly\",\n value = industries_ff_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"factors_q_monthly\",\n value = factors_q_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"macro_predictors\",\n value = macro_predictors,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"cpi_monthly\",\n value = cpi_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow three steps: (i) Establish the connection to the SQLite database, (ii) call the table you want to extract, and (iii) collect the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\")\nfactors_q_monthly <- factors_q_monthly |> collect()\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command in the database, which rebuilds the database and frees up unused space. 
You can execute the command in the database using the `dbSendQuery()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- dbSendQuery(tidy_finance, \"VACUUM\")\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n SQL VACUUM\n ROWS Fetched: 0 [complete]\n Changed: 0\n```\n\n\n:::\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read about in [this tutorial.](https://www.sqlitetutorial.net/sqlite-vacuum/) \\index{Database!Cleaning}\n\nWe store the result of the above query in `res` because the database keeps the result set open. To close open results and avoid warnings going forward, we can use `dbClearResult()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbClearResult(res)\n```\n:::\n\n\nApart from cleaning up, you might be interested in listing all the tables that are currently in your database. You can do this via the `dbListTables()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListTables(tidy_finance)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cpi_monthly\" \"factors_ff3_daily\" \n[3] \"factors_ff3_monthly\" \"factors_ff5_monthly\" \n[5] \"factors_q_monthly\" \"industries_ff_monthly\"\n[7] \"macro_predictors\" \n```\n\n\n:::\n:::\n\n\nThis function comes in handy if you are unsure about the correct naming of the tables in your database.\n\n## Key Takeaways\n\n- Importing Fama-French factors, q-factors, macroeconomic indicators, and CPI data is simplified through API calls, CSV parsing, and web scraping techniques.\n- The `tidyfinance` R package offers pre-processed access to financial datasets, reducing manual data cleaning and saving valuable time.\n- Creating a centralized SQLite database helps manage and organize data efficiently across projects, while maintaining reproducibility.\n- Structured database storage supports scalable data access, which is essential for long-term academic projects and collaborative work in finance.\n\n## Exercises\n\n1. Download the monthly Fama-French factors manually from [Ken French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `read_csv()`. Validate that you get the same data as via the `frenchdata` package.\n2. Download the daily Fama-French 5 factors using the `frenchdata` package. Use `get_french_data_list()` to find the corresponding table name. After the successful download and conversion to the column format that we used above, compare the `rf`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find.\n", + "markdown": "---\ntitle: Accessing and Managing Financial Data\naliases: \n - ../accessing-and-managing-financial-data.html\nmetadata:\n pagetitle: Accessing and Managing Financial Data with R\n description-meta: Download and organize open-source financial data using the programming language R. \n---\n\n::: callout-note\nYou are reading **Tidy Finance with R**. You can find the equivalent chapter for the sibling **Tidy Finance with Python** [here](../python/accessing-and-managing-financial-data.qmd).\n:::\n\nIn this chapter, we suggest a way to organize your financial data. Everybody who has experience with data is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome in the case of using different data formats, both across different projects and across different programming languages. 
Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages.\n\nThis chapter shows how to import different open source data sets. Specifically, our data comes from the application programming interface (API) of Yahoo Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series that can be scraped directly from a website.\\index{API}\\index{Web scraping} We show how to process these raw data, as well as how to take a shortcut using the `tidyfinance` package, which provides a consistent interface to tidy financial data. We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the global R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(tidyfinance)\nlibrary(scales)\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2024-12-31\")\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. Fortunately, there is a neat package by [Nelson Areal](https://github.com/nareal/frenchdata/) that allows us to access the data easily: the `frenchdata` package provides functions to download and read data sets from [Prof. Kenneth French finance data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) [@frenchdata].\\index{Data!Fama-French factors} \\index{Kenneth French homepage}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(frenchdata)\n```\n:::\n\n\nWe can use the `download_french_data()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market `mkt_excess`, size `smb` and value `hml` alongside the risk-free rates `rf`. Note that we have to do some manual work to correctly parse all the columns and scale them appropriately, as the raw Fama-French data comes in a very unpractical data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to `frenchdata`.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_raw <- download_french_data(\"Fama/French 3 Factors\")\nfactors_ff3_monthly <- factors_ff3_monthly_raw$subsets$data[[1]] |>\n mutate(\n date = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) 
/ 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability `rmw` and investment `cma` factors. We demonstrate how the monthly factors are constructed in the chapter [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff5_monthly_raw <- download_french_data(\"Fama/French 5 Factors (2x3)\")\n\nfactors_ff5_monthly <- factors_ff5_monthly_raw$subsets$data[[1]] |>\n mutate(\n date = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML, RMW, CMA), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily_raw <- download_french_data(\"Fama/French 3 Factors [Daily]\")\n\nfactors_ff3_daily <- factors_ff3_daily_raw$subsets$data[[1]] |>\n mutate(\n date = ymd(date),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIn a subsequent chapter, we also use the 10 monthly industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindustries_ff_monthly_raw <- download_french_data(\"10 Industry Portfolios\")\n\nindustries_ff_monthly <- industries_ff_monthly_raw$subsets$data[[1]] |>\n mutate(date = floor_date(ymd(str_c(date, \"01\")), \"month\")) |>\n mutate(across(where(is.numeric), ~ . / 100)) |>\n select(date, everything()) |>\n filter(date >= start_date & date <= end_date) |> \n rename_with(str_to_lower)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. You should check out the other sets by calling `get_french_data_list()`. \n\nTo automatically download and process Fama-French data, you can also use the `tidyfinance` package with `type = \"factors_ff_3_monthly\"` or similar, e.g.:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"factors_ff_3_monthly\", \n start_date = start_date, \n end_date = end_date\n)\n```\n:::\n\n\nThe `tidyfinance` package implements the processing steps as above and returns the same cleaned data frame. The list of supported Fama-French data types can be called as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist_supported_types(domain = \"Fama-French\")\n```\n:::\n\n\n## q-Factors\n\nIn recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further information. The *q* factors can be downloaded directly from the authors' homepage from within `read_csv()`.\\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we rename the columns with the \"R\\_\"-prescript using regular expressions and write all column names in lowercase. 
You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_q_monthly_link <-\n \"https://global-q.org/uploads/1/2/2/6/122679606/q5_factors_monthly_2023.csv\"\n\nfactors_q_monthly <- read_csv(factors_q_monthly_link) |>\n mutate(date = ymd(str_c(year, month, \"01\", sep = \"-\"))) |>\n rename_with(~str_remove(., \"R_\")) |>\n rename_with(str_to_lower) |>\n mutate(across(-date, ~. / 100)) |>\n select(date, risk_free = f, mkt_excess = mkt, everything()) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nAgain, you can use the `tidyfinance` package for a shortcut:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"factors_q5_monthly\", \n start_date = start_date, \n end_date = end_date\n)\n```\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data updated to 2022 on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) The data is an XLSX-file stored on a public Google drive location and we directly export a CSV file.\\index{Data!Macro predictors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsheet_id <- \"1bM7vCWd3WOt95Sf9qjLPZjoiafgF_8EG\"\nsheet_name <- \"Monthly\"\nmacro_predictors_url <- paste0(\n \"https://docs.google.com/spreadsheets/d/\", sheet_id,\n \"/gviz/tq?tqx=out:csv&sheet=\", sheet_name\n)\nmacro_predictors_raw <- read_csv(macro_predictors_url)\n```\n:::\n\n\nNext, we transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006].\n2. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978].\n3. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988].\n4. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998].\n5. Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n6. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997].\n7. Net equity expansion (`ntis`), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n8. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis [@Campbell1987].\n9. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n10. Long-term rate of returns (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n11. 
Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n12. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989].\n13. Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website](https://sites.google.com/view/agoyal145).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors <- macro_predictors_raw |>\n mutate(date = ym(yyyymm)) |>\n mutate(across(where(is.character), as.numeric)) |>\n mutate(\n IndexDiv = Index + D12,\n logret = log(IndexDiv) - log(lag(IndexDiv)),\n Rfree = log(Rfree + 1),\n rp_div = lead(logret - Rfree, 1), # Future excess market return\n dp = log(D12) - log(Index), # Dividend Price ratio\n dy = log(D12) - log(lag(Index)), # Dividend yield\n ep = log(E12) - log(Index), # Earnings price ratio\n de = log(D12) - log(E12), # Dividend payout ratio\n tms = lty - tbl, # Term spread\n dfy = BAA - AAA # Default yield spread\n ) |>\n select(\n date, rp_div, dp, dy, ep, de, svar,\n bm = `b/m`, ntis, tbl, lty, ltr,\n tms, dfy, infl\n ) |>\n filter(date >= start_date & date <= end_date) |>\n drop_na()\n```\n:::\n\n\nTo get the equivalent data through `tidyfinance`, you can call:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"macro_predictors_monthly\",\n start_date = start_date,\n end_date = end_date\n)\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. The data can be downloaded directly from FRED by constructing the appropriate URL. For instance, let us consider the consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS):\\index{Data!FRED}\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseries <- \"CPIAUCNS\"\ncpi_url <- paste0(\n \"https://fred.stlouisfed.org/graph/fredgraph.csv?id=\", series\n)\n```\n:::\n\n\nWe can then use the `httr2` [@httr2] package to request the CSV, extract the data from the response body, and convert the columns to a tidy format:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(httr2)\n\nresp <- request(cpi_url) |> \n req_perform()\nresp_csv <- resp |> \n resp_body_string() \n\ncpi_monthly <- resp_csv |> \n read_csv() |>\n mutate(\n date = as.Date(observation_date),\n value = as.numeric(.data[[series]]),\n series = series,\n .keep = \"none\"\n ) |>\n filter(date >= start_date & date <= end_date) |> \n mutate(\n cpi = value / value[date == max(date)]\n )\n```\n:::\n\n\nThe last line sets the current (latest) price level as the reference price level.\n\nThe `tidyfinance` package can, of course, also fetch the same index data and many more data series:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"fred\",\n series = \"CPIAUCNS\",\n start_date = start_date,\n end_date = end_date\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 0 × 3\n# ℹ 3 variables: date , value , series \n```\n\n\n:::\n:::\n\n\nTo download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. 
For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. If your desired time series is not supported through `tidyfinance`, we recommend working with the `fredr` package [@fredr]. Note that you need to get an API key to use its functionality. We refer to the package documentation for details.\n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our R session let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code.\n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://www.sqlite.org/index.html) database, which is the C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases and heavily inspired the `dplyr` functions. We refer to [this tutorial](https://www.w3schools.com/sql/sql_intro.asp) for more information on SQL.\\index{Database!SQLite}\n\nThere are two packages that make working with SQLite in R very simple: `RSQLite` [@RSQLite] embeds the SQLite database engine in R, and `dbplyr` [@dbplyr] is the database back-end for `dplyr`. These packages allow to set up a database to remotely store tables and use these remote database tables as if they are in-memory data frames by automatically converting `dplyr` into SQL. Check out the [`RSQLite`](https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html) and [`dbplyr`](https://db.rstudio.com/databases/sqlite/) vignettes for more information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nAn SQLite database is easily created - the code below is really all there is. You do not need any external software. Note that we use the `extended_types = TRUE` option to enable date types when storing and fetching data. Otherwise, date columns are stored and retrieved as integers.\\index{Database!Creation} We will use the file `tidy_finance_r.sqlite`, located in the data subfolder, to retrieve data for all subsequent chapters. The initial part of the code ensures that the directory is created if it does not already exist.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (!dir.exists(\"data\")) {\n dir.create(\"data\")\n}\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\nNext, we create a remote table with the monthly Fama-French factor data. We do so with the function `dbWriteTable()`, which copies the data to our SQLite-database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n tidy_finance,\n \"factors_ff3_monthly\",\n value = factors_ff3_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nWe can use the remote table as an in-memory data frame by building a connection via `tbl()`.\\index{Database!Remote connection}\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db <- tbl(tidy_finance, \"factors_ff3_monthly\")\n```\n:::\n\n\nAll `dplyr` calls are evaluated lazily, i.e., the data is not in our R session's memory, and the database does most of the work. 
You can see that by noticing that the output below does not show the number of rows. In fact, the following code chunk only fetches the top 10 rows from the database for printing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(date, rf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Source: SQL [?? x 2]\n# Database: sqlite 3.47.1 [data/tidy_finance_r.sqlite]\n date rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ more rows\n```\n\n\n:::\n:::\n\n\nIf we want to have the whole table in memory, we need to `collect()` it. You will see that we regularly load the data into the memory in the next chapters.\\index{Database!Fetch}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(date, rf) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 780 × 2\n date rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ 775 more rows\n```\n\n\n:::\n:::\n\n\nThe last couple of code chunks is really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages.\n\nBefore we move on to the next data source, let us also store the other five tables in our new SQLite database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n tidy_finance,\n \"factors_ff5_monthly\",\n value = factors_ff5_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"factors_ff3_daily\",\n value = factors_ff3_daily,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"industries_ff_monthly\",\n value = industries_ff_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"factors_q_monthly\",\n value = factors_q_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"macro_predictors\",\n value = macro_predictors,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"cpi_monthly\",\n value = cpi_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow three steps: (i) Establish the connection to the SQLite database, (ii) call the table you want to extract, and (iii) collect the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\")\nfactors_q_monthly <- factors_q_monthly |> collect()\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command in the database, which rebuilds the database and frees up unused space. 
You can execute the command in the database using the `dbSendQuery()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- dbSendQuery(tidy_finance, \"VACUUM\")\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n SQL VACUUM\n ROWS Fetched: 0 [complete]\n Changed: 0\n```\n\n\n:::\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read about in [this tutorial.](https://www.sqlitetutorial.net/sqlite-vacuum/) \\index{Database!Cleaning}\n\nWe store the result of the above query in `res` because the database keeps the result set open. To close open results and avoid warnings going forward, we can use `dbClearResult()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbClearResult(res)\n```\n:::\n\n\nApart from cleaning up, you might be interested in listing all the tables that are currently in your database. You can do this via the `dbListTables()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListTables(tidy_finance)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"beta\" \"compustat\" \n [3] \"cpi_monthly\" \"crsp_daily\" \n [5] \"crsp_monthly\" \"factors_ff3_daily\" \n [7] \"factors_ff3_monthly\" \"factors_ff5_monthly\" \n [9] \"factors_q_monthly\" \"fisd\" \n[11] \"industries_ff_monthly\" \"macro_predictors\" \n[13] \"trace_enhanced\" \n```\n\n\n:::\n:::\n\n\nThis function comes in handy if you are unsure about the correct naming of the tables in your database.\n\n## Key Takeaways\n\n- Importing Fama-French factors, q-factors, macroeconomic indicators, and CPI data is simplified through API calls, CSV parsing, and web scraping techniques.\n- The `tidyfinance` R package offers pre-processed access to financial datasets, reducing manual data cleaning and saving valuable time.\n- Creating a centralized SQLite database helps manage and organize data efficiently across projects, while maintaining reproducibility.\n- Structured database storage supports scalable data access, which is essential for long-term academic projects and collaborative work in finance.\n\n## Exercises\n\n1. Download the monthly Fama-French factors manually from [Ken French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `read_csv()`. Validate that you get the same data as via the `frenchdata` package.\n2. Download the daily Fama-French 5 factors using the `frenchdata` package. Use `get_french_data_list()` to find the corresponding table name. After the successful download and conversion to the column format that we used above, compare the `rf`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find.\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/docs/accessing-and-managing-financial-data.html b/docs/accessing-and-managing-financial-data.html index 0aa4e119..2c36535a 100644 --- a/docs/accessing-and-managing-financial-data.html +++ b/docs/accessing-and-managing-financial-data.html @@ -6,6 +6,10 @@ var hash = window.location.hash.startsWith('#') ? 
window.location.hash.slice(1) : window.location.hash; var redirect = redirects[hash] || redirects[""] || "/"; window.document.title = 'Redirect to ' + redirect; + if (!redirects[hash]) { + redirect = redirect + window.location.hash; + } + redirect = redirect + window.location.search; window.location.replace(redirect); diff --git a/docs/python/accessing-and-managing-financial-data.html b/docs/python/accessing-and-managing-financial-data.html index 681f17c8..8e787a63 100644 --- a/docs/python/accessing-and-managing-financial-data.html +++ b/docs/python/accessing-and-managing-financial-data.html @@ -2,7 +2,7 @@ - + @@ -79,7 +79,7 @@ } - + @@ -92,14 +92,15 @@ + - + - + - + @@ -173,7 +174,7 @@ var macros = []; for (var i = 0; i < mathElements.length; i++) { var texText = mathElements[i].firstChild; - if (mathElements[i].tagName == "SPAN") { + if (mathElements[i].tagName == "SPAN" && texText && texText.data) { window.katex.render(texText.data, mathElements[i], { displayMode: mathElements[i].classList.contains('display'), throwOnError: false, @@ -204,7 +205,8 @@

Moreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.

-
-
start_date = "1960-01-01"
-end_date = "2024-12-31"
+
+
start_date = "1960-01-01"
+end_date = "2024-12-31"

Fama-French Data

-

We start by downloading some famous Fama-French factors (e.g., Fama and French 1993) and portfolio returns commonly used in empirical asset pricing. Fortunately, the pandas-datareader package provides a simple interface to read data from Kenneth French’s Data Library.

-
-
import pandas_datareader as pdr
+

We start by downloading some famous Fama-French factors (e.g., Fama and French 1993) and portfolio returns commonly used in empirical asset pricing. The data are freely available from Kenneth French’s Data Library, but the raw files come in a rather idiosyncratic format. If you access the data via the website, the manual workflow looks like this:

+
    +
  1. Go to the website
  2. Find the right dataset
  3. Download a ZIP file
  4. Extract the CSV inside
  5. Select the right data table from the file and import the table into Python
  6. Clean the dates, scale the returns, fix column names, handle missing values, etc.
+

Doing this once is fine; doing it repeatedly across projects is exactly the type of boilerplate that’s easy to mess up and annoying to maintain. It is therefore natural to automate these steps in Python.

+
+
+

From manual steps to a download script

+

A minimal download script mirrors the manual steps one by one. For example, to fetch a Fama–French dataset you first construct the URL:

+
+
+# Imports used throughout this download script
+import io
+import re
+import zipfile
+
+import requests
+
dataset = "F-F_Research_Data_Factors"
+base_url = "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/"
+url = f"{base_url}{dataset}_CSV.zip"
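Since the URL is built by plain string concatenation, you can sanity-check it immediately; for the three-factor set it resolves to:

print(url)
# http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip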
+
+

Next, you replace the browser download with an HTTP request and extract the ZIP in memory:

+
+
resp = requests.get(url)
+resp.raise_for_status()
+
+with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
+    file_name = zf.namelist()[0]  # Ken French ZIPs contain one file
+    raw_text = zf.read(file_name).decode("latin1")
+
+

The most important part of this chunk is the requests.get() call. This is the moment where we replace all the manual browser work (open the website, click download, save the file) with a single, reproducible line of code. Then, calling raise_for_status() ensures we stop immediately if the server returns an error (e.g., HTTP 404 or 500) instead of quietly processing a broken download. Once this succeeds, resp.content holds the raw bytes of the downloaded archive, which we can open in memory.
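If you prefer a friendlier failure mode, you can wrap the call and catch the exception explicitly; a minimal sketch, not part of the chapter's script:

import requests

try:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
except requests.HTTPError as err:
    raise SystemExit(f"Download failed for {url}: {err}")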

+

The raw file contains documentation text followed by the actual data table(s). To emulate scrolling down until the numbers start, you can split the file into blocks and keep the longest one, which contains the data table:

+
+
chunks = raw_text.split("\r\n\r\n")
+table_text = max(chunks, key=len)  # the data table is by far the longest block
+
+

Within this block, the first CSV header line starts at the first line beginning with a comma. We add a “Date” label for the index and pass everything to read_csv:

+
+
match = re.search(r"^\s*,", table_text, flags=re.M)
+start = match.start()
+csv_text = "Date" + table_text[start:]
+
+factors_ff_raw = pd.read_csv(io.StringIO(csv_text), index_col=0)
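To see what the regular expression does, you can try it on a tiny illustrative snippet of such a file (the header and values below are made up):

import re

sample = "Some documentation text\n,Mkt-RF,SMB,HML,RF\n192607,2.96,-2.56,-2.43,0.22"
m = re.search(r"^\s*,", sample, flags=re.M)
sample[m.start():].splitlines()[0]  # ',Mkt-RF,SMB,HML,RF'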
-

We can use the pdr.DataReader() function of the package to download monthly Fama-French factors. The set Fama/French 3 Factors contains the return time series of the market (mkt_excess), size (smb), and value (hml) factors alongside the risk-free rates (rf). Note that we have to do some manual work to parse all the columns correctly and scale them appropriately, as the raw Fama-French data comes in a unique data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French’s finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks topandas_datareader.

-
-
factors_ff3_monthly_raw = pdr.DataReader(
-  name="F-F_Research_Data_Factors",
-  data_source="famafrench", 
-  start=start_date, 
-  end=end_date)[0]
-
-factors_ff3_monthly = (factors_ff3_monthly_raw
-  .divide(100)
-  .reset_index(names="date")
-  .assign(date=lambda x: pd.to_datetime(x["date"].astype(str)))
-  .rename(str.lower, axis="columns")
-  .rename(columns={"mkt-rf": "mkt_excess"})
-)
+

At this point, the index still consists of integer date codes with different lengths depending on the frequency. We need a bit of logic to convert them into a proper DatetimeIndex:

+
+
s = factors_ff_raw.index.astype(str)
+
+if (s.str.len() == 8).all():  # daily: YYYYMMDD
+    dt = pd.to_datetime(s, format="%Y%m%d")
+elif (s.str.len() == 6).all():  # monthly: YYYYMM
+    dt = pd.to_datetime(s + "01", format="%Y%m%d")
+elif (s.str.len() == 4).all():  # annual: YYYY
+    dt = pd.to_datetime(s + "0101", format="%Y%m%d")
+    dt = dt.to_period("A-DEC").to_timestamp(how="end")  # DatetimeIndex has no .dt accessor
+else:
+    raise ValueError("Unknown date format in Fama–French index.")
+
+factors_ff_raw = factors_ff_raw.set_index(dt)
+factors_ff_raw.index.name = "date"
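To convince yourself that, for example, the monthly branch behaves as intended, you can run it on a couple of hypothetical codes:

import pandas as pd

codes = pd.Index(["196001", "202412"])  # hypothetical YYYYMM values
pd.to_datetime(codes + "01", format="%Y%m%d")
# DatetimeIndex(['1960-01-01', '2024-12-01'], dtype='datetime64[ns]', freq=None)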
+
+

Finally, we still have to clean the data:

+
    +
  • Convert returns from percent to decimal.
  • Standardize column names (e.g., all lowercase, Mkt-RF to mkt_excess, RF to risk_free).
  • Replace special missing-value codes (-99.99, -999) with actual missing values.
  • Filter the data by a start and end date.
+

Put together, the cleaning step could look like this:

+
+
# Restrict the sample to the requested date range
+if start_date:
+    factors_ff_raw = factors_ff_raw[factors_ff_raw.index >= pd.to_datetime(start_date)]
+if end_date:
+    factors_ff_raw = factors_ff_raw[factors_ff_raw.index <= pd.to_datetime(end_date)]
+
+factors_ff3_monthly = (factors_ff_raw
+    # Replace the missing-value codes *before* scaling; after .div(100)
+    # they would no longer equal -99.99 or -999 and the replace would miss them
+    .replace({"-99.99": pd.NA, -99.99: pd.NA, -999: pd.NA})
+    .div(100)
+    .reset_index(names="date")
+    .rename(columns=str.lower)
+    .rename(columns={"mkt-rf": "mkt_excess", "rf": "risk_free"})
+)
+factors_ff3_monthly
+
+
           date  mkt_excess     smb     hml  risk_free
0    1960-01-01     -0.0698  0.0212  0.0265     0.0033
1    1960-02-01      0.0116  0.0060 -0.0197     0.0029
2    1960-03-01     -0.0163 -0.0055 -0.0275     0.0035
3    1960-04-01     -0.0171  0.0022 -0.0214     0.0019
4    1960-05-01      0.0312  0.0129 -0.0373     0.0027
..          ...         ...     ...     ...        ...
775  2024-08-01      0.0160 -0.0349 -0.0110     0.0048
776  2024-09-01      0.0172 -0.0013 -0.0277     0.0040
777  2024-10-01     -0.0100 -0.0099  0.0086     0.0039
778  2024-11-01      0.0649  0.0446  0.0015     0.0040
779  2024-12-01     -0.0317 -0.0271 -0.0300     0.0037

780 rows × 5 columns

All of these steps are doable, but none of them are really about finance - they are just the technical scaffolding required before you can work with the actual factor returns. That’s where a dedicated helper or package becomes invaluable. The tidyfinance package performs this entire workflow under the hood: you request a Fama–French dataset and receive a clean, consistently formatted data table from Kenneth French’s Data Library. This avoids repetitive boilerplate, reduces errors, and lets you focus on modeling and analysis rather than on data plumbing.

+
+
+

Using tidyfinance instead of reimplementing the plumbing

+
+
import tidyfinance as tf
+
+

For example, we can use the tf.download_data() function of the package to download monthly Fama-French factors. The set Fama/French 3 Factors contains the return time series of the market (mkt_excess), size (smb), and value (hml) factors alongside the risk-free rates (risk_free). Note that the tf.download_data() function parses all the columns correctly and already scales them appropriately, as the raw Fama-French data comes in a unique data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French’s finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to the tidyfinance package.

+
+
factors_ff3_monthly = tf.download_data(
+  domain="famafrench",
+  dataset="F-F_Research_Data_Factors",
+  start_date=start_date,
+  end_date=end_date,
+)

We also download the set 5 Factors (2x3), which additionally includes the return time series of the profitability (rmw) and investment (cma) factors. We demonstrate how the monthly factors are constructed in Replicating Fama and French Factors.

-
-
factors_ff5_monthly_raw = pdr.DataReader(
-  name="F-F_Research_Data_5_Factors_2x3",
-  data_source="famafrench", 
-  start=start_date, 
-  end=end_date)[0]
-
-factors_ff5_monthly = (factors_ff5_monthly_raw
-  .divide(100)
-  .reset_index(names="date")
-  .assign(date=lambda x: pd.to_datetime(x["date"].astype(str)))
-  .rename(str.lower, axis="columns")
-  .rename(columns={"mkt-rf": "mkt_excess"})
-)
+
+
factors_ff5_monthly = tf.download_data(
+  domain="famafrench",
+  dataset="F-F_Research_Data_5_Factors_2x3",
+  start_date=start_date,
+  end_date=end_date,
+)

It is straightforward to download the corresponding daily Fama-French factors with the same function.

-
-
factors_ff3_daily_raw = pdr.DataReader(
-  name="F-F_Research_Data_Factors_daily",
-  data_source="famafrench", 
-  start=start_date, 
-  end=end_date)[0]
-
-factors_ff3_daily = (factors_ff3_daily_raw
-  .divide(100)
-  .reset_index(names="date")
-  .rename(str.lower, axis="columns")
-  .rename(columns={"mkt-rf": "mkt_excess"})
-)
+
+
factors_ff3_daily = tf.download_data(
+  domain="famafrench",
+  dataset="F-F_Research_Data_Factors_daily",
+  start_date=start_date,
+  end_date=end_date,
+)

In a subsequent chapter, we also use the monthly returns from ten industry portfolios, so let us fetch that data, too.

-
-
industries_ff_monthly_raw = pdr.DataReader(
-  name="10_Industry_Portfolios",
-  data_source="famafrench", 
-  start=start_date, 
-  end=end_date)[0]
-
-industries_ff_monthly = (industries_ff_monthly_raw
-  .divide(100)
-  .reset_index(names="date")
-  .assign(date=lambda x: pd.to_datetime(x["date"].astype(str)))
-  .rename(str.lower, axis="columns")
-)
-
-

It is worth taking a look at all available portfolio return time series from Kenneth French’s homepage. You should check out the other sets by calling pdr.famafrench.get_available_datasets().

-

To automatically download and process Fama-French data, you can also use the tidyfinance package with domain="factors_ff" and the corresponding dataset, e.g.:

-
-
tf.download_data(
-  domain="factors_ff",
-  dataset="F-F_Research_Data_Factors", 
-  start_date=start_date, 
-  end_date=end_date
-)
+
+
industries_ff_monthly = tf.download_data(
+  domain="famafrench",
+  dataset="10_Industry_Portfolios",
+  start_date=start_date,
+  end_date=end_date,
+)
-

The tidyfinance package implements the processing steps as above and returns the same cleaned data frame.

-
+

It is worth taking a look at all available portfolio return time series from Kenneth French’s homepage. You should check out the other sets by calling tf.get_available_famafrench_datasets().
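For example, you could skim the returned names for further portfolio sets (a sketch, assuming the function returns a plain list of dataset names):

datasets = tf.get_available_famafrench_datasets()
[name for name in datasets if "Portfolios" in name][:5]  # peek at a few portfolio sets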

q-Factors

In recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the Hou, Xue, and Zhang (2014) q-factor model. We refer to the extended background information provided by the original authors for further information. The q-factors can be downloaded directly from the authors’ homepage from within pd.read_csv().

We also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we rename the columns with the "R_"-prefix using regular expressions and write all column names in lowercase. We then query the data to select observations between the start and end dates. Finally, we use the double asterisk (**) notation in the assign function to apply the same transform of dividing by 100 to all four factors by iterating through them. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on try. You can check out style guides available online, e.g., Hadley Wickham’s tidyverse style guide. Note that we temporarily adjust the SSL certificate handling behavior in Python’s ssl module when retrieving the q-factors directly from the web, as demonstrated in Working with Stock Returns. This method should be used with caution, which is why we restore the default settings immediately after successfully downloading the data.

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

factors_q_monthly_link = (
  "https://global-q.org/uploads/1/2/2/6/122679606/"
  "q5_factors_monthly_2024.csv"
)

factors_q_monthly = (pd.read_csv(factors_q_monthly_link)
  .assign(
    date=lambda x: (
      pd.to_datetime(x["year"].astype(str) + "-" +
        x["month"].astype(str) + "-01"))
  )
  .drop(columns=["R_F", "R_MKT", "year"])
  .rename(columns=lambda x: x.replace("R_", "").lower())
  .query(f"date >= '{start_date}' and date <= '{end_date}'")
  .assign(
    # Bind col as a default argument; a plain closure would late-bind
    # and divide every column by the last loop value instead.
    **{col: (lambda x, col=col: x[col]/100)
       for col in ["me", "ia", "roe", "eg"]}
  )
)

ssl._create_default_https_context = ssl.create_default_context
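
A quick aside on why the comprehension above binds col as a default argument: Python closures capture variables, not values, so all four lambdas would otherwise share the final loop value once assign() eventually calls them. A self-contained illustration:

funcs = {col: (lambda x: x[col]) for col in ["a", "b"]}
funcs["a"]({"a": 1, "b": 2})  # returns 2: both lambdas see col == "b"

funcs_fixed = {col: (lambda x, col=col: x[col]) for col in ["a", "b"]}
funcs_fixed["a"]({"a": 1, "b": 2})  # returns 1, as intended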

Again, you can use the tidyfinance package for a shortcut:

tf.download_data(
  domain="factors_q",
  dataset="q5_factors_monthly",
  start_date=start_date,
  end_date=end_date
)

Macroeconomic Predictors

Our next data source is a set of macroeconomic variables that are often used as predictors for the equity premium. Welch and Goyal (2008) comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data on Amit Goyal's website. Since the data is an XLSX file stored in a public Google Drive location, we need a small workaround to access the data directly from our Python session. Usually, you need to authenticate if you interact with Google Drive directly in Python. Since the data is stored via a public link, we can proceed without any authentication.

sheet_id = "1bM7vCWd3WOt95Sf9qjLPZjoiafgF_8EG"
sheet_name = "macro_predictors.xlsx"
macro_predictors_link = (
  f"https://docs.google.com/spreadsheets/d/{sheet_id}"
  f"/gviz/tq?tqx=out:csv&sheet={sheet_name}"
)
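
The same export trick works for any publicly shared Google Sheet. A hypothetical helper (the function name is ours, purely for illustration) that encapsulates the URL scheme used above:

def gsheet_csv_url(sheet_id, sheet_name):
    """Return a CSV export link for a publicly shared Google Sheet."""
    return (f"https://docs.google.com/spreadsheets/d/{sheet_id}"
            f"/gviz/tq?tqx=out:csv&sheet={sheet_name}")

macro_predictors_link = gsheet_csv_url(sheet_id, sheet_name)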

Next, we read in the new data and transform the columns into the variables that we later use:

Among the variables is, for example, inflation (infl), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics (Campbell and Vuolteenaho 2004).

For variable definitions and the required data transformations, you can consult the material on Amit Goyal’s website.
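
In formulas, the valuation ratios constructed in the code below read as follows, where Index denotes the S&P 500 index level and D12 and E12 are twelve-month moving sums of dividends and earnings (our notation, following the raw column names):

$$dp_t = \log(D12_t) - \log(Index_t), \quad dy_t = \log(D12_t) - \log(Index_{t-1}),$$
$$ep_t = \log(E12_t) - \log(Index_t), \quad de_t = \log(D12_t) - \log(E12_t),$$

while the term spread and default yield spread are simple differences, $tms_t = lty_t - tbl_t$ and $dfy_t = BAA_t - AAA_t$.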

ssl._create_default_https_context = ssl._create_unverified_context

macro_predictors = (
  pd.read_csv(macro_predictors_link, thousands=",")
  .assign(
    date=lambda x: pd.to_datetime(x["yyyymm"], format="%Y%m"),
    dp=lambda x: np.log(x["D12"])-np.log(x["Index"]),
    dy=lambda x: np.log(x["D12"])-np.log(x["Index"].shift(1)),
    ep=lambda x: np.log(x["E12"])-np.log(x["Index"]),
    de=lambda x: np.log(x["D12"])-np.log(x["E12"]),
    tms=lambda x: x["lty"]-x["tbl"],
    dfy=lambda x: x["BAA"]-x["AAA"]
  )
  .rename(columns={"b/m": "bm"})
  .get(["date", "dp", "dy", "ep", "de", "svar", "bm",
        "ntis", "tbl", "lty", "ltr", "tms", "dfy", "infl"])
  .query("date >= @start_date and date <= @end_date")
  .dropna()
)

ssl._create_default_https_context = ssl.create_default_context

To get the equivalent data through tidyfinance, you can call:

tf.download_data(
  domain="macro_predictors",
  dataset="monthly",
  start_date=start_date,
  end_date=end_date
)

Other Macroeconomic Data

The Federal Reserve Bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. As an illustration, we fetch consumer price index (CPI) data, which can be found under the CPIAUCNS key, and construct the corresponding download link:

series = "CPIAUCNS"
url = f"https://fred.stlouisfed.org/graph/fredgraph.csv?id={series}"

We can then use the requests module to request the CSV, extract the data from the response body, and convert the columns to a tidy format:

import requests

resp = requests.get(url)
resp_csv = pd.io.common.StringIO(resp.text)

cpi_monthly = (pd.read_csv(resp_csv)
  .assign(
    date=lambda x: pd.to_datetime(x["observation_date"]),
    value=lambda x: pd.to_numeric(x[series], errors="coerce"),
    series=series,
  )
  .get(["date", "series", "value"])
  .query("date >= @start_date & date <= @end_date")
  .assign(cpi=lambda x: x["value"] / x["value"].iloc[-1])
)

The last line sets the current (latest) price level as the reference price level, so the cpi column measures each month's price level relative to December 2024. To download other time series, we just have to look them up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the PCU2122212122210 key.
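
Following the URL pattern from above, a small helper (hypothetical; the function name is ours) makes swapping series keys explicit:

def fred_csv_url(series):
    """Return the FRED CSV download link for a given series key."""
    return f"https://fred.stlouisfed.org/graph/fredgraph.csv?id={series}"

gold_ppi_url = fred_csv_url("PCU2122212122210")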


The tidyfinance package can, of course, also fetch the same index data and many more data series:

tf.download_data(
  domain="fred",
  series="CPIAUCNS",
  start_date=start_date,
  end_date=end_date
)

          date    series    value
0   1960-01-01  CPIAUCNS   29.300
1   1960-02-01  CPIAUCNS   29.400
2   1960-03-01  CPIAUCNS   29.400
3   1960-04-01  CPIAUCNS   29.500
4   1960-05-01  CPIAUCNS   29.500
..         ...       ...      ...
775 2024-08-01  CPIAUCNS  314.796
776 2024-09-01  CPIAUCNS  315.301
777 2024-10-01  CPIAUCNS  315.664
778 2024-11-01  CPIAUCNS  315.493
779 2024-12-01  CPIAUCNS  315.605

780 rows × 3 columns


Setting Up a Database

Now that we have downloaded some (freely available) data from the web into the memory of our Python session, let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code.

There are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use SQLite, a C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. Note that SQL (Structured Query Language) is a standard language for accessing and manipulating databases.

import sqlite3

An SQLite database is easily created; the code below is really all there is, and you do not need any external software. Note that SQLite has no dedicated date type, so date columns are stored as plain text or integers, which is why we parse them explicitly whenever we read tables back in. We will use the file tidy_finance_python.sqlite, located in the data subfolder, to retrieve data for all subsequent chapters. The initial part of the code ensures that the directory is created if it does not already exist.

import os

if not os.path.exists("data"):
  os.makedirs("data")

tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")

Next, we create a database table with the monthly Fama-French factor data. We do so with the pandas method to_sql(), which copies the data to our SQLite database.

(factors_ff3_monthly
  .to_sql(name="factors_ff3_monthly",
          con=tidy_finance,
          if_exists="replace",
          index=False)
)

Now, if we want to have the whole table in memory, we need to call pd.read_sql_query() with the corresponding query. You will see that we regularly load data into memory in the next chapters.

pd.read_sql_query(
  sql="SELECT date, risk_free FROM factors_ff3_monthly",
  con=tidy_finance,
  parse_dates={"date"}
)

          date  risk_free
0   1960-01-01     0.0033
1   1960-02-01     0.0029
2   1960-03-01     0.0035
3   1960-04-01     0.0019
4   1960-05-01     0.0027
..         ...        ...
775 2024-08-01     0.0048
776 2024-09-01     0.0040
777 2024-10-01     0.0039
778 2024-11-01     0.0040
779 2024-12-01     0.0037

780 rows × 2 columns

The last couple of code chunks are really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages.
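
If you share the database file, you might want to guard against accidental writes. A minimal sketch that opens the file read-only via SQLite's URI syntax, which Python's sqlite3 module supports:

import sqlite3

tidy_finance_ro = sqlite3.connect(
  "file:data/tidy_finance_python.sqlite?mode=ro", uri=True
)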

Before we move on to the next data source, let us also store the other six tables in our new SQLite database.

data_dict = {
  "factors_ff5_monthly": factors_ff5_monthly,
  "factors_ff3_daily": factors_ff3_daily,
  "industries_ff_monthly": industries_ff_monthly,
  "factors_q_monthly": factors_q_monthly,
  "macro_predictors": macro_predictors,
  "cpi_monthly": cpi_monthly
}

for key, value in data_dict.items():
    value.to_sql(name=key,
                 con=tidy_finance,
                 if_exists="replace",
                 index=False)

From now on, all you need to do to access data that is stored in the database is to follow two steps: (i) establish the connection to the SQLite database, and (ii) execute the query to fetch the data. For your convenience, the following steps show all you need in a compact fashion.

import pandas as pd
import sqlite3

tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")

factors_q_monthly = pd.read_sql_query(
  sql="SELECT * FROM factors_q_monthly",
  con=tidy_finance,
  parse_dates={"date"}
)
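
If you ever forget which tables the database contains, you can query SQLite's built-in catalog (a quick sketch):

pd.read_sql_query(
  sql="SELECT name FROM sqlite_master WHERE type = 'table'",
  con=tidy_finance
)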

Managing SQLite Databases

Finally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file always grows in size.

To optimize the database file, you can run the VACUUM command, which rebuilds the database and frees up unused space. You can execute this command via the connection's execute() function.

tidy_finance.execute("VACUUM")

The VACUUM command actually performs a couple of additional cleaning steps, which you can read about in this tutorial.
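
To see the effect on disk, you can compare the file size before and after vacuuming (a sketch; os.path.getsize() reports the size in bytes):

import os

size_before = os.path.getsize("data/tidy_finance_python.sqlite")
tidy_finance.execute("VACUUM")
size_after = os.path.getsize("data/tidy_finance_python.sqlite")
print(f"{size_before:,} -> {size_after:,} bytes")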


Exercises

  1. Download the monthly Fama-French factors manually from Kenneth French’s data library and read them in via pd.read_csv(). Validate that you get the same data as via the tf.download_data() function.
  2. Download the daily Fama-French 5 factors using the tf.download_data() function. After the successful download and conversion to the column format that we used above, compare the risk_free, mkt_excess, smb, and hml columns of factors_ff3_daily to factors_ff5_daily. Discuss any differences you might find.

References
