diff --git a/_freeze/python/accessing-and-managing-financial-data/execute-results/html.json b/_freeze/python/accessing-and-managing-financial-data/execute-results/html.json
index 1e822669..4978b5db 100644
--- a/_freeze/python/accessing-and-managing-financial-data/execute-results/html.json
+++ b/_freeze/python/accessing-and-managing-financial-data/execute-results/html.json
@@ -1,15 +1,15 @@
{
- "hash": "9fd328b50bbd06951670e009c36376c7",
+ "hash": "7f6920c8504bdf5ab2869b676468fe9a",
"result": {
"engine": "jupyter",
- "markdown": "---\ntitle: Accessing and Managing Financial Data\nmetadata:\n pagetitle: Accessing and Managing Financial Data with Python\n description-meta: Download and organize open-source financial data using the programming language Python. \n---\n\n\n\n::: callout-note\nYou are reading **Tidy Finance with Python**. You can find the equivalent chapter for the sibling **Tidy Finance with R** [here](../r/accessing-and-managing-financial-data.qmd).\n:::\n\nIn this chapter, we suggest a way to organize your financial data. Everybody who has experience with data is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome when using different data formats and across different projects. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages. \n\nThis chapter shows how to import different open-source datasets. Specifically, our data comes from the application programming interface (API) of Yahoo Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series.\\index{API} We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the Python packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n::: {#52142987 .cell execution_count=2}\n``` {.python .cell-code}\nimport pandas as pd\nimport numpy as np\nimport tidyfinance as tf\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n::: {#dd94c04a .cell execution_count=3}\n``` {.python .cell-code}\nstart_date = \"1960-01-01\"\nend_date = \"2024-12-31\"\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. Fortunately, the `pandas-datareader` package provides a simple interface to read data from Kenneth French's Data Library.\\index{Data!Fama-French factors}\\index{Kenneth French homepage}\n\n::: {#4b8d1ab2 .cell execution_count=4}\n``` {.python .cell-code}\nimport pandas_datareader as pdr\n```\n:::\n\n\nWe can use the `pdr.DataReader()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market (`mkt_excess`), size (`smb`), and value (`hml`) factors alongside the risk-free rates (`rf`). Note that we have to do some manual work to parse all the columns correctly and scale them appropriately, as the raw Fama-French data comes in a unique data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. 
If you are on the website, check the raw data files to appreciate the time you can save thanks to`pandas_datareader`.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n::: {#dbee5fef .cell execution_count=5}\n``` {.python .cell-code}\nfactors_ff3_monthly_raw = pdr.DataReader(\n name=\"F-F_Research_Data_Factors\",\n data_source=\"famafrench\", \n start=start_date, \n end=end_date)[0]\n\nfactors_ff3_monthly = (factors_ff3_monthly_raw\n .divide(100)\n .reset_index(names=\"date\")\n .assign(date=lambda x: pd.to_datetime(x[\"date\"].astype(str)))\n .rename(str.lower, axis=\"columns\")\n .rename(columns={\"mkt-rf\": \"mkt_excess\"})\n)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability (`rmw`) and investment (`cma`) factors. We demonstrate how the monthly factors are constructed in [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n\n::: {#9e9e1781 .cell execution_count=6}\n``` {.python .cell-code}\nfactors_ff5_monthly_raw = pdr.DataReader(\n name=\"F-F_Research_Data_5_Factors_2x3\",\n data_source=\"famafrench\", \n start=start_date, \n end=end_date)[0]\n\nfactors_ff5_monthly = (factors_ff5_monthly_raw\n .divide(100)\n .reset_index(names=\"date\")\n .assign(date=lambda x: pd.to_datetime(x[\"date\"].astype(str)))\n .rename(str.lower, axis=\"columns\")\n .rename(columns={\"mkt-rf\": \"mkt_excess\"})\n)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function. \n\n::: {#f0848293 .cell execution_count=7}\n``` {.python .cell-code}\nfactors_ff3_daily_raw = pdr.DataReader(\n name=\"F-F_Research_Data_Factors_daily\",\n data_source=\"famafrench\", \n start=start_date, \n end=end_date)[0]\n\nfactors_ff3_daily = (factors_ff3_daily_raw\n .divide(100)\n .reset_index(names=\"date\")\n .rename(str.lower, axis=\"columns\")\n .rename(columns={\"mkt-rf\": \"mkt_excess\"})\n)\n```\n:::\n\n\nIn a subsequent chapter, we also use the monthly returns from ten industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n::: {#cbe8cb33 .cell execution_count=8}\n``` {.python .cell-code}\nindustries_ff_monthly_raw = pdr.DataReader(\n name=\"10_Industry_Portfolios\",\n data_source=\"famafrench\", \n start=start_date, \n end=end_date)[0]\n\nindustries_ff_monthly = (industries_ff_monthly_raw\n .divide(100)\n .reset_index(names=\"date\")\n .assign(date=lambda x: pd.to_datetime(x[\"date\"].astype(str)))\n .rename(str.lower, axis=\"columns\")\n)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. You should check out the other sets by calling `pdr.famafrench.get_available_datasets()`.\n\nTo automatically download and process Fama-French data, you can also use the `tidyfinance` package with `domain=\"factors_ff\"` and the corresponding dataset, e.g.:\n\n::: {#557adcfb .cell execution_count=9}\n``` {.python .cell-code}\ntf.download_data(\n domain=\"factors_ff\",\n dataset=\"F-F_Research_Data_Factors\", \n start_date=start_date, \n end_date=end_date\n)\n```\n:::\n\n\nThe `tidyfinance` package implements the processing steps as above and returns the same cleaned data frame. \n\n## q-Factors\n\nIn recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. 
We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further information. The *q*-factors can be downloaded directly from the authors' homepage from within `pd.read_csv()`. \\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we rename the columns with the \"R_\"-prescript using regular expressions and write all column names in lowercase. We then query the data to select observations between the start and end dates. Finally, we use the double asterisk (`**`) notation in the `assign` function to apply the same transform of dividing by 100 to all four factors by iterating through them. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide} note that we temporarily adjust the SSL certificate handling behavior in Python’s \n`ssl` module when retrieving the $q$-factors directly from the web, as demonstrated in [Working with Stock Returns](working-with-stock-returns.qmd). This method should be used with caution, which is why we restore the default settings immediately after successfully downloading the data.\n\n::: {#1bcb6c34 .cell execution_count=10}\n``` {.python .cell-code}\nimport ssl\nssl._create_default_https_context = ssl._create_unverified_context\n\nfactors_q_monthly_link = (\n \"https://global-q.org/uploads/1/2/2/6/122679606/\"\n \"q5_factors_monthly_2024.csv\"\n)\n\nfactors_q_monthly = (pd.read_csv(factors_q_monthly_link)\n .assign(\n date=lambda x: (\n pd.to_datetime(x[\"year\"].astype(str) + \"-\" +\n x[\"month\"].astype(str) + \"-01\"))\n )\n .drop(columns=[\"R_F\", \"R_MKT\", \"year\"])\n .rename(columns=lambda x: x.replace(\"R_\", \"\").lower())\n .query(f\"date >= '{start_date}' and date <= '{end_date}'\")\n .assign(\n **{col: lambda x: x[col]/100 for col in [\"me\", \"ia\", \"roe\", \"eg\"]}\n )\n)\n\nssl._create_default_https_context = ssl.create_default_context\n```\n:::\n\n\nAgain, you can use the `tidyfinance` package for a shortcut:\n\n::: {#5cc217ab .cell execution_count=11}\n``` {.python .cell-code}\ntf.download_data(\n domain=\"factors_q\",\n dataset=\"q5_factors_monthly\", \n start_date=start_date, \n end_date=end_date\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
\n[Rendered table preview: 696 rows × 7 columns (date, risk_free, mkt_excess, me, ia, roe, eg)]\n
\n```\n:::\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) Since the data is an XLSX-file stored on a public Google Drive location, we need additional packages to access the data directly from our Python session. Usually, you need to authenticate if you interact with Google drive directly in Python. Since the data is stored via a public link, we can proceed without any authentication.\\index{Google Drive}\n\n::: {#7bd33a2b .cell execution_count=12}\n``` {.python .cell-code}\nsheet_id = \"1bM7vCWd3WOt95Sf9qjLPZjoiafgF_8EG\"\nsheet_name = \"macro_predictors.xlsx\"\nmacro_predictors_link = (\n f\"https://docs.google.com/spreadsheets/d/{sheet_id}\" \n f\"/gviz/tq?tqx=out:csv&sheet={sheet_name}\"\n)\n```\n:::\n\n\nNext, we read in the new data and transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006]. \n1. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978]. \n1. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988]. \n1. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998]. \n1. Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n1. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997].\n1. Net equity expansion (`ntis`), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n1. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis [@Campbell1987].\n1. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Long-term rate of returns (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n1. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989]. \n1. 
Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\t\t\t\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website.](https://sites.google.com/view/agoyal145)\n\n::: {#3fc6df88 .cell execution_count=13}\n``` {.python .cell-code}\nssl._create_default_https_context = ssl._create_unverified_context\n\nmacro_predictors = (\n pd.read_csv(macro_predictors_link, thousands=\",\")\n .assign(\n date=lambda x: pd.to_datetime(x[\"yyyymm\"], format=\"%Y%m\"),\n dp=lambda x: np.log(x[\"D12\"])-np.log(x[\"Index\"]),\n dy=lambda x: np.log(x[\"D12\"])-np.log(x[\"Index\"].shift(1)),\n ep=lambda x: np.log(x[\"E12\"])-np.log(x[\"Index\"]),\n de=lambda x: np.log(x[\"D12\"])-np.log(x[\"E12\"]),\n tms=lambda x: x[\"lty\"]-x[\"tbl\"],\n dfy=lambda x: x[\"BAA\"]-x[\"AAA\"]\n )\n .rename(columns={\"b/m\": \"bm\"})\n .get([\"date\", \"dp\", \"dy\", \"ep\", \"de\", \"svar\", \"bm\", \n \"ntis\", \"tbl\", \"lty\", \"ltr\", \"tms\", \"dfy\", \"infl\"])\n .query(\"date >= @start_date and date <= @end_date\")\n .dropna()\n)\n\nssl._create_default_https_context = ssl.create_default_context\n```\n:::\n\n\nTo get the equivalent data through `tidyfinance`, you can call:\n\n::: {#5f267096 .cell execution_count=14}\n``` {.python .cell-code}\ntf.download_data(\n domain=\"macro_predictors\",\n dataset=\"monthly\",\n start_date=start_date, \n end_date=end_date\n)\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. As an illustration, we use the already familiar `pandas-datareader` package to fetch consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS) key.\\index{Data!FRED}\\index{Data!CPI}\n\n::: {#b0515f3b .cell execution_count=15}\n``` {.python .cell-code}\ncpi_monthly = (pdr.DataReader(\n name=\"CPIAUCNS\", \n data_source=\"fred\", \n start=start_date, \n end=end_date\n )\n .reset_index(names=\"date\")\n .rename(columns={\"CPIAUCNS\": \"cpi\"})\n .assign(cpi=lambda x: x[\"cpi\"] / x[\"cpi\"].iloc[-1])\n)\n```\n:::\n\n\nNote that we use the `assign()` in the last line to set the current (latest) price level as the reference inflation level. To download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key.\n\nThe `tidyfinance` package can, of course, also fetch the same daily data and many more data series:\n\n::: {#ceef174a .cell execution_count=16}\n``` {.python .cell-code}\ntf.download_data(\n domain=\"fred\",\n series=\"CPIAUCNS\", \n start_date=start_date, \n end_date=end_date\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFailed to retrieve data for series CPIAUCNS: Failed to perform, curl: (6) Could not resolve host: https. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.\nFailed to retrieve data for series CPIAUCNS: 'date'\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=16}\n```{=html}\n
\n[Rendered table preview: empty table with columns date, value, series]\n
\n```\n:::\n:::\n\n\nTo download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. If your desired time series is not supported through tidyfinance, we recommend working with the `fredapi` package. Note that you need to get an API key to use its functionality. We refer to the package documentation for details.\n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our Python session, let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code. \n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://SQLite.org/)-database, which is the C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases.\\index{Database!SQLite}\n\n::: {#a10081e3 .cell execution_count=17}\n``` {.python .cell-code}\nimport sqlite3\n```\n:::\n\n\nAn SQLite-database is easily created - the code below is really all there is. You do not need any external software. Otherwise, date columns are stored and retrieved as integers.\\index{Database!Creation} We will use the file `tidy_finance_r.sqlite`, located in the data subfolder, to retrieve data for all subsequent chapters. The initial part of the code ensures that the directory is created if it does not already exist.\n\n::: {#4b49f781 .cell execution_count=18}\n``` {.python .cell-code}\nimport os\n\nif not os.path.exists(\"data\"):\n os.makedirs(\"data\")\n \ntidy_finance = sqlite3.connect(database=\"data/tidy_finance_python.sqlite\")\n```\n:::\n\n\nNext, we create a remote table with the monthly Fama-French factor data. We do so with the `pandas` function `to_sql()`, which copies the data to our SQLite-database.\n\n::: {#c2800478 .cell execution_count=19}\n``` {.python .cell-code}\n(factors_ff3_monthly\n .to_sql(name=\"factors_ff3_monthly\", \n con=tidy_finance, \n if_exists=\"replace\",\n index=False)\n)\n```\n:::\n\n\nNow, if we want to have the whole table in memory, we need to call `pd.read_sql_query()` with the corresponding query. You will see that we regularly load the data into the memory in the next chapters.\\index{Database!Read}\n\n::: {#dbe240b7 .cell execution_count=20}\n``` {.python .cell-code}\npd.read_sql_query(\n sql=\"SELECT date, rf FROM factors_ff3_monthly\",\n con=tidy_finance,\n parse_dates={\"date\"}\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=20}\n```{=html}\n
\n[Rendered table preview: 780 rows × 2 columns (date, rf)]\n
\n```\n:::\n:::\n\n\nThe last couple of code chunks are really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages. \n\nBefore we move on to the next data source, let us also store the other six tables in our new SQLite database. \n\n::: {#4a6705c7 .cell execution_count=21}\n``` {.python .cell-code}\ndata_dict = {\n \"factors_ff5_monthly\": factors_ff5_monthly,\n \"factors_ff3_daily\": factors_ff3_daily,\n \"industries_ff_monthly\": industries_ff_monthly, \n \"factors_q_monthly\": factors_q_monthly,\n \"macro_predictors\": macro_predictors,\n \"cpi_monthly\": cpi_monthly\n}\n\nfor key, value in data_dict.items():\n value.to_sql(name=key,\n con=tidy_finance, \n if_exists=\"replace\",\n index=False)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow two steps: (i) Establish the connection to the SQLite-database and (ii) execute the query to fetch the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n::: {#045cddbc .cell message='false' results='false' execution_count=22}\n``` {.python .cell-code}\nimport pandas as pd\nimport sqlite3\n\ntidy_finance = sqlite3.connect(database=\"data/tidy_finance_python.sqlite\")\n\nfactors_q_monthly = pd.read_sql_query(\n sql=\"SELECT * FROM factors_q_monthly\",\n con=tidy_finance,\n parse_dates={\"date\"}\n)\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command in the database, which rebuilds the database and frees up unused space. You can execute the command in the database using the `execute()` function. \n\n::: {#b530109e .cell execution_count=23}\n``` {.python .cell-code}\ntidy_finance.execute(\"VACUUM\")\n```\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read about in [this tutorial.](https://SQLite.org/docs/sql/statements/vacuum.html) \\index{Database!Cleaning}\n\n## Key Takeaways\n\n- Importing Fama-French factors, q-factors, macroeconomic indicators, and CPI data is simplified through API calls, CSV parsing, and web scraping techniques.\n- The `tidyfinance` Python package offers pre-processed access to financial datasets, reducing manual data cleaning and saving valuable time.\n- Creating a centralized SQLite database helps manage and organize data efficiently across projects, while maintaining reproducibility.\n- Structured database storage supports scalable data access, which is essential for long-term academic projects and collaborative work in finance.\n\n## Exercises\n\n1. Download the monthly Fama-French factors manually from [Kenneth French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `pd.read_csv()`. Validate that you get the same data as via the `pandas-datareader` package. \n1. Download the daily Fama-French 5 factors using the `pdr.DataReader()` package. 
After the successful download and conversion to the column format that we used above, compare the `rf`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find. \n\n",
+ "markdown": "---\ntitle: Accessing and Managing Financial Data\nmetadata:\n pagetitle: Accessing and Managing Financial Data with Python\n description-meta: Download and organize open-source financial data using the programming language Python. \n---\n\n\n\n::: callout-note\nYou are reading **Tidy Finance with Python**. You can find the equivalent chapter for the sibling **Tidy Finance with R** [here](../r/accessing-and-managing-financial-data.qmd).\n:::\n\nIn this chapter, we suggest a way to organize your financial data. Everybody who has experience with data is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome when using different data formats and across different projects. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages. \n\nThis chapter shows how to import different open-source datasets. Specifically, our data comes from the application programming interface (API) of Yahoo Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series.\\index{API} We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the Python packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n::: {#2068e538 .cell execution_count=2}\n``` {.python .cell-code}\nimport pandas as pd\nimport numpy as np\nimport io\nimport re\nimport zipfile\nfrom curl_cffi import requests\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n::: {#c7108e44 .cell execution_count=3}\n``` {.python .cell-code}\nstart_date = \"1960-01-01\"\nend_date = \"2024-12-31\"\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. The data are freely available from Kenneth French’s Data Library, but the raw files come in a rather idiosyncratic format. If you access the data via the website, the manual *raw* workflow looks like this:\n\n1. Go to the website\n1. Find the right dataset\n1. Download a ZIP file\n1. Extract the CSV inside\n1. Select the right data table from the file and import the table into Python\n1. Clean the dates, scale the returns, fix column names, handle missing values, etc.\n\nDoing this once is fine; doing it repeatedly across projects is exactly the type of boilerplate that’s easy to mess up and annoying to maintain. It is therefore natural to automate these steps in Python.\n\n# From manual steps to a download script\n\nA minimal download script mirrors the manual steps one by one. 
For example, to fetch a Fama-French dataset, you first construct the URL:\n\n::: {#9ab8e5db .cell execution_count=4}\n``` {.python .cell-code}\ndataset = \"F-F_Research_Data_Factors\"\nbase_url = \"http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/\"\nurl = f\"{base_url}{dataset}_CSV.zip\"\n```\n:::\n\n\nNext, you replace the browser download with an HTTP request and extract the ZIP in memory:\n\n::: {#aafb772e .cell execution_count=5}\n``` {.python .cell-code}\nresp = requests.get(url)\nresp.raise_for_status()\n\nwith zipfile.ZipFile(io.BytesIO(resp.content)) as zf:\n    file_name = zf.namelist()[0]  # Ken French ZIPs contain one file\n    raw_text = zf.read(file_name).decode(\"latin1\")\n```\n:::\n\n\nThe most important part of this chunk is the `requests.get()` call. This is the moment where we replace all the manual browser work (open the website, click download, save the file) with a single, reproducible line of code. Calling `raise_for_status()` then ensures that we stop immediately if the server returns an error (e.g., HTTP 404 or 500) instead of silently processing a broken file. Once this succeeds, `resp.content` contains the ZIP bytes that we can open in memory.\n\nThe raw file contains documentation text followed by the actual data table(s). To emulate *scrolling down until the numbers start*, you can split the file into blocks and keep the longest one, which contains the table:\n\n::: {#61d53ef8 .cell execution_count=6}\n``` {.python .cell-code}\nchunks = raw_text.split(\"\\r\\n\\r\\n\")\ntable_text = max(chunks, key=len)\n```\n:::\n\n\nWithin this block, the first CSV header line starts at the first line beginning with a comma. We add a “Date” label for the index and pass everything to `pd.read_csv()`:\n\n::: {#d72929a3 .cell execution_count=7}\n``` {.python .cell-code}\nmatch = re.search(r\"^\\s*,\", table_text, flags=re.M)\nstart = match.start()\ncsv_text = \"Date\" + table_text[start:]\n\nfactors_ff_raw = pd.read_csv(io.StringIO(csv_text), index_col=0)\n```\n:::\n\n\nAt this point, the index still consists of integer date codes with different lengths depending on the frequency. 
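For instance, you can peek at the first few index entries to see which format you are dealing with; a quick check (the values in the comment are illustrative):\n\n``` {.python}\n# For the monthly factors file, this prints six-digit codes such as 192607\nprint(factors_ff_raw.index[:3])\n```\n\n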
We need a bit of logic to convert them into a proper `DatetimeIndex`:\n\n::: {#7a97a04a .cell execution_count=8}\n``` {.python .cell-code}\ns = factors_ff_raw.index.astype(str)\n\nif (s.str.len() == 8).all():  # daily: YYYYMMDD\n    dt = pd.to_datetime(s, format=\"%Y%m%d\")\nelif (s.str.len() == 6).all():  # monthly: YYYYMM\n    dt = pd.to_datetime(s + \"01\", format=\"%Y%m%d\")\nelif (s.str.len() == 4).all():  # annual: YYYY\n    dt = pd.to_datetime(s + \"0101\", format=\"%Y%m%d\")\n    dt = dt.to_period(\"A-DEC\").to_timestamp(how=\"end\")\nelse:\n    raise ValueError(\"Unknown date format in Fama-French index.\")\n\nfactors_ff_raw = factors_ff_raw.set_index(dt)\nfactors_ff_raw.index.name = \"date\"\n```\n:::\n\n\nFinally, we still have to clean the data:\n\n- Convert returns from percent to decimal.\n- Standardize column names (e.g., all lowercase, `Mkt-RF` to `mkt_excess`, `RF` to `risk_free`).\n- Replace the special missing-value codes (-99.99 and -999) with actual missing values.\n- Filter the data to the desired start and end dates.\n\nPut together, these steps could look like this:\n\n::: {#77d01e70 .cell execution_count=9}\n``` {.python .cell-code}\n# Filter to the desired date range\nif start_date:\n    factors_ff_raw = factors_ff_raw[factors_ff_raw.index >= pd.to_datetime(start_date)]\nif end_date:\n    factors_ff_raw = factors_ff_raw[factors_ff_raw.index <= pd.to_datetime(end_date)]\n\nfactors_ff3_monthly = (factors_ff_raw\n  .div(100)\n  .reset_index(names=\"date\")\n  .rename(columns=str.lower)\n  .rename(columns={\"mkt-rf\": \"mkt_excess\", \"rf\": \"risk_free\"})\n  .replace({\"-99.99\": pd.NA, -99.99: pd.NA, -999: pd.NA})\n)\nfactors_ff3_monthly\n```\n\n::: {.cell-output .cell-output-display execution_count=37}\n```{=html}\n
\n[Rendered table preview: 780 rows × 5 columns (date, mkt_excess, smb, hml, risk_free)]\n
\n```\n:::\n:::\n\n\nAll of these steps are doable, but none of them are really about finance - they are just the technical scaffolding required before you can work with the actual factor returns. That’s where a dedicated helper package becomes invaluable. The `tidyfinance` package performs this entire workflow under the hood: you request a Fama-French dataset and receive a clean, consistently formatted data table from Kenneth French's Data Library.\\index{Data!Fama-French factors}\\index{Kenneth French homepage} This avoids repetitive boilerplate, reduces errors, and lets you focus on modeling and analysis rather than on data plumbing.\n\n### Using `tidyfinance` instead of reimplementing the plumbing\n\nWe first import the package:\n\n::: {#18f92154 .cell execution_count=10}\n``` {.python .cell-code}\nimport tidyfinance as tf\n```\n:::\n\n\nThen, we can use the `tf.download_data()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market (`mkt_excess`), size (`smb`), and value (`hml`) factors alongside the risk-free rate (`risk_free`). Note that the `tf.download_data()` function parses all the columns correctly and already scales them appropriately, as the raw Fama-French data comes in a rather idiosyncratic format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to the `tidyfinance` package.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n::: {#fd60bf92 .cell execution_count=11}\n``` {.python .cell-code}\nfactors_ff3_monthly = tf.download_data(\n  domain=\"famafrench\",\n  dataset=\"F-F_Research_Data_Factors\",\n  start_date=start_date,\n  end_date=end_date,\n)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability (`rmw`) and investment (`cma`) factors. We demonstrate how the monthly factors are constructed in [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n\n::: {#bf347fbd .cell execution_count=12}\n``` {.python .cell-code}\nfactors_ff5_monthly = tf.download_data(\n  domain=\"famafrench\",\n  dataset=\"F-F_Research_Data_5_Factors_2x3\",\n  start_date=start_date,\n  end_date=end_date,\n)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function. \n\n::: {#7708ed92 .cell execution_count=13}\n``` {.python .cell-code}\nfactors_ff3_daily = tf.download_data(\n  domain=\"famafrench\",\n  dataset=\"F-F_Research_Data_Factors_daily\",\n  start_date=start_date,\n  end_date=end_date,\n)\n```\n:::\n\n\nIn a subsequent chapter, we also use the monthly returns from ten industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n::: {#6a04aed2 .cell execution_count=14}\n``` {.python .cell-code}\nindustries_ff_monthly = tf.download_data(\n  domain=\"famafrench\",\n  dataset=\"10_Industry_Portfolios\",\n  start_date=start_date,\n  end_date=end_date,\n)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. You should check out the other sets by calling `tf.get_available_famafrench_datasets()`.\n
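\nFor instance, you can scan the available dataset names for a keyword before deciding what to download. A minimal sketch, assuming the function returns the dataset names as a list of strings:\n\n``` {.python}\navailable = tf.get_available_famafrench_datasets()\n\n# Illustrative filter: keep only the names that mention portfolios\nportfolio_sets = [name for name in available if \"Portfolios\" in name]\nprint(portfolio_sets[:5])\n```\n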
\n## q-Factors\n\nIn recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further information. The *q*-factors can be downloaded directly from the authors' homepage with `pd.read_csv()`.\\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we strip the \"R_\" prefix from the column names using regular expressions and write all column names in lowercase. We then query the data to select observations between the start and end dates. Finally, we use the double asterisk (`**`) notation in the `assign` function to apply the same transform of dividing by 100 to all four factors by iterating through them. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide} Note that we temporarily adjust the SSL certificate handling behavior in Python’s `ssl` module when retrieving the $q$-factors directly from the web, as demonstrated in [Working with Stock Returns](working-with-stock-returns.qmd). This method should be used with caution, which is why we restore the default settings immediately after successfully downloading the data.\n\n::: {#4bf4afe7 .cell execution_count=15}\n``` {.python .cell-code}\nimport ssl\nssl._create_default_https_context = ssl._create_unverified_context\n\nfactors_q_monthly_link = (\n  \"https://global-q.org/uploads/1/2/2/6/122679606/\"\n  \"q5_factors_monthly_2024.csv\"\n)\n\nfactors_q_monthly = (pd.read_csv(factors_q_monthly_link)\n  .assign(\n    date=lambda x: (\n      pd.to_datetime(x[\"year\"].astype(str) + \"-\" +\n        x[\"month\"].astype(str) + \"-01\"))\n  )\n  .drop(columns=[\"R_F\", \"R_MKT\", \"year\"])\n  .rename(columns=lambda x: x.replace(\"R_\", \"\").lower())\n  .query(f\"date >= '{start_date}' and date <= '{end_date}'\")\n  .assign(\n    # col=col binds each column name at definition time (avoids late binding)\n    **{col: lambda x, col=col: x[col]/100 for col in [\"me\", \"ia\", \"roe\", \"eg\"]}\n  )\n)\n\nssl._create_default_https_context = ssl.create_default_context\n```\n:::\n\n\nAgain, you can use the `tidyfinance` package for a shortcut:\n\n::: {#87c32f1a .cell execution_count=16}\n``` {.python .cell-code}\ntf.download_data(\n  domain=\"factors_q\",\n  dataset=\"q5_factors_monthly\", \n  start_date=start_date, \n  end_date=end_date\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=44}\n```{=html}\n
\n[Rendered table preview: 696 rows × 7 columns (date, risk_free, mkt_excess, me, ia, roe, eg)]\n
\n```\n:::\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) Since the data is an XLSX-file stored on a public Google Drive location, we need additional packages to access the data directly from our Python session. Usually, you need to authenticate if you interact with Google drive directly in Python. Since the data is stored via a public link, we can proceed without any authentication.\\index{Google Drive}\n\n::: {#6ed1395b .cell execution_count=17}\n``` {.python .cell-code}\nsheet_id = \"1bM7vCWd3WOt95Sf9qjLPZjoiafgF_8EG\"\nsheet_name = \"macro_predictors.xlsx\"\nmacro_predictors_link = (\n f\"https://docs.google.com/spreadsheets/d/{sheet_id}\" \n f\"/gviz/tq?tqx=out:csv&sheet={sheet_name}\"\n)\n```\n:::\n\n\nNext, we read in the new data and transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006]. \n1. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978]. \n1. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988]. \n1. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998]. \n1. Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n1. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997].\n1. Net equity expansion (`ntis`), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n1. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis [@Campbell1987].\n1. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Long-term rate of returns (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n1. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989]. \n1. 
Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website.](https://sites.google.com/view/agoyal145)\n\n::: {#af3f3685 .cell execution_count=18}\n``` {.python .cell-code}\nssl._create_default_https_context = ssl._create_unverified_context\n\nmacro_predictors = (\n  pd.read_csv(macro_predictors_link, thousands=\",\")\n  .assign(\n    date=lambda x: pd.to_datetime(x[\"yyyymm\"], format=\"%Y%m\"),\n    dp=lambda x: np.log(x[\"D12\"])-np.log(x[\"Index\"]),\n    dy=lambda x: np.log(x[\"D12\"])-np.log(x[\"Index\"].shift(1)),\n    ep=lambda x: np.log(x[\"E12\"])-np.log(x[\"Index\"]),\n    de=lambda x: np.log(x[\"D12\"])-np.log(x[\"E12\"]),\n    tms=lambda x: x[\"lty\"]-x[\"tbl\"],\n    dfy=lambda x: x[\"BAA\"]-x[\"AAA\"]\n  )\n  .rename(columns={\"b/m\": \"bm\"})\n  .get([\"date\", \"dp\", \"dy\", \"ep\", \"de\", \"svar\", \"bm\", \n        \"ntis\", \"tbl\", \"lty\", \"ltr\", \"tms\", \"dfy\", \"infl\"])\n  .query(\"date >= @start_date and date <= @end_date\")\n  .dropna()\n)\n\nssl._create_default_https_context = ssl.create_default_context\n```\n:::\n\n\nTo get the equivalent data through `tidyfinance`, you can call:\n\n::: {#fa0b3e29 .cell execution_count=19}\n``` {.python .cell-code}\ntf.download_data(\n  domain=\"macro_predictors\",\n  dataset=\"monthly\",\n  start_date=start_date, \n  end_date=end_date\n)\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve Bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. As an illustration, we fetch consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS) key.\\index{Data!FRED}\\index{Data!CPI} We first construct the URL of the corresponding CSV export:\n\n::: {#3d801a71 .cell execution_count=20}\n``` {.python .cell-code}\nseries = \"CPIAUCNS\"\nurl = f\"https://fred.stlouisfed.org/graph/fredgraph.csv?id={series}\"\n```\n:::\n\n\nWe can then use the `requests` module to request the CSV, extract the data from the response body, and convert the columns to a tidy format:\n\n::: {#80fe0fbe .cell execution_count=21}\n``` {.python .cell-code}\nresp = requests.get(url)\nresp_csv = io.StringIO(resp.text)\n\ncpi_monthly = (pd.read_csv(resp_csv)\n  .assign(\n    date=lambda x: pd.to_datetime(x[\"observation_date\"]),\n    value=lambda x: pd.to_numeric(\n      x[series], errors=\"coerce\"\n    ),\n    series=series,\n  )\n  .get([\"date\", \"series\", \"value\"])\n  .query(\"date >= @start_date & date <= @end_date\")\n  .assign(cpi=lambda x: x[\"value\"] / x[\"value\"].iloc[-1])\n)\n```\n:::\n\n\nThe last line sets the current (latest) price level as the reference price level.\n\nThe `tidyfinance` package can, of course, also fetch the same index data and many more data series:\n\n::: {#b94f5bd0 .cell execution_count=22}\n``` {.python .cell-code}\ntf.download_data(\n  domain=\"fred\",\n  series=\"CPIAUCNS\",\n  start_date=start_date,\n  end_date=end_date\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=50}\n```{=html}\n
\n[Rendered table preview: 780 rows × 3 columns (date, series, value)]\n
\n```\n:::\n:::\n\n\nTo download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. If your desired time series is not supported through `tidyfinance`, we recommend working with the `fredapi` package. Note that you need to get an API key to use its functionality. We refer to the package documentation for details.\n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our Python session, let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code. \n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://SQLite.org/)-database, which is the C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases.\\index{Database!SQLite}\n\n::: {#b825745c .cell execution_count=23}\n``` {.python .cell-code}\nimport sqlite3\n```\n:::\n\n\nAn SQLite-database is easily created - the code below is really all there is. You do not need any external software. Since SQLite has no dedicated date type, we parse date columns explicitly (via the `parse_dates` argument of `pd.read_sql_query()`) whenever we read tables back into memory.\\index{Database!Creation} We will use the file `tidy_finance_python.sqlite`, located in the data subfolder, to retrieve data for all subsequent chapters. The initial part of the code ensures that the directory is created if it does not already exist.\n\n::: {#ac03dbae .cell execution_count=24}\n``` {.python .cell-code}\nimport os\n\nif not os.path.exists(\"data\"):\n  os.makedirs(\"data\")\n  \ntidy_finance = sqlite3.connect(database=\"data/tidy_finance_python.sqlite\")\n```\n:::\n\n\nNext, we create a database table with the monthly Fama-French factor data. We do so with the `pandas` function `to_sql()`, which copies the data to our SQLite-database.\n\n::: {#244fccf8 .cell execution_count=25}\n``` {.python .cell-code}\n(factors_ff3_monthly\n  .to_sql(name=\"factors_ff3_monthly\", \n          con=tidy_finance, \n          if_exists=\"replace\",\n          index=False)\n)\n```\n:::\n\n\nNow, if we want to have the whole table in memory, we need to call `pd.read_sql_query()` with the corresponding query. You will see that we regularly load the data into memory in the next chapters.\\index{Database!Read}\n\n::: {#dcab3728 .cell execution_count=26}\n``` {.python .cell-code}\npd.read_sql_query(\n  sql=\"SELECT date, risk_free FROM factors_ff3_monthly\",\n  con=tidy_finance,\n  parse_dates={\"date\"}\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=54}\n```{=html}\n
\n[Rendered table preview: 780 rows × 2 columns (date, risk_free)]\n
\n```\n:::\n:::\n\n\nThe last couple of code chunks are really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages. \n\nBefore we move on to the next data source, let us also store the other six tables in our new SQLite database. \n\n::: {#adf2106d .cell execution_count=27}\n``` {.python .cell-code}\ndata_dict = {\n  \"factors_ff5_monthly\": factors_ff5_monthly,\n  \"factors_ff3_daily\": factors_ff3_daily,\n  \"industries_ff_monthly\": industries_ff_monthly, \n  \"factors_q_monthly\": factors_q_monthly,\n  \"macro_predictors\": macro_predictors,\n  \"cpi_monthly\": cpi_monthly\n}\n\nfor key, value in data_dict.items():\n  value.to_sql(name=key,\n               con=tidy_finance, \n               if_exists=\"replace\",\n               index=False)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow two steps: (i) establish the connection to the SQLite-database, and (ii) execute the query to fetch the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n::: {#6487b384 .cell message='false' results='false' execution_count=28}\n``` {.python .cell-code}\nimport pandas as pd\nimport sqlite3\n\ntidy_finance = sqlite3.connect(database=\"data/tidy_finance_python.sqlite\")\n\nfactors_q_monthly = pd.read_sql_query(\n  sql=\"SELECT * FROM factors_q_monthly\",\n  con=tidy_finance,\n  parse_dates={\"date\"}\n)\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future use. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command, which rebuilds the database and frees up unused space. You can execute the command in the database using the `execute()` function. \n\n::: {#28341992 .cell execution_count=29}\n``` {.python .cell-code}\ntidy_finance.execute(\"VACUUM\")\n```\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read about in [this tutorial.](https://SQLite.org/docs/sql/statements/vacuum.html)\\index{Database!Cleaning}\n
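\nTo see the effect of `VACUUM` on disk, you can compare the size of the database file before and after running the command. A minimal sketch using only the standard library (the path matches the database created above):\n\n``` {.python}\nimport os\n\ndb_path = \"data/tidy_finance_python.sqlite\"\n\nsize_before = os.path.getsize(db_path)\ntidy_finance.execute(\"VACUUM\")\nsize_after = os.path.getsize(db_path)\n\n# After dropping or replacing tables, size_after is typically smaller\nprint(f\"Before: {size_before:,} bytes, after: {size_after:,} bytes\")\n```\n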
\n## Key Takeaways\n\n- Importing Fama-French factors, q-factors, macroeconomic indicators, and CPI data is simplified through API calls, CSV parsing, and web scraping techniques.\n- The `tidyfinance` Python package offers pre-processed access to financial datasets, reducing manual data cleaning and saving valuable time.\n- Creating a centralized SQLite database helps manage and organize data efficiently across projects, while maintaining reproducibility.\n- Structured database storage supports scalable data access, which is essential for long-term academic projects and collaborative work in finance.\n\n## Exercises\n\n1. Download the monthly Fama-French factors manually from [Kenneth French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `pd.read_csv()`. Validate that you get the same data as via the `tf.download_data()` function. \n1. Download the daily Fama-French 5 factors using the `tf.download_data()` function. After the successful download and conversion to the column format that we used above, compare the `risk_free`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find. \n\n",
"supporting": [
"accessing-and-managing-financial-data_files"
],
"filters": [],
"includes": {
"include-in-header": [
- "\n\n\n"
+ "\n\n\n"
]
}
}
diff --git a/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json b/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json
index 6a81122a..646eed21 100644
--- a/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json
+++ b/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json
@@ -1,8 +1,8 @@
{
- "hash": "58cf68fe24e6b0c8b028f74da0c95bf6",
+ "hash": "20e6e3325026e6ce999b7fe854cd3f10",
"result": {
"engine": "knitr",
- "markdown": "---\ntitle: Accessing and Managing Financial Data\naliases: \n - ../accessing-and-managing-financial-data.html\nmetadata:\n pagetitle: Accessing and Managing Financial Data with R\n description-meta: Download and organize open-source financial data using the programming language R. \n---\n\n::: callout-note\nYou are reading **Tidy Finance with R**. You can find the equivalent chapter for the sibling **Tidy Finance with Python** [here](../python/accessing-and-managing-financial-data.qmd).\n:::\n\nIn this chapter, we suggest a way to organize your financial data. Everybody who has experience with data is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome in the case of using different data formats, both across different projects and across different programming languages. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages.\n\nThis chapter shows how to import different open source data sets. Specifically, our data comes from the application programming interface (API) of Yahoo Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series that can be scraped directly from a website.\\index{API}\\index{Web scraping} We show how to process these raw data, as well as how to take a shortcut using the `tidyfinance` package, which provides a consistent interface to tidy financial data. We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the global R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(tidyfinance)\nlibrary(scales)\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2024-12-31\")\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. Fortunately, there is a neat package by [Nelson Areal](https://github.com/nareal/frenchdata/) that allows us to access the data easily: the `frenchdata` package provides functions to download and read data sets from [Prof. Kenneth French finance data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) [@frenchdata].\\index{Data!Fama-French factors} \\index{Kenneth French homepage}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(frenchdata)\n```\n:::\n\n\nWe can use the `download_french_data()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market `mkt_excess`, size `smb` and value `hml` alongside the risk-free rates `rf`. 
Note that we have to do some manual work to correctly parse all the columns and scale them appropriately, as the raw Fama-French data comes in a very unpractical data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to `frenchdata`.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_raw <- download_french_data(\"Fama/French 3 Factors\")\nfactors_ff3_monthly <- factors_ff3_monthly_raw$subsets$data[[1]] |>\n mutate(\n date = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability `rmw` and investment `cma` factors. We demonstrate how the monthly factors are constructed in the chapter [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff5_monthly_raw <- download_french_data(\"Fama/French 5 Factors (2x3)\")\n\nfactors_ff5_monthly <- factors_ff5_monthly_raw$subsets$data[[1]] |>\n mutate(\n date = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML, RMW, CMA), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily_raw <- download_french_data(\"Fama/French 3 Factors [Daily]\")\n\nfactors_ff3_daily <- factors_ff3_daily_raw$subsets$data[[1]] |>\n mutate(\n date = ymd(date),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIn a subsequent chapter, we also use the 10 monthly industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindustries_ff_monthly_raw <- download_french_data(\"10 Industry Portfolios\")\n\nindustries_ff_monthly <- industries_ff_monthly_raw$subsets$data[[1]] |>\n mutate(date = floor_date(ymd(str_c(date, \"01\")), \"month\")) |>\n mutate(across(where(is.numeric), ~ . / 100)) |>\n select(date, everything()) |>\n filter(date >= start_date & date <= end_date) |> \n rename_with(str_to_lower)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. You should check out the other sets by calling `get_french_data_list()`. \n\nTo automatically download and process Fama-French data, you can also use the `tidyfinance` package with `type = \"factors_ff_3_monthly\"` or similar, e.g.:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"factors_ff_3_monthly\", \n start_date = start_date, \n end_date = end_date\n)\n```\n:::\n\n\nThe `tidyfinance` package implements the processing steps as above and returns the same cleaned data frame. 
The list of supported Fama-French data types can be called as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist_supported_types(domain = \"Fama-French\")\n```\n:::\n\n\n## q-Factors\n\nIn recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further information. The *q* factors can be downloaded directly from the authors' homepage from within `read_csv()`.\\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we rename the columns with the \"R\\_\"-prescript using regular expressions and write all column names in lowercase. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_q_monthly_link <-\n \"https://global-q.org/uploads/1/2/2/6/122679606/q5_factors_monthly_2023.csv\"\n\nfactors_q_monthly <- read_csv(factors_q_monthly_link) |>\n mutate(date = ymd(str_c(year, month, \"01\", sep = \"-\"))) |>\n rename_with(~str_remove(., \"R_\")) |>\n rename_with(str_to_lower) |>\n mutate(across(-date, ~. / 100)) |>\n select(date, risk_free = f, mkt_excess = mkt, everything()) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nAgain, you can use the `tidyfinance` package for a shortcut:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"factors_q5_monthly\", \n start_date = start_date, \n end_date = end_date\n)\n```\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data updated to 2022 on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) The data is an XLSX-file stored on a public Google drive location and we directly export a CSV file.\\index{Data!Macro predictors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsheet_id <- \"1bM7vCWd3WOt95Sf9qjLPZjoiafgF_8EG\"\nsheet_name <- \"Monthly\"\nmacro_predictors_url <- paste0(\n \"https://docs.google.com/spreadsheets/d/\", sheet_id,\n \"/gviz/tq?tqx=out:csv&sheet=\", sheet_name\n)\nmacro_predictors_raw <- read_csv(macro_predictors_url)\n```\n:::\n\n\nNext, we transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006].\n2. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978].\n3. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988].\n4. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998].\n5. 
Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n6. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997].\n7. Net equity expansion (`ntis`), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n8. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis [@Campbell1987].\n9. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n10. Long-term rate of returns (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n11. Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n12. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989].\n13. Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website](https://sites.google.com/view/agoyal145).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors <- macro_predictors_raw |>\n mutate(date = ym(yyyymm)) |>\n mutate(across(where(is.character), as.numeric)) |>\n mutate(\n IndexDiv = Index + D12,\n logret = log(IndexDiv) - log(lag(IndexDiv)),\n Rfree = log(Rfree + 1),\n rp_div = lead(logret - Rfree, 1), # Future excess market return\n dp = log(D12) - log(Index), # Dividend Price ratio\n dy = log(D12) - log(lag(Index)), # Dividend yield\n ep = log(E12) - log(Index), # Earnings price ratio\n de = log(D12) - log(E12), # Dividend payout ratio\n tms = lty - tbl, # Term spread\n dfy = BAA - AAA # Default yield spread\n ) |>\n select(\n date, rp_div, dp, dy, ep, de, svar,\n bm = `b/m`, ntis, tbl, lty, ltr,\n tms, dfy, infl\n ) |>\n filter(date >= start_date & date <= end_date) |>\n drop_na()\n```\n:::\n\n\nTo get the equivalent data through `tidyfinance`, you can call:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"macro_predictors_monthly\",\n start_date = start_date,\n end_date = end_date\n)\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. The data can be downloaded directly from FRED by constructing the appropriate URL. 
For instance, let us consider the consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS):\\index{Data!FRED}\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseries <- \"CPIAUCNS\"\ncpi_url <- paste0(\n \"https://fred.stlouisfed.org/graph/fredgraph.csv?id=\", series\n)\n```\n:::\n\n\nWe can then use the `httr2` [@httr2] package to request the CSV, extract the data from the response body, and convert the columns to a tidy format:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(httr2)\n\ncpi_daily <- request(cpi_url) |>\n req_perform() |>\n resp_body_string() |>\n read_csv() |>\n mutate(\n date = as.Date(observation_date),\n value = as.numeric(.data[[series]]),\n series = series,\n .keep = \"none\"\n )\n```\n:::\n\n\nWe convert the daily CPI data to monthly because we use the latter in later chapters. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ncpi_monthly <- cpi_daily |>\n mutate(\n date = floor_date(date, \"month\"),\n cpi = value / value[date == max(date)],\n .keep = \"none\"\n )\n```\n:::\n\n\nThe `tidyfinance` package can, of course, also fetch the same daily data and many more data series:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"fred\",\n series = \"CPIAUCNS\",\n start_date = start_date,\n end_date = end_date\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 0 × 3\n# ℹ 3 variables: date , value , series \n```\n\n\n:::\n:::\n\n\nTo download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. If your desired time series is not supported through `tidyfinance`, we recommend working with the `fredr` package [@fredr]. Note that you need to get an API key to use its functionality. We refer to the package documentation for details.\n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our R session let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code.\n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://www.sqlite.org/index.html) database, which is the C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases and heavily inspired the `dplyr` functions. We refer to [this tutorial](https://www.w3schools.com/sql/sql_intro.asp) for more information on SQL.\\index{Database!SQLite}\n\nThere are two packages that make working with SQLite in R very simple: `RSQLite` [@RSQLite] embeds the SQLite database engine in R, and `dbplyr` [@dbplyr] is the database back-end for `dplyr`. These packages allow to set up a database to remotely store tables and use these remote database tables as if they are in-memory data frames by automatically converting `dplyr` into SQL. 
Check out the [`RSQLite`](https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html) and [`dbplyr`](https://db.rstudio.com/databases/sqlite/) vignettes for more information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nAn SQLite database is easily created - the code below is really all there is. You do not need any external software. Note that we use the `extended_types = TRUE` option to enable date types when storing and fetching data. Otherwise, date columns are stored and retrieved as integers.\\index{Database!Creation} We will use the file `tidy_finance_r.sqlite`, located in the data subfolder, to retrieve data for all subsequent chapters. The initial part of the code ensures that the directory is created if it does not already exist.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (!dir.exists(\"data\")) {\n dir.create(\"data\")\n}\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\nNext, we create a remote table with the monthly Fama-French factor data. We do so with the function `dbWriteTable()`, which copies the data to our SQLite-database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n tidy_finance,\n \"factors_ff3_monthly\",\n value = factors_ff3_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nWe can use the remote table as an in-memory data frame by building a connection via `tbl()`.\\index{Database!Remote connection}\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db <- tbl(tidy_finance, \"factors_ff3_monthly\")\n```\n:::\n\n\nAll `dplyr` calls are evaluated lazily, i.e., the data is not in our R session's memory, and the database does most of the work. You can see that by noticing that the output below does not show the number of rows. In fact, the following code chunk only fetches the top 10 rows from the database for printing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(date, rf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Source: SQL [?? x 2]\n# Database: sqlite 3.47.1 [data/tidy_finance_r.sqlite]\n date rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ more rows\n```\n\n\n:::\n:::\n\n\nIf we want to have the whole table in memory, we need to `collect()` it. You will see that we regularly load the data into the memory in the next chapters.\\index{Database!Fetch}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(date, rf) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 780 × 2\n date rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ 775 more rows\n```\n\n\n:::\n:::\n\n\nThe last couple of code chunks is really all there is to organizing a simple database! 
You can also share the SQLite database across devices and programming languages.\n\nBefore we move on to the next data source, let us also store the other five tables in our new SQLite database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n tidy_finance,\n \"factors_ff5_monthly\",\n value = factors_ff5_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"factors_ff3_daily\",\n value = factors_ff3_daily,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"industries_ff_monthly\",\n value = industries_ff_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"factors_q_monthly\",\n value = factors_q_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"macro_predictors\",\n value = macro_predictors,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"cpi_monthly\",\n value = cpi_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow three steps: (i) Establish the connection to the SQLite database, (ii) call the table you want to extract, and (iii) collect the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\")\nfactors_q_monthly <- factors_q_monthly |> collect()\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command in the database, which rebuilds the database and frees up unused space. You can execute the command in the database using the `dbSendQuery()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- dbSendQuery(tidy_finance, \"VACUUM\")\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n SQL VACUUM\n ROWS Fetched: 0 [complete]\n Changed: 0\n```\n\n\n:::\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read about in [this tutorial.](https://www.sqlitetutorial.net/sqlite-vacuum/) \\index{Database!Cleaning}\n\nWe store the result of the above query in `res` because the database keeps the result set open. To close open results and avoid warnings going forward, we can use `dbClearResult()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbClearResult(res)\n```\n:::\n\n\nApart from cleaning up, you might be interested in listing all the tables that are currently in your database. 
You can do this via the `dbListTables()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListTables(tidy_finance)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"cpi_monthly\" \"factors_ff3_daily\" \n[3] \"factors_ff3_monthly\" \"factors_ff5_monthly\" \n[5] \"factors_q_monthly\" \"industries_ff_monthly\"\n[7] \"macro_predictors\" \n```\n\n\n:::\n:::\n\n\nThis function comes in handy if you are unsure about the correct naming of the tables in your database.\n\n## Key Takeaways\n\n- Importing Fama-French factors, q-factors, macroeconomic indicators, and CPI data is simplified through API calls, CSV parsing, and web scraping techniques.\n- The `tidyfinance` R package offers pre-processed access to financial datasets, reducing manual data cleaning and saving valuable time.\n- Creating a centralized SQLite database helps manage and organize data efficiently across projects, while maintaining reproducibility.\n- Structured database storage supports scalable data access, which is essential for long-term academic projects and collaborative work in finance.\n\n## Exercises\n\n1. Download the monthly Fama-French factors manually from [Ken French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `read_csv()`. Validate that you get the same data as via the `frenchdata` package.\n2. Download the daily Fama-French 5 factors using the `frenchdata` package. Use `get_french_data_list()` to find the corresponding table name. After the successful download and conversion to the column format that we used above, compare the `rf`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find.\n",
+ "markdown": "---\ntitle: Accessing and Managing Financial Data\naliases: \n - ../accessing-and-managing-financial-data.html\nmetadata:\n pagetitle: Accessing and Managing Financial Data with R\n description-meta: Download and organize open-source financial data using the programming language R. \n---\n\n::: callout-note\nYou are reading **Tidy Finance with R**. You can find the equivalent chapter for the sibling **Tidy Finance with Python** [here](../python/accessing-and-managing-financial-data.qmd).\n:::\n\nIn this chapter, we suggest a way to organize your financial data. Everybody who has experience with data is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome in the case of using different data formats, both across different projects and across different programming languages. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages.\n\nThis chapter shows how to import different open source data sets. Specifically, our data comes from the application programming interface (API) of Yahoo Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series that can be scraped directly from a website.\\index{API}\\index{Web scraping} We show how to process these raw data, as well as how to take a shortcut using the `tidyfinance` package, which provides a consistent interface to tidy financial data. We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the global R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(tidyfinance)\nlibrary(scales)\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2024-12-31\")\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. Fortunately, there is a neat package by [Nelson Areal](https://github.com/nareal/frenchdata/) that allows us to access the data easily: the `frenchdata` package provides functions to download and read data sets from [Prof. Kenneth French finance data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) [@frenchdata].\\index{Data!Fama-French factors} \\index{Kenneth French homepage}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(frenchdata)\n```\n:::\n\n\nWe can use the `download_french_data()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market `mkt_excess`, size `smb` and value `hml` alongside the risk-free rates `rf`. 
Note that we have to do some manual work to correctly parse all the columns and scale them appropriately, as the raw Fama-French data comes in a rather impractical data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to `frenchdata`.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_raw <- download_french_data(\"Fama/French 3 Factors\")\nfactors_ff3_monthly <- factors_ff3_monthly_raw$subsets$data[[1]] |>\n mutate(\n date = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability (`rmw`) and investment (`cma`) factors. We demonstrate how the monthly factors are constructed in the chapter [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff5_monthly_raw <- download_french_data(\"Fama/French 5 Factors (2x3)\")\n\nfactors_ff5_monthly <- factors_ff5_monthly_raw$subsets$data[[1]] |>\n mutate(\n date = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML, RMW, CMA), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily_raw <- download_french_data(\"Fama/French 3 Factors [Daily]\")\n\nfactors_ff3_daily <- factors_ff3_daily_raw$subsets$data[[1]] |>\n mutate(\n date = ymd(date),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIn a subsequent chapter, we also use the 10 monthly industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindustries_ff_monthly_raw <- download_french_data(\"10 Industry Portfolios\")\n\nindustries_ff_monthly <- industries_ff_monthly_raw$subsets$data[[1]] |>\n mutate(date = floor_date(ymd(str_c(date, \"01\")), \"month\")) |>\n mutate(across(where(is.numeric), ~ . / 100)) |>\n select(date, everything()) |>\n filter(date >= start_date & date <= end_date) |> \n rename_with(str_to_lower)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. You should check out the other sets by calling `get_french_data_list()`. \n\nTo automatically download and process Fama-French data, you can also use the `tidyfinance` package with `type = \"factors_ff_3_monthly\"` or similar, e.g.:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"factors_ff_3_monthly\", \n start_date = start_date, \n end_date = end_date\n)\n```\n:::\n\n\nThe `tidyfinance` package implements the processing steps as above and returns the same cleaned data frame. 
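To convince yourself of this equivalence, you can run a quick comparison. This is only a sketch: `factors_ff3_monthly_tf` is a hypothetical name, and it assumes the shortcut returns the same column names as the manually built `factors_ff3_monthly`, as stated above:

::: {.cell}

```{.r .cell-code}
# Compare the tidyfinance shortcut to the manually processed factors
factors_ff3_monthly_tf <- download_data(
  type = "factors_ff_3_monthly",
  start_date = start_date,
  end_date = end_date
)

# TRUE if both data frames agree (up to numerical tolerance)
all.equal(
  factors_ff3_monthly_tf |> select(date, mkt_excess, smb, hml, rf),
  factors_ff3_monthly |> select(date, mkt_excess, smb, hml, rf)
)
```
:::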
The list of supported Fama-French data types can be retrieved as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist_supported_types(domain = \"Fama-French\")\n```\n:::\n\n\n## q-Factors\n\nIn recent years, the academic discourse has experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further information. The *q* factors can be downloaded directly from the authors' homepage using `read_csv()`.\\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we strip the \"R\\_\" prefix from the column names using regular expressions and write all column names in lowercase. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_q_monthly_link <-\n \"https://global-q.org/uploads/1/2/2/6/122679606/q5_factors_monthly_2023.csv\"\n\nfactors_q_monthly <- read_csv(factors_q_monthly_link) |>\n mutate(date = ymd(str_c(year, month, \"01\", sep = \"-\"))) |>\n rename_with(~str_remove(., \"R_\")) |>\n rename_with(str_to_lower) |>\n mutate(across(-date, ~. / 100)) |>\n select(date, risk_free = f, mkt_excess = mkt, everything()) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nAgain, you can use the `tidyfinance` package for a shortcut:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"factors_q5_monthly\", \n start_date = start_date, \n end_date = end_date\n)\n```\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data, updated to 2022, on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) The data is an XLSX file stored in a public Google Drive location, from which we directly export a CSV file.\\index{Data!Macro predictors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsheet_id <- \"1bM7vCWd3WOt95Sf9qjLPZjoiafgF_8EG\"\nsheet_name <- \"Monthly\"\nmacro_predictors_url <- paste0(\n \"https://docs.google.com/spreadsheets/d/\", sheet_id,\n \"/gviz/tq?tqx=out:csv&sheet=\", sheet_name\n)\nmacro_predictors_raw <- read_csv(macro_predictors_url)\n```\n:::\n\n\nNext, we transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006].\n2. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978].\n3. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988].\n4. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998].\n5. 
Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n6. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997].\n7. Net equity expansion (`ntis`), the 12-month moving sums of net issues by NYSE-listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n8. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank of St. Louis [@Campbell1987].\n9. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n10. Long-term rate of return (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n11. Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n12. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989].\n13. Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website](https://sites.google.com/view/agoyal145).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors <- macro_predictors_raw |>\n mutate(date = ym(yyyymm)) |>\n mutate(across(where(is.character), as.numeric)) |>\n mutate(\n IndexDiv = Index + D12,\n logret = log(IndexDiv) - log(lag(IndexDiv)),\n Rfree = log(Rfree + 1),\n rp_div = lead(logret - Rfree, 1), # Future excess market return\n dp = log(D12) - log(Index), # Dividend Price ratio\n dy = log(D12) - log(lag(Index)), # Dividend yield\n ep = log(E12) - log(Index), # Earnings price ratio\n de = log(D12) - log(E12), # Dividend payout ratio\n tms = lty - tbl, # Term spread\n dfy = BAA - AAA # Default yield spread\n ) |>\n select(\n date, rp_div, dp, dy, ep, de, svar,\n bm = `b/m`, ntis, tbl, lty, ltr,\n tms, dfy, infl\n ) |>\n filter(date >= start_date & date <= end_date) |>\n drop_na()\n```\n:::\n\n\nTo get the equivalent data through `tidyfinance`, you can call:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"macro_predictors_monthly\",\n start_date = start_date,\n end_date = end_date\n)\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve Bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. The data can be downloaded directly from FRED by constructing the appropriate URL. 
For instance, let us consider the consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS) key:\\index{Data!FRED}\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseries <- \"CPIAUCNS\"\ncpi_url <- paste0(\n \"https://fred.stlouisfed.org/graph/fredgraph.csv?id=\", series\n)\n```\n:::\n\n\nWe can then use the `httr2` [@httr2] package to request the CSV, extract the data from the response body, and convert the columns to a tidy format:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(httr2)\n\n# Perform the request and extract the CSV text from the response body\nresp <- request(cpi_url) |> \n req_perform()\nresp_csv <- resp |> \n resp_body_string() \n\n# Parse the CSV and normalize the price level by its latest value\ncpi_monthly <- resp_csv |> \n read_csv() |>\n mutate(\n date = as.Date(observation_date),\n value = as.numeric(.data[[series]]),\n series = series,\n .keep = \"none\"\n ) |>\n filter(date >= start_date & date <= end_date) |> \n mutate(\n cpi = value / value[date == max(date)]\n )\n```\n:::\n\n\nThe last step sets the current (latest) price level as the reference price level.\n\nThe `tidyfinance` package can, of course, also fetch the same index data and many more data series:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload_data(\n type = \"fred\",\n series = \"CPIAUCNS\",\n start_date = start_date,\n end_date = end_date\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 0 × 3\n# ℹ 3 variables: date <date>, value <dbl>, series <chr>\n```\n\n\n:::\n:::\n\n\nTo download other time series, we just have to look them up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. If your desired time series is not supported through `tidyfinance`, we recommend working with the `fredr` package [@fredr]. Note that you need to get an API key to use its functionality. We refer to the package documentation for details.\n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our R session, let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code.\n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://www.sqlite.org/index.html) database, which is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases and heavily inspired the `dplyr` functions. We refer to [this tutorial](https://www.w3schools.com/sql/sql_intro.asp) for more information on SQL.\\index{Database!SQLite}\n\nThere are two packages that make working with SQLite in R very simple: `RSQLite` [@RSQLite] embeds the SQLite database engine in R, and `dbplyr` [@dbplyr] is the database back-end for `dplyr`. These packages allow us to set up a database to remotely store tables and to use these remote database tables as if they were in-memory data frames by automatically converting `dplyr` code into SQL. 
Check out the [`RSQLite`](https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html) and [`dbplyr`](https://db.rstudio.com/databases/sqlite/) vignettes for more information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nAn SQLite database is easily created - the code below is really all there is. You do not need any external software. Note that we use the `extended_types = TRUE` option to enable date types when storing and fetching data. Otherwise, date columns are stored and retrieved as integers.\\index{Database!Creation} We will use the file `tidy_finance_r.sqlite`, located in the data subfolder, to retrieve data for all subsequent chapters. The initial part of the code ensures that the directory is created if it does not already exist.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nif (!dir.exists(\"data\")) {\n dir.create(\"data\")\n}\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\nNext, we create a remote table with the monthly Fama-French factor data. We do so with the function `dbWriteTable()`, which copies the data to our SQLite database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n tidy_finance,\n \"factors_ff3_monthly\",\n value = factors_ff3_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nWe can use the remote table as an in-memory data frame by building a connection via `tbl()`.\\index{Database!Remote connection}\n\n\n::: {.cell}\n\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db <- tbl(tidy_finance, \"factors_ff3_monthly\")\n```\n:::\n\n\nAll `dplyr` calls are evaluated lazily, i.e., the data is not in our R session's memory, and the database does most of the work. You can see that by noticing that the output below does not show the number of rows. In fact, the following code chunk only fetches the top 10 rows from the database for printing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(date, rf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Source: SQL [?? x 2]\n# Database: sqlite 3.47.1 [data/tidy_finance_r.sqlite]\n date rf\n <date> <dbl>\n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ more rows\n```\n\n\n:::\n:::\n\n\nIf we want to have the whole table in memory, we need to `collect()` it. You will see that we regularly load the data into memory in the next chapters.\\index{Database!Fetch}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(date, rf) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 780 × 2\n date rf\n <date> <dbl>\n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ 775 more rows\n```\n\n\n:::\n:::\n\n\nThe last couple of code chunks is really all there is to organizing a simple database! 
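If you are curious about the SQL that `dbplyr` generates under the hood, you can print the translation of a lazy query with `show_query()` from `dplyr`. A minimal sketch using the remote table from above:

::: {.cell}

```{.r .cell-code}
# Inspect the SQL translation of a lazy dplyr pipeline (nothing is fetched)
factors_ff3_monthly_db |>
  select(date, rf) |>
  show_query()
```
:::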
You can also share the SQLite database across devices and programming languages.\n\nBefore we move on, let us also store the other six tables in our new SQLite database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n tidy_finance,\n \"factors_ff5_monthly\",\n value = factors_ff5_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"factors_ff3_daily\",\n value = factors_ff3_daily,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"industries_ff_monthly\",\n value = industries_ff_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"factors_q_monthly\",\n value = factors_q_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"macro_predictors\",\n value = macro_predictors,\n overwrite = TRUE\n)\n\ndbWriteTable(\n tidy_finance,\n \"cpi_monthly\",\n value = cpi_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow three steps: (i) establish the connection to the SQLite database, (ii) call the table you want to extract, and (iii) collect the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\")\nfactors_q_monthly <- factors_q_monthly |> collect()\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future use. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command, which rebuilds the database and frees up unused space. You can execute the command in the database using the `dbSendQuery()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- dbSendQuery(tidy_finance, \"VACUUM\")\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n<SQLiteResult>\n SQL VACUUM\n ROWS Fetched: 0 [complete]\n Changed: 0\n```\n\n\n:::\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read about in [this tutorial.](https://www.sqlitetutorial.net/sqlite-vacuum/) \\index{Database!Cleaning}\n\nWe store the result of the above query in `res` because the database keeps the result set open. To close open results and avoid warnings going forward, we can use `dbClearResult()`. A compact sketch that combines these housekeeping steps appears at the end of this chapter.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbClearResult(res)\n```\n:::\n\n\nApart from cleaning up, you might be interested in listing all the tables that are currently in your database. 
You can do this via the `dbListTables()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListTables(tidy_finance)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"beta\" \"compustat\" \n [3] \"cpi_monthly\" \"crsp_daily\" \n [5] \"crsp_monthly\" \"factors_ff3_daily\" \n [7] \"factors_ff3_monthly\" \"factors_ff5_monthly\" \n [9] \"factors_q_monthly\" \"fisd\" \n[11] \"industries_ff_monthly\" \"macro_predictors\" \n[13] \"trace_enhanced\" \n```\n\n\n:::\n:::\n\n\nThis function comes in handy if you are unsure about the correct naming of the tables in your database.\n\n## Key Takeaways\n\n- Importing Fama-French factors, q-factors, macroeconomic indicators, and CPI data is simplified through API calls, CSV parsing, and web scraping techniques.\n- The `tidyfinance` R package offers pre-processed access to financial datasets, reducing manual data cleaning and saving valuable time.\n- Creating a centralized SQLite database helps manage and organize data efficiently across projects, while maintaining reproducibility.\n- Structured database storage supports scalable data access, which is essential for long-term academic projects and collaborative work in finance.\n\n## Exercises\n\n1. Download the monthly Fama-French factors manually from [Ken French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `read_csv()`. Validate that you get the same data as via the `frenchdata` package.\n2. Download the daily Fama-French 5 factors using the `frenchdata` package. Use `get_french_data_list()` to find the corresponding table name. After the successful download and conversion to the column format that we used above, compare the `rf`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find.\n",
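Picking up the housekeeping steps from the section on managing SQLite databases, here is a final optional sketch (not part of the original chapter code): `file.size()` from base R lets you quantify what `VACUUM` reclaims, and `dbDisconnect()` from `DBI` closes the connection once you are done.

::: {.cell}

```{.r .cell-code}
# Hypothetical housekeeping: measure the space that VACUUM reclaims
size_before <- file.size("data/tidy_finance_r.sqlite")

res <- dbSendQuery(tidy_finance, "VACUUM")
dbClearResult(res)

size_after <- file.size("data/tidy_finance_r.sqlite")
c(before_mb = size_before / 1e6, after_mb = size_after / 1e6)

# Close the connection once you are done working with the database
dbDisconnect(tidy_finance)
```
:::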
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
diff --git a/docs/accessing-and-managing-financial-data.html b/docs/accessing-and-managing-financial-data.html
index 0aa4e119..2c36535a 100644
--- a/docs/accessing-and-managing-financial-data.html
+++ b/docs/accessing-and-managing-financial-data.html
@@ -6,6 +6,10 @@
var hash = window.location.hash.startsWith('#') ? window.location.hash.slice(1) : window.location.hash;
var redirect = redirects[hash] || redirects[""] || "/";
window.document.title = 'Redirect to ' + redirect;
+ if (!redirects[hash]) {
+ redirect = redirect + window.location.hash;
+ }
+ redirect = redirect + window.location.search;
window.location.replace(redirect);
diff --git a/docs/python/accessing-and-managing-financial-data.html b/docs/python/accessing-and-managing-financial-data.html
index 681f17c8..8e787a63 100644
--- a/docs/python/accessing-and-managing-financial-data.html
+++ b/docs/python/accessing-and-managing-financial-data.html
@@ -2,7 +2,7 @@
-
+
@@ -79,7 +79,7 @@
}
-
+
@@ -92,14 +92,15 @@
+
-
+
-
+
-
+
@@ -173,7 +174,7 @@
var macros = [];
for (var i = 0; i < mathElements.length; i++) {
var texText = mathElements[i].firstChild;
- if (mathElements[i].tagName == "SPAN") {
+ if (mathElements[i].tagName == "SPAN" && texText && texText.data) {
window.katex.render(texText.data, mathElements[i], {
displayMode: mathElements[i].classList.contains('display'),
throwOnError: false,
@@ -204,7 +205,8 @@