From 074a3153d9c2cf6476018c52fb113e5849a91f4b Mon Sep 17 00:00:00 2001 From: Marija Selakovic Date: Sun, 14 Sep 2025 21:32:20 +0200 Subject: [PATCH 1/4] pandas: Starter tutorial --- docs/connect/df/index.md | 50 ++------------ docs/integrate/index.md | 1 + docs/integrate/pandas/index.md | 58 ++++++++++++++++ docs/integrate/pandas/tutorial-start.md | 90 +++++++++++++++++++++++++ 4 files changed, 153 insertions(+), 46 deletions(-) create mode 100644 docs/integrate/pandas/index.md create mode 100644 docs/integrate/pandas/tutorial-start.md diff --git a/docs/connect/df/index.md b/docs/connect/df/index.md index 50cc9e01..2ca00900 100644 --- a/docs/connect/df/index.md +++ b/docs/connect/df/index.md @@ -40,50 +40,10 @@ the Python libraries that you know and love, like NumPy, pandas, and scikit-lear - [Dask code examples] -(pandas)= ## pandas - -:::{rubric} About -::: - -```{div} -:style: "float: right" -[![](https://pandas.pydata.org/static/img/pandas.svg){w=180px}](https://pandas.pydata.org/) -``` - -[pandas] is a fast, powerful, flexible, and easy-to-use open-source data analysis -and manipulation tool, built on top of the Python programming language. - -Pandas (stylized as pandas) is a software library written for the Python programming -language for data manipulation and analysis. In particular, it offers data structures -and operations for manipulating numerical tables and time series. - -:::{rubric} Data Model -::: -- Pandas is built around data structures called Series and DataFrames. Data for these - collections can be imported from various file formats such as comma-separated values, - JSON, Parquet, SQL database tables or queries, and Microsoft Excel. -- A Series is a 1-dimensional data structure built on top of NumPy's array. -- Pandas includes support for time series, such as the ability to interpolate values - and filter using a range of timestamps. -- By default, a Pandas index is a series of integers ascending from 0, similar to the - indices of Python arrays. However, indices can use any NumPy data type, including - floating point, timestamps, or strings. -- Pandas supports hierarchical indices with multiple values per data point. An index - with this structure, called a "MultiIndex", allows a single DataFrame to represent - multiple dimensions, similar to a pivot table in Microsoft Excel. Each level of a - MultiIndex can be given a unique name. - -```{div} -:style: "clear: both" -``` - -:::{rubric} Learn +:::{seealso} +Please navigate to the dedicated page about {ref}`pandas`. ::: -- [Guide to efficient data ingestion to CrateDB with pandas] -- [Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy] -- [pandas code examples] -- [From data storage to data analysis: Tutorial on CrateDB and pandas] ## Polars @@ -96,13 +56,11 @@ Please navigate to the dedicated page about {ref}`polars`. 
[Dask]: https://www.dask.org/
[Dask DataFrames]: https://docs.dask.org/en/latest/dataframe.html
[Dask Futures]: https://docs.dask.org/en/latest/futures.html
-[pandas]: https://pandas.pydata.org/
+[Polars]: https://pola.rs/
[Dask code examples]: https://github.com/crate/cratedb-examples/tree/main/by-dataframe/dask
[Efficient batch/bulk INSERT operations with pandas, Dask, and SQLAlchemy]: https://cratedb.com/docs/python/en/latest/by-example/sqlalchemy/dataframe.html
-[From data storage to data analysis: Tutorial on CrateDB and pandas]: https://community.cratedb.com/t/from-data-storage-to-data-analysis-tutorial-on-cratedb-and-pandas/1440
-[Guide to efficient data ingestion to CrateDB with pandas]: https://community.cratedb.com/t/guide-to-efficient-data-ingestion-to-cratedb-with-pandas/1541
[Guide to efficient data ingestion to CrateDB with pandas and Dask]: https://community.cratedb.com/t/guide-to-efficient-data-ingestion-to-cratedb-with-pandas-and-dask/1482
[Import weather data using Dask]: https://github.com/crate/cratedb-examples/blob/main/topic/timeseries/dask-weather-data-import.ipynb
[Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy]: https://community.cratedb.com/t/importing-parquet-files-into-cratedb-using-apache-arrow-and-sqlalchemy/1161
-[pandas code examples]: https://github.com/crate/cratedb-examples/tree/main/by-dataframe/pandas
+[Polars code examples]: https://github.com/crate/cratedb-examples/tree/main/by-dataframe/polars

diff --git a/docs/integrate/index.md b/docs/integrate/index.md
index 73f1793d..4fa31fb1 100644
--- a/docs/integrate/index.md
+++ b/docs/integrate/index.md
@@ -56,6 +56,7 @@ n8n/index
 nifi/index
 node-red/index
 oracle/index
+pandas/index
 plotly/index
 polars/index
 postgresql/index

diff --git a/docs/integrate/pandas/index.md b/docs/integrate/pandas/index.md
new file mode 100644
index 00000000..ec259c3e
--- /dev/null
+++ b/docs/integrate/pandas/index.md
@@ -0,0 +1,58 @@
(pandas)=
# pandas

```{div}
:style: "float: right"
[![](https://pandas.pydata.org/static/img/pandas.svg){w=180px}](https://pandas.pydata.org/)
```
```{div} .clearfix
```

:::{rubric} About
:::

[pandas] is a fast, powerful, flexible, and easy-to-use open-source data analysis
and manipulation tool, built on top of the Python programming language.

Pandas (stylized as pandas) is designed for practical, real-world data analysis.
In particular, it offers data structures
and operations for manipulating numerical tables and time series.

:::{rubric} Data Model
:::
- Pandas is built around data structures called Series and DataFrames. Data for these
  collections can be imported from various file formats such as comma-separated values,
  JSON, Parquet, SQL database tables or queries, and Microsoft Excel.
- A Series is a 1-dimensional data structure built on top of NumPy's array.
- Pandas includes support for time series, such as the ability to interpolate values
  and filter using a range of timestamps.
- By default, a Pandas index is a series of integers ascending from 0, similar to the
  indices of Python arrays. However, indices can use any NumPy data type, including
  floating point, timestamps, or strings.
- Pandas supports hierarchical indices with multiple values per data point. An index
  with this structure, called a "MultiIndex", allows a single DataFrame to represent
  multiple dimensions, similar to a pivot table in Microsoft Excel. Each level of a
  MultiIndex can be given a unique name, as the sketch below illustrates.
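The following is a minimal sketch of these structures in plain pandas; the data
values are made up for illustration:

```python
import numpy as np
import pandas as pd

# A Series: a one-dimensional, labeled array.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# A DataFrame: a two-dimensional table of labeled columns.
df = pd.DataFrame({
    "sensor": ["s1", "s1", "s2", "s2"],
    "day": ["mon", "tue", "mon", "tue"],
    "value": [20.5, 21.0, 19.8, 20.1],
})

# A MultiIndex: hierarchical indexing with named levels, letting one
# DataFrame represent multiple dimensions, like a pivot table.
pivoted = df.set_index(["sensor", "day"])
print(pivoted.loc[("s1", "tue"), "value"])  # 21.0

# Time series support: interpolate missing values over a timestamp index.
ts = pd.Series([1.0, np.nan, 3.0],
               index=pd.date_range("2024-01-01", periods=3, freq="D"))
print(ts.interpolate())  # fills the gap with 2.0
```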
:::{rubric} Learn
:::
- {ref}`pandas-tutorial-start`
- [Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy]
- [Guide to efficient data ingestion to CrateDB with pandas]
- [pandas code examples]


:::{toctree}
:maxdepth: 1
:hidden:
Starter tutorial <tutorial-start>
Jupyter tutorial <tutorial-jupyter>
:::


[Efficient batch/bulk INSERT operations with pandas, Dask, and SQLAlchemy]: https://cratedb.com/docs/python/en/latest/by-example/sqlalchemy/dataframe.html
[Guide to efficient data ingestion to CrateDB with pandas]: https://community.cratedb.com/t/guide-to-efficient-data-ingestion-to-cratedb-with-pandas/1541
[Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy]: https://community.cratedb.com/t/importing-parquet-files-into-cratedb-using-apache-arrow-and-sqlalchemy/1161
[pandas]: https://pandas.pydata.org/
[pandas code examples]: https://github.com/crate/cratedb-examples/tree/main/by-dataframe/pandas

diff --git a/docs/integrate/pandas/tutorial-start.md b/docs/integrate/pandas/tutorial-start.md
new file mode 100644
index 00000000..3464691d
--- /dev/null
+++ b/docs/integrate/pandas/tutorial-start.md
@@ -0,0 +1,90 @@
(pandas-tutorial-start)=
# From data storage to data analysis: Tutorial on CrateDB and pandas

## Introduction

Pandas is an open-source data manipulation and analysis library for Python. It is widely used for handling and analyzing data in a variety of fields, including finance and research.

One of the key benefits of pandas is its ability to handle and manipulate large datasets, making it a valuable tool for data scientists and analysts. The library provides easy-to-use data structures and functions for data cleaning, transformation, and analysis, making it an essential part of the data analysis workflow.

Using CrateDB and pandas together can be a powerful combination for handling large volumes of data and performing complex data analysis tasks. In this tutorial, we will use a real-world dataset to showcase how CrateDB and pandas work together for effective data analysis.

## Requirements

To follow along with this tutorial, you will need:

* A running instance of CrateDB 5.2.
* Python 3.x with the [pandas 2](https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html) and [crate 0.31](https://github.com/crate/crate-python) packages installed.
* A real-world dataset in CSV format. In this tutorial, we will be using the shop customer data available on [Kaggle](https://www.kaggle.com/datasets/datascientistanna/customers-dataset).

## Setting up CrateDB

Before we can start using CrateDB, we need to set it up. You can either download and install CrateDB locally via [Docker](https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#docker) or [tarball](https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#try-cratedb-without-installing) or use a [CrateDB Cloud](https://crate.io/download?hsCtaTracking=caa20047-f2b6-4e8c-b7f9-63fbf818b17f%7Cf1ad6eaa-39ac-49cd-8115-ed7d5dac4d63) instance with an option of the free cluster.

Once you have a running instance of CrateDB, create a new table to store the customer data dataset.
Here is an SQL command to create a table:

``` sql
CREATE TABLE IF NOT EXISTS "doc"."customer_data" (
    "customerid" INTEGER,
    "gender" TEXT,
    "age" INTEGER,
    "annualincome" INTEGER,
    "spendingscore" INTEGER,
    "profession" TEXT,
    "workexperience" INTEGER,
    "familysize" INTEGER
);
```

After creating the table, you can import the customer data dataset into CrateDB using the `COPY FROM` command:

```sql
COPY "doc"."customer_data" FROM 'file:///path/to/Customers.csv' WITH (format='csv', delimiter=',');
```

Once you have CrateDB running, you can start exploring data with pandas.

## Querying data with CrateDB and pandas

The first step is to import the `pandas` library and specify the query you want to execute on CrateDB. In our example, we want to fetch all customer data.

To read data from CrateDB and work with it in a pandas DataFrame, use the `read_sql` method as illustrated below.

```python
import pandas as pd

query = "SELECT * FROM customer_data"
df = pd.read_sql(query, 'crate://localhost:4200')
```

In the above code, we establish a connection to a local CrateDB instance running on localhost on port 4200, execute an SQL query, and return the results as a pandas DataFrame. You can further modify the query to retrieve only the columns you need or to filter the data based on some condition.

## Analyze the data

Now that the data is loaded into a pandas DataFrame, we can perform various analyses and manipulations on it. For instance, we can group the data by a certain column and calculate the average value of another column:

```python
avg_income = df.groupby("profession")["annualincome"].mean()
```

In this example, we group the data in the DataFrame by the `profession` column and calculate the average annual income for each profession. You can plot the resulting averages with the `plot()` method, specifying the kind of plot (a bar chart) and the label rotation, and then display the figure with `plt.show()` from `matplotlib`:

```python
import matplotlib.pyplot as plt

avg_income.plot(kind='bar', legend=True, rot=0)
plt.show()
```

![python-plot|690x479, 100%](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/ab652c811106f6a79b911a443bb8c11099f55b98.png)
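Beyond a single aggregate, several statistics can be computed in one pass. The following is a small sketch building on the `df` from above; the derived column names (`avg_income`, `avg_spending`, `customers`) are ours:

```python
# Multiple aggregates per profession, computed in one pass.
stats = df.groupby("profession").agg(
    avg_income=("annualincome", "mean"),
    avg_spending=("spendingscore", "mean"),
    customers=("customerid", "count"),
)
print(stats.sort_values("avg_income", ascending=False))
```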
## Wrap up

That's it! You should now have a good idea of how to use CrateDB and pandas together to analyze large datasets stored in CrateDB. This allows you to take advantage of the powerful data manipulation capabilities of pandas to analyze and visualize your data.
For updates, new features, and answers to other questions you might have, join our [CrateDB](https://community.cratedb.com/) community.

From 84ffbcb1af35a0e2a69e208885bd602a0bb5fe9b Mon Sep 17 00:00:00 2001
From: Rafaela Santana
Date: Sun, 14 Sep 2025 21:33:26 +0200
Subject: [PATCH 2/4] pandas: Jupyter tutorial

---
 docs/integrate/pandas/index.md            |   1 +
 docs/integrate/pandas/tutorial-jupyter.md | 510 ++++++++++++++++++++++
 2 files changed, 511 insertions(+)
 create mode 100644 docs/integrate/pandas/tutorial-jupyter.md

diff --git a/docs/integrate/pandas/index.md b/docs/integrate/pandas/index.md
index ec259c3e..8d4c353e 100644
--- a/docs/integrate/pandas/index.md
+++ b/docs/integrate/pandas/index.md
@@ -38,6 +38,7 @@ and operations for manipulating numerical tables and time series.
:::{rubric} Learn
:::
- {ref}`pandas-tutorial-start`
+- {ref}`pandas-tutorial-jupyter`
- [Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy]
- [Guide to efficient data ingestion to CrateDB with pandas]
- [pandas code examples]

diff --git a/docs/integrate/pandas/tutorial-jupyter.md b/docs/integrate/pandas/tutorial-jupyter.md
new file mode 100644
index 00000000..bbf77282
--- /dev/null
+++ b/docs/integrate/pandas/tutorial-jupyter.md
@@ -0,0 +1,510 @@
(pandas-tutorial-jupyter)=
# Automating financial data collection and storage in CrateDB with Python and pandas

This tutorial will teach you how to automatically collect historical data from S&P-500 companies and store it all in CrateDB using Python.

**tl;dr**: I will go through how to

* import S&P-500 companies' data with the Yahoo! Finance API into a Jupyter Notebook,
* set up a connection to CrateDB with Python,
* create functions to create tables, insert values, and retrieve data from CrateDB,
* upload financial market data into CrateDB

Before anything else, I must make sure I have my setup ready.

So, let's get started.

## Setting up CrateDB, Jupyter, and Python

### CrateDB

If you're new to CrateDB and want to get started quickly and easily, a great option is to try the **Free Tier** in CrateDB Cloud. With the **Free Tier**, you have a limited cluster that is free forever; no payment method is required. Now, if you are ready to experience the full power of CrateDB Cloud, take advantage of the $200 in free credits to try the cluster of your dreams.

To start with CrateDB Cloud, [navigate to the CrateDB website](https://crate.io/download?hsCtaTracking=caa20047-f2b6-4e8c-b7f9-63fbf818b17f%7Cf1ad6eaa-39ac-49cd-8115-ed7d5dac4d63) and follow the steps to create your CrateDB Cloud account. Once you log in to the CrateDB Cloud UI, select **Deploy Cluster** to create your free cluster, and you are ready to go!

![cratedb-cloud-free-tier](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/26fc603ca998d39631f93f1eb7c5dbd30f437e56.gif)

With my CrateDB Cluster up and running, I can ensure Python is set up.

### Python

Python is a good fit for this project: it's simple, highly readable, and offers valuable analytics libraries for free.

I download [Python](https://www.python.org/downloads/), then return to the terminal to check whether Python was installed and which version I have with the command
`pip3 --version`,
which tells me I have Python 3.9 installed.

All set!

### Jupyter

The [Jupyter Notebook](https://jupyter.org/) is an open-source web application for creating and sharing documents that contain live code, equations, visualizations, and narrative text.

A Jupyter Notebook is an excellent environment for this project. It keeps executable code and human-readable output (tables, figures, etc.) in the same place!

I follow the [Jupyter installation tutorial](https://jupyter.org/install.html) for the Notebook, which is quickly done with Python and the terminal command
`pip3 install notebook`,
and then I run the Notebook (using Jupyter 1.0.0) with the command
`jupyter notebook`

Setup done!

Now I can access my Jupyter Notebook by opening the URL printed in the terminal after running this last command.
In my case, it is at http://localhost:8888/ + +## Creating a Notebook + +On Jupyter’s main page, I navigate to the **New** button on the top right and select **Python 3 (ipykernel)** + +![Jupyter Notebook](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/b53c7cbe73ef268108e856e19bf946d4cf0d987b.png){w=800px} + +An empty notebook opens. + +To make sure everything works before starting my project, I + +* call the notebook “financial-data-with-cratedb”, +* write a ‘Hello World!’ line with + +```python +print('Hello World!') +``` + +* run the code snippet by pressing `Alt` + `Enter` (or clicking on the **Run** button) + +![Hello World program](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/e25efa994ab0afefe946e2ec5b99d4b9b31cfad8.jpeg){w=800px} + +Great, it works! Now I can head to the following steps to download the financial data. + +## Getting all S&P-500 ticker symbols from Wikipedia + +When I read [yfinance](https://pypi.org/project/yfinance/)’s documentation (version 0.1.63), I find the `history` function, which gets a ticker symbol as a parameter and downloads the data from this company. + +I want to download data from all S&P-500 companies, so having a list with all their symbols would be perfect. + +I then found [this tutorial by Edoardo Romani](https://towardsdatascience.com/how-to-automate-financial-data-collection-with-python-using-tiingo-api-and-google-cloud-platform-b11d8c9afaa1), which shows how to get the symbols from the [List of S&P-500 companies' Wikipedia page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies) and store them in a list. + +So, in my Notebook, I import [BeautifulSoup 4.10.0](https://beautiful-soup-4.readthedocs.io/en/latest/) and [requests 2.26.0](https://pypi.org/project/requests/) to pull out HTML files from Wikipedia and create the following function: + +```python +import requests +from bs4 import BeautifulSoup + +def get_sp500_ticker_symbols(): + + # getting html from SP500 Companies List wikipedia page + + url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies" + r = requests.get(url,timeout = 2.5) + r_html = r.text + soup = BeautifulSoup(r_html, 'html.parser') + + # getting rows from wikipedia's table + + components_table = soup.find_all(id = "constituents") + data_rows = components_table[0].find("tbody").find_all("tr")[1:] + + # extracting ticker symbols from the data rows + + tickers = [] + for row in range(len(data_rows)): + stock = list(filter(None, data_rows[row].text.split("\n"))) + symbol = stock[0] + if (symbol.find('.') != -1): + symbol = symbol.replace('.', '-') + tickers.append(symbol) + tickers.sort() + return tickers +``` + +What this function does is: + +* it finds the S&P-500 companies table components in the Wikipedia page’s HTML code +* it extracts the table rows from the components and stores it in the `data_rows` variable +* it splits `data_rows` into the `stock` list, where each element contains information about one stock (Symbol, Security, SEC filings, …) +* it takes the Symbol for each `stock` list element and adds it to the `tickers` list +* finally, it sorts the `tickers` list in alphabetical order and returns it + +To check if it works, I will call this function and print the results with + +```python +tickers = get_sp500_ticker_symbols() +print(tickers) +``` + +and it looks like this: + +![get_sp500_ticker_symbols()](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/49935a7dbb2153ee7f6a24d99c44132f562e1e20.png){w=800px} + +Now that I have a list of all the stock 
tickers, I can move on and download their data with `yfinance`.

## Downloading financial data with yfinance

[Pandas](https://pandas.pydata.org/) is a famous package in Python, often used for data science. It shortens the process of handling data, has complete yet straightforward data representation forms, and makes tasks like filtering data easy.

Its key data structure is called a DataFrame, which allows storage and manipulation of tabular data: in this case, the columns are going to be the financial variables (such as “date”, “ticker”, “closing price”…) and the rows are going to be filled with data about the S&P-500 companies.

So, the first thing I do is import `yfinance` (0.1.63) and `pandas` (2.0.0):

```python
import yfinance as yf
import pandas as pd
```

Next, I design a function to download the data for a company over a given period.

First, I create a `data` DataFrame to store the stocks' `closing_date`, `ticker`, and `close_value`.

I get the data for the ticker over that period with the `Ticker.history` function from `yfinance`. I store the result in the `history` DataFrame, rename the index (which contains the date) to `closing_date`, as this is the column name I prefer for CrateDB, and then reset the index. Instead of having the date as the index, I have a column called `closing_date`, which holds the date information, and the rows are indexed trivially (0, 1, 2, …). I also add a `ticker` column containing the current ticker and rename the `Close` column to match the `close_value` name in the `data` DataFrame. Finally, I add the `closing_date`, `ticker`, and `close_value` data for that ticker to my `data` DataFrame.

The function returns the `data` DataFrame containing the `closing_date`, `ticker`, and `close_value` data for the given ticker over the `period`.

This is what `download_data` looks like:

```python
def download_data(ticker, period):

    data = pd.DataFrame(columns=['closing_date', 'ticker', 'close_value'])

    # downloading history for this ticker

    info = yf.Ticker(ticker)
    history = info.history(period=period)
    history.index.names = ['closing_date']
    history.reset_index(inplace=True)

    # adding a column for the ticker

    history['ticker'] = ticker

    # renaming column to fit into dataframe

    history.rename(columns={'Close': 'close_value'}, inplace=True)

    # adding values to the dataframe

    data = pd.concat(
        [data, history[['closing_date', 'ticker', 'close_value']]])

    return data
```

To check if everything works, I execute the function, store its result in the `my_data` variable, and print it:

```python
my_data = download_data('AAPL', '1mo')
my_data
```

and it looks like this:

![calling-download-data|690x348](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/6174430a9bd94b8956e2ba012267ca67a335d53b.png){w=800px}
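As an aside, `yfinance` can also fetch several symbols in one request. This batched variant is not used in the rest of this tutorial, but assuming the imports from above, a sketch could look like this:

```python
import yfinance as yf

# One request for several tickers; with multiple symbols, yf.download
# returns a DataFrame whose columns are a MultiIndex of (field, ticker).
batch = yf.download(['AAPL', 'MSFT'], period='1mo')

# Select the closing prices: one column per ticker.
close = batch['Close']
print(close.tail())
```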
## Connecting to CrateDB

In the **Overview** tab of my CrateDB Cloud cluster I find several ways to connect to CrateDB: via CLI, Python, and JavaScript, among others. So I select the **Python** option and choose one of the variants, such as **psycopg2** (version 2.9.1).

![connections-for-cratedb-cloud|690x386](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/2891e21d7ad9cd34eed068153285530badb0dc66.png){w=800px}

I copy the code to connect and add my password to it in the `<password>` field. It creates a `conn` variable, which stores the connection, and a `cursor` variable, which allows Python code to execute PostgreSQL commands. I adapt the code slightly so I leave the `cursor` open to use it later on. It then looks like this:

```python
# pip install psycopg2-binary
import psycopg2 as ps

conn = ps.connect(host="<host>", port=5432, user="admin", password="<password>", sslmode="require")
cursor = conn.cursor()
cursor.execute("SELECT name FROM sys.cluster")
result = cursor.fetchone()
print(result)
```

When I run this code it prints `('my-cluster',)`, which is the name I gave my cluster, so the connection works!

Now I can create more functions to create tables in CrateDB, insert my data values into a table, and retrieve data!

## Creating functions for CrateDB

### Creating a table

I will have the `closing_date`, `ticker`, and `close_value` columns in my table. Also, I want to pass the table name as a parameter and only create a new table in case the table does not exist yet, so I use the SQL keywords `CREATE TABLE IF NOT EXISTS` in my function.

Now I must create the complete statement as a string and execute it with the `cursor.execute` command:

```python
def create_table(table_name):
    columns = "(closing_date TIMESTAMP, ticker TEXT, close_value FLOAT)"
    statement = "CREATE TABLE IF NOT EXISTS \"" + table_name + "\"" + columns + ";"
    cursor.execute(statement)
```

Now I can move on to creating an insert function.

### Inserting values into CrateDB

I want to create a function that:

* gets the table name and the data as parameters
* makes an insert statement for this data
* executes this statement

*(In the next steps, I review each part of this function. However, I have a snippet of the complete function at the end of this section.)*

Formatting the entries is crucial for successful insertion. Because of that, this function became rather long, so I will go through each section separately and then join them all at the end.

* Before anything else, I import the `math` module to use later in this function.
* The function starts by creating an empty list called `values_array`. This list will hold the formatted values I want to insert into the table.
* Next, I loop through each row of the `data` and extract the row values using the `iloc` method, which returns the values of the specified row.
* For each row, I check if the `close_value` value for that row is `NaN` (not a number), and if so, set it to -1. This is done to handle missing data.
* Then I format the `closing_date` value to match the timestamp format that the table expects. The date is first converted to a string in the format "YYYY-MM-DD", then a time in the format "T00:00:00Z" is added to the end. The resulting string is then wrapped in single quotes to create a string that matches the expected timestamp format.
* Finally, I create a string representing the values for this row in the format `(closing_date, ticker, close_value)`, and append it to the `values_array` list. I repeat this process for each row in the `data` DataFrame.
```python
import math

def insert_values(table_name, data):

    values_array = []

    # adding each closing date, ticker and close value tuple to values array

    for row in range(len(data)):

        # saving entries from the ith row as a list of values

        row_values = data.iloc[row, :]

        # checking if there is a NaN entry and setting it to -1

        close_value = row_values['close_value']
        if (math.isnan(close_value)):
            close_value = -1

        # formatting date entries to match timestamp format

        closing_date = row_values['closing_date'].strftime("%Y-%m-%d")
        closing_date = "'{}'".format(
            closing_date + "{time}".format(time="T00:00:00Z"))

        # formatting this row as a (closing_date, ticker, close_value) tuple

        values_array.append("({},\'{}\',{})".format(
            closing_date, row_values['ticker'], close_value))
```

* After all the row values have been added to the `values_array` list, I create a new table with the specified name (if it does not already exist) using the `create_table` function.
* Then I create the first part of the SQL `INSERT` statement, which includes the table name and the column names we insert into (`closing_date`, `ticker`, and `close_value`). This part of the statement is stored in the `insert_stmt` variable.
* Next, I join the values tuples from `values_array` with commas and append them to `insert_stmt`. The final SQL `INSERT` statement is created by concatenating the `insert_stmt` variable and a semicolon at the end.

```python
    # creates a new table (in case it does not exist yet)

    create_table(table_name)

    # first part of the insert statement

    insert_stmt = "INSERT INTO \"{}\" (closing_date, ticker, close_value) VALUES ".format(
        table_name)

    # adding the comma-separated data tuples to the insert statement

    insert_stmt += ", ".join(values_array) + ";"
```

* Finally, the function executes the `INSERT` statement using the `cursor.execute()` method, and prints out a message indicating how many rows were inserted into the table.

```python
    cursor.execute(insert_stmt)

    print("Inserted " + str(len(data)) + " rows in CrateDB")
```

In summary, in `insert_values`, I take the table name and the data, format the data into an SQL `INSERT` statement, and insert the data into the specified table.
This is what the complete function looks like:

```python
import math

def insert_values(table_name, data):

    values_array = []

    # adding each closing date, ticker and close value tuple to values array

    for row in range(len(data)):

        # saving entries from the ith row as a list of values

        row_values = data.iloc[row, :]

        # checking if there is a NaN entry and setting it to -1

        close_value = row_values['close_value']
        if (math.isnan(close_value)):
            close_value = -1

        # formatting date entries to match timestamp format

        closing_date = row_values['closing_date'].strftime("%Y-%m-%d")
        closing_date = "'{}'".format(
            closing_date + "{time}".format(time="T00:00:00Z"))

        # formatting this row as a (closing_date, ticker, close_value) tuple

        values_array.append("({},\'{}\',{})".format(
            closing_date, row_values['ticker'], close_value))

    # creates a new table (in case it does not exist yet)

    create_table(table_name)

    # first part of the insert statement

    insert_stmt = "INSERT INTO \"{}\" (closing_date, ticker, close_value) VALUES ".format(
        table_name)

    # adding the comma-separated data tuples to the insert statement

    insert_stmt += ", ".join(values_array) + ";"

    cursor.execute(insert_stmt)

    print("Inserted " + str(len(data)) + " rows in CrateDB")
```

Now I can move on to the next function, which is quite handy regarding automation.

### Selecting the last inserted date

I want my stock market data in CrateDB to be up to date, which requires running this script regularly.

However, I do not want to download data I already have or have duplicate entries in CrateDB.

That's why I create this function, which selects the most recent date from the data in my CrateDB table. I will use this date to calculate the period to download data from in the `download_data` function: this way, this function will only download new data!

```python
# the date utilities used below need to be imported first
from datetime import datetime, timedelta

def select_last_inserted_date(table_name):

    # creating table (only in case it does not exist yet)

    create_table(table_name)

    # selecting the maximum date in my table

    statement = "select max(closing_date) from " + table_name + ";"
    cursor.execute(statement)

    # fetching the results from the query

    last_date_data = cursor.fetchall()
    last_date = last_date_data[0][0]

    # if the query is empty or the date is None, start from 2023-01-01

    if (len(last_date_data) == 0 or last_date is None):
        print("No data yet, will return: 2023-01-01")
        return datetime.strptime("2023-01-01", "%Y-%m-%d")

    # printing the last date

    print("Most recent data on CrateDB from: " + last_date.strftime("%Y-%m-%d"))

    return last_date
```

In the `get_period_to_download` function, I calculate the difference between today and the last inserted date and return the corresponding period.

```python
def get_period_to_download(last_date):

    # calculating the difference between today and the last date

    today = datetime.now()
    days_difference = today - last_date.replace(tzinfo=None)

    # return the period corresponding to the difference, or 1 year

    if (days_difference < timedelta(days=5)):
        return '5d'
    elif (days_difference < timedelta(weeks=4)):
        return '1mo'
    elif (days_difference < timedelta(weeks=13)):
        return '3mo'
    elif (days_difference < timedelta(weeks=26)):
        return '6mo'
    else:
        return '1y'
```
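A quick, hypothetical sanity check of the period logic, using the `datetime` imports from above:

```python
# A last-inserted date three days ago should map to the shortest period.
three_days_ago = datetime.now() - timedelta(days=3)
print(get_period_to_download(three_days_ago))  # -> '5d'

# Roughly two months back falls into the three-month bucket.
two_months_ago = datetime.now() - timedelta(weeks=8)
print(get_period_to_download(two_months_ago))  # -> '3mo'
```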
The only thing missing is a method to wrap up everything. Let's move on to it!

### Updating the table

This method wraps up all the others.

* I first get the most recent date in the table with `select_last_inserted_date`
* Then I calculate the period between today and this date with `get_period_to_download`
* I take the list of all S&P-500 tickers with `get_sp500_ticker_symbols`
* And then, for each of these tickers, I download the data with `download_data` and insert it into CrateDB with `insert_values`

This is what the final function looks like:

```python
def update_table(table_name):

    # getting the last date in the table

    last_date = select_last_inserted_date(table_name)

    # calculating the period to download data from
    period = get_period_to_download(last_date)

    # getting all S&P-500 tickers
    tickers = get_sp500_ticker_symbols()

    # downloading and inserting data from each ticker
    for ticker in tickers:
        data = download_data(ticker, period)
        insert_values(table_name, data)
```

## Final Test

I have all the necessary functions ready to work!

To have a clean final test, I

* place all the functions at the beginning of the Notebook and run their code blocks
* leave the CrateDB connection and the `update_table` call at the end

```python
# Connecting to CrateDB

conn = ps.connect(host="<host>", port=5432,
                  user="admin", password="<password>", sslmode="require")
cursor = conn.cursor()

# Updating table

table_name = "sp500"

update_table(table_name)
```

I navigate to the CrateDB Admin UI, where I see that the new table **sp500** was created and that it is filled with the financial data.

![sp500-data-cratedb|690x405](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/868f1a3b16b58884779377892908243a8779b15f.png){w=800px}


I make a simple query to get Apple's data from my **sp500** table:

```sql
SELECT *
FROM "admin"."sp500"
WHERE ticker = 'AAPL'
ORDER BY closing_date LIMIT 100;
```
And instantly get the results.

![apple-data|690x405](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/9e2fbb2abdf2946bf063466de4f8468650c6d578.png){w=800px}

Now I can run this script whenever I want to update my database with new data!

## Wrap up

In this post, I introduced a method to download financial data from Yahoo Finance using Python and pandas and showed how to insert this data into CrateDB.

I benefited from CrateDB's ability to rapidly insert large amounts of data and presented a method to get the most recent inserted date from CrateDB. That way, I can efficiently keep my records in CrateDB up to date!

From 096ec9cc499bf251e2f5b091cc79fe8cce04d6c2 Mon Sep 17 00:00:00 2001
From: Marija Selakovic
Date: Sun, 14 Sep 2025 21:41:57 +0200
Subject: [PATCH 3/4] pandas: Efficient ingest

---
 docs/integrate/pandas/efficient-ingest.md | 57 +++++++++++++++++++++++
 docs/integrate/pandas/index.md            |  8 ++--
 2 files changed, 61 insertions(+), 4 deletions(-)
 create mode 100644 docs/integrate/pandas/efficient-ingest.md

diff --git a/docs/integrate/pandas/efficient-ingest.md b/docs/integrate/pandas/efficient-ingest.md
new file mode 100644
index 00000000..d85ec515
--- /dev/null
+++ b/docs/integrate/pandas/efficient-ingest.md
@@ -0,0 +1,57 @@
(pandas-efficient-ingest)=
# Guide to efficient data ingestion to CrateDB with pandas

## Introduction
Bulk insert is a technique for efficiently inserting large amounts of data into a database by submitting multiple rows of data in a single database transaction.
Instead of executing multiple SQL `INSERT` statements for each individual row of data, a bulk insert allows the database to process and store a batch of data at once. This approach can significantly improve the performance of data insertion, especially when dealing with large datasets.

In this tutorial, you will learn how to efficiently perform [bulk inserts](https://crate.io/docs/python/en/latest/by-example/sqlalchemy/dataframe.html) into CrateDB with [pandas](https://pandas.pydata.org/) using the `insert_bulk` method, available in the `crate` Python library. To follow along with this tutorial, you should have the following:

* A working installation of CrateDB. To get started with CrateDB, check [this link](https://crate.io/lp-free-trial?hsCtaTracking=c2099713-cafa-4de6-a97e-2f86d80a788f%7C3a12b78e-e605-461c-9bd8-628d0d9e2522).
* Python, pandas, SQLAlchemy, and the [crate driver](https://pypi.org/project/crate/) installed on your machine
* Basic familiarity with pandas and SQL

## Bulk insert to CrateDB

The following example illustrates how to implement batch insert with the pandas library by using the `insert_bulk` method available in the `crate` driver.

```python
import sqlalchemy as sa
import pandas as pd
from crate.client.sqlalchemy.support import insert_bulk
from pandas._testing import makeTimeDataFrame

INSERT_RECORDS = 5000000
CHUNK_SIZE = 50000

# generate a time-indexed DataFrame with one row per second
df = makeTimeDataFrame(nper=INSERT_RECORDS, freq="S")
engine = sa.create_engine('crate://localhost:4200')

df.to_sql(
    name="cratedb-demo",
    con=engine,
    if_exists="replace",
    index=False,
    chunksize=CHUNK_SIZE,
    method=insert_bulk,
)
```

By running this code, you will generate a DataFrame with a time-based index containing 5,000,000 rows of data. Each row represents a timestamp with a frequency of 1 second (`freq="S"`). The DataFrame is then inserted into a `cratedb-demo` table in CrateDB using the `to_sql()` method. If the table already exists, it will be replaced with the new data. The data insertion will be performed in batches, with each batch containing 50,000 records. Defining the `chunksize` parameter helps in managing memory and improving performance during the data insertion process.

The above code runs in approximately 14 seconds on a local Mac M1 machine with 16 GiB RAM. However, if we insert data into CrateDB by setting the `method` parameter to `None` (one insert per row), the execution time increases to 27 seconds.

## How to find the right chunksize

Determining the right `chunksize` depends on several factors, such as the size of your data, the number of columns in your data set, and the available memory of your machine.

The `chunksize` parameter in the `to_sql()` method controls the number of rows inserted in each batch. By default, `chunksize=None`, which means the entire DataFrame will be written to the database at once. However, when working with large datasets, it is recommended to set a smaller `chunksize` value to avoid memory issues and to improve the performance of the data insertion.

To determine the right `chunksize` value, you can try different values and observe the memory usage and the time it takes to complete the data insertion. A good starting point is to set the `chunksize` value to a fraction of the total number of rows in your DataFrame. For example, you can start with a `chunksize` value of 10,000 or 50,000 rows and see how it performs.
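One way to compare candidate values is to time a few insert runs. Below is a minimal sketch, assuming the `engine`, `df`, and `insert_bulk` import from above; note that `if_exists="replace"` recreates the demo table on every run:

```python
import time

# Hypothetical experiment: time the bulk insert for a few chunk sizes.
for chunk_size in (10_000, 50_000, 100_000):
    start = time.perf_counter()
    df.to_sql(
        name="cratedb-demo",
        con=engine,
        if_exists="replace",
        index=False,
        chunksize=chunk_size,
        method=insert_bulk,
    )
    print(f"chunksize={chunk_size}: {time.perf_counter() - start:.1f}s")
```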
If the data insertion is slow, you can try increasing the `chunksize` value to reduce the number of batches. On the other hand, if you encounter memory issues, you can try reducing the `chunksize` value.

## Conclusion

Congratulations! You have learned how to implement an efficient data insert into CrateDB using pandas and the `insert_bulk` method. This method allows for efficient and fast data insertion, making it suitable for handling large datasets.

If you liked this tutorial and want to explore more CrateDB functionality, please visit our [documentation](https://crate.io/docs) and join our [community](https://community.cratedb.com/).

diff --git a/docs/integrate/pandas/index.md b/docs/integrate/pandas/index.md
index 8d4c353e..62d73cf9 100644
--- a/docs/integrate/pandas/index.md
+++ b/docs/integrate/pandas/index.md
@@ -39,8 +39,9 @@ and operations for manipulating numerical tables and time series.
:::
- {ref}`pandas-tutorial-start`
- {ref}`pandas-tutorial-jupyter`
-- [Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy]
-- [Guide to efficient data ingestion to CrateDB with pandas]
+- {ref}`arrow-import-parquet`
+- {ref}`pandas-efficient-ingest`
+- [Efficient batch/bulk INSERT operations with pandas, Dask, and SQLAlchemy]
- [pandas code examples]


@@ -49,11 +50,10 @@
:hidden:
Starter tutorial <tutorial-start>
Jupyter tutorial <tutorial-jupyter>
+Efficient ingest <efficient-ingest>
:::


[Efficient batch/bulk INSERT operations with pandas, Dask, and SQLAlchemy]: https://cratedb.com/docs/python/en/latest/by-example/sqlalchemy/dataframe.html
-[Guide to efficient data ingestion to CrateDB with pandas]: https://community.cratedb.com/t/guide-to-efficient-data-ingestion-to-cratedb-with-pandas/1541
-[Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy]: https://community.cratedb.com/t/importing-parquet-files-into-cratedb-using-apache-arrow-and-sqlalchemy/1161
[pandas]: https://pandas.pydata.org/
[pandas code examples]: https://github.com/crate/cratedb-examples/tree/main/by-dataframe/pandas

From afec250c2447f258af8b001bc7d435ca6b29447d Mon Sep 17 00:00:00 2001
From: Andreas Motl
Date: Sun, 14 Sep 2025 22:57:44 +0200
Subject: [PATCH 4/4] pandas: Update/fix link references

---
 docs/integrate/pandas/tutorial-jupyter.md | 4 ++--
 docs/integrate/pandas/tutorial-start.md   | 8 ++++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/docs/integrate/pandas/tutorial-jupyter.md b/docs/integrate/pandas/tutorial-jupyter.md
index bbf77282..4778a15c 100644
--- a/docs/integrate/pandas/tutorial-jupyter.md
+++ b/docs/integrate/pandas/tutorial-jupyter.md
@@ -21,7 +21,7 @@ If you're new to CrateDB and want to get started quickly and easily, a great option is to try the **Free Tier** in CrateDB Cloud. With the **Free Tier**, you have a limited cluster that is free forever; no payment method is required. Now, if you are ready to experience the full power of CrateDB Cloud, take advantage of the $200 in free credits to try the cluster of your dreams.

-To start with CrateDB Cloud, [navigate to the CrateDB website](https://crate.io/download?hsCtaTracking=caa20047-f2b6-4e8c-b7f9-63fbf818b17f%7Cf1ad6eaa-39ac-49cd-8115-ed7d5dac4d63) and follow the steps to create your CrateDB Cloud account. Once you log in to the CrateDB Cloud UI, select **Deploy Cluster** to create your free cluster, and you are ready to go!
+To start with CrateDB Cloud, [navigate to the CrateDB website](https://cratedb.com/download?hsCtaTracking=caa20047-f2b6-4e8c-b7f9-63fbf818b17f%7Cf1ad6eaa-39ac-49cd-8115-ed7d5dac4d63) and follow the steps to create your CrateDB Cloud account. Once you log in to the CrateDB Cloud UI, select **Deploy Cluster** to create your free cluster, and you are ready to go!
+To start with CrateDB Cloud, [navigate to the CrateDB website](https://cratedb.com/download?hsCtaTracking=caa20047-f2b6-4e8c-b7f9-63fbf818b17f%7Cf1ad6eaa-39ac-49cd-8115-ed7d5dac4d63) and follow the steps to create your CrateDB Cloud account. Once you log in to the CrateDB Cloud UI, select **Deploy Cluster** to create your free cluster, and you are ready to go! ![cratedb-cloud-free-tier](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/26fc603ca998d39631f93f1eb7c5dbd30f437e56.gif) @@ -81,7 +81,7 @@ When I read [yfinance](https://pypi.org/project/yfinance/)’s documentation (ve I want to download data from all S&P-500 companies, so having a list with all their symbols would be perfect. -I then found [this tutorial by Edoardo Romani](https://towardsdatascience.com/how-to-automate-financial-data-collection-with-python-using-tiingo-api-and-google-cloud-platform-b11d8c9afaa1), which shows how to get the symbols from the [List of S&P-500 companies' Wikipedia page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies) and store them in a list. +I then found [this tutorial by Edoardo Romani](https://medium.com/data-science/how-to-automate-financial-data-collection-with-python-using-tiingo-api-and-google-cloud-platform-b11d8c9afaa1), which shows how to get the symbols from the [List of S&P-500 companies' Wikipedia page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies) and store them in a list. So, in my Notebook, I import [BeautifulSoup 4.10.0](https://beautiful-soup-4.readthedocs.io/en/latest/) and [requests 2.26.0](https://pypi.org/project/requests/) to pull out HTML files from Wikipedia and create the following function: diff --git a/docs/integrate/pandas/tutorial-start.md b/docs/integrate/pandas/tutorial-start.md index 3464691d..c6213eb7 100644 --- a/docs/integrate/pandas/tutorial-start.md +++ b/docs/integrate/pandas/tutorial-start.md @@ -19,11 +19,15 @@ To follow along with this tutorial, you will need: ## Setting up CrateDB -Before we can start using CrateDB, we need to set it up. You can either download and install CrateDB locally via [Docker](https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#docker) or [tarball](https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#try-cratedb-without-installing) or use a [CrateDB Cloud](https://crate.io/download?hsCtaTracking=caa20047-f2b6-4e8c-b7f9-63fbf818b17f%7Cf1ad6eaa-39ac-49cd-8115-ed7d5dac4d63) instance with an option of the free cluster. +Before we can start using CrateDB, we need to set it up. You can either download and +install CrateDB locally via {ref}`Docker ` or +{ref}`tarball ` or use a +[CrateDB Cloud](https://cratedb.com/download?hsCtaTracking=caa20047-f2b6-4e8c-b7f9-63fbf818b17f%7Cf1ad6eaa-39ac-49cd-8115-ed7d5dac4d63) +instance with an option of the free cluster. Once you have a running instance of CrateDB, create a new table to store the customer data dataset. Here is an SQL command to create a table: -``` sql +```sql CREATE TABLE IF NOT EXISTS "doc"."customer_data" ( "customerid" INTEGER, "gender" TEXT,