docs/integrate/airflow/data-retention-hot-cold.md (6 additions, 5 deletions)
@@ -7,7 +7,7 @@ In this fourth article on automating recurrent CrateDB queries with [Apache Airf
A hot/cold storage strategy is often motivated by a tradeoff between performance and cost-effectiveness. In a database such as CrateDB, more recent data tends to have a higher significance for analytical queries. Well-performing disks (hot storage) play a key role on the infrastructure side to support performance requirements but can come at a high cost. As data ages and gets less business-critical for near-real-time analysis, transitioning it to slower/cheaper disks (cold storage) helps to improve the cost-performance ratio.
- In a CrateDB cluster, nodes can have different hardware specifications. Hence, a cluster can consist of a combination of hot and cold storage nodes, each with respective disks. By assigning corresponding attributes to nodes, CrateDB can be made aware of such node types and consider if when allocating partitions.
+ In a CrateDB cluster, nodes can have different hardware specifications. Hence, a cluster can consist of a combination of hot and cold storage nodes, each with respective disks. By assigning corresponding attributes to nodes, CrateDB recognizes these node types and considers them when allocating partitions.
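For illustration, a minimal sketch of how node attributes and the table-level allocation setting fit together, assuming nodes started with `-Cnode.attr.storage=hot` or `-Cnode.attr.storage=cold` and the `crate` Python client; the table and column names are made up:

```python
from crate import client

# Connect to any node of the cluster; adjust the URL for your setup.
connection = client.connect("http://localhost:4200")
cursor = connection.cursor()

# New partitions are only allocated on nodes started with node.attr.storage=hot.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS doc.raw_metrics (
        ts TIMESTAMP WITH TIME ZONE,
        ts_month TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('month', ts),
        value DOUBLE PRECISION
    )
    PARTITIONED BY (ts_month)
    WITH ("routing.allocation.require.storage" = 'hot')
""")
```

Switching a partition's `routing.allocation.require.storage` setting to `'cold'` later on is what triggers the relocation described in the rest of this article.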
## CrateDB setup
@@ -131,11 +131,12 @@ To remember which partitions have already been reallocated, we can make use of t
## Airflow setup
- We assume that a basic Astronomer/Airflow setup is already in place, as described in our {ref}`first post <airflow-export-s3>`. Let’s quickly go through the three steps of the algorithm:
+ Assume a basic Astronomer/Airflow setup is in place, as described in the {ref}`first post <airflow-export-s3>`. The algorithm has three steps (a condensed sketch follows below):
1. `get_policies`: A query on `doc.retention_policies` and `information_schema.table_partitions` identifies partitions affected by a retention policy.
2. `map_policy`: A helper method transforming the output of `get_policies` into a Python `dict` data structure for easier handling.
- 4. `reallocate_partitions`: Executes an SQL statement for each mapped policy: `ALTER TABLE <table> PARTITION (<partition key> = <partition value>) SET ("routing.allocation.require.storage" = 'cold');`
+ 3. `reallocate_partitions`: Executes an SQL statement for each mapped policy: `ALTER TABLE <table> PARTITION (<partition key> = <partition value>) SET ("routing.allocation.require.storage" = 'cold');`
The CrateDB cluster will then automatically initiate the relocation of the affected partition to a node that fulfills the requirement (`cratedb03` in our case).
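For orientation, a hedged, condensed sketch of how these three steps can be wired together with dynamic task mapping. Task ids, the `cratedb_connection` connection id, the schedule, and the sample policy row are illustrative, and `map_policy` returns the finished SQL statement here for brevity; the linked DAG below is authoritative:

```python
import pendulum
from airflow.decorators import dag, task
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator


@task
def get_policies(ds=None):
    """Step 1: look up partitions affected by a retention policy.

    The real DAG queries doc.retention_policies joined with
    information_schema.table_partitions; a static sample row is returned here.
    """
    return [{"schema": "doc", "table": "raw_metrics", "column": "ts_month", "value": 1640995200000}]


@task
def map_policy(policy):
    """Step 2: map one policy row to its reallocation statement."""
    return (
        f"ALTER TABLE {policy['schema']}.{policy['table']} "
        f"PARTITION ({policy['column']} = {policy['value']}) "
        "SET (\"routing.allocation.require.storage\" = 'cold');"
    )


@dag(start_date=pendulum.datetime(2021, 11, 19, tz="UTC"), schedule="@daily", catchup=False)
def data_retention_reallocate():
    # Step 3: one mapped operator instance per affected partition.
    SQLExecuteQueryOperator.partial(
        task_id="reallocate_partitions",
        conn_id="cratedb_connection",
    ).expand(sql=map_policy.expand(policy=get_policies()))


data_retention_reallocate()
```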
The full implementation is available as [data_retention_reallocate_dag.py](https://github.com/crate/crate-airflow-tutorial/blob/main/dags/data_retention_reallocate_dag.py) on GitHub.
@@ -168,5 +169,5 @@ INSERT INTO doc.retention_policies (table_schema, table_name, partition_column,
## Summary
- Building upon the previously discussed data retention policy implementation, we showed that reallocating partitions integrates seemingly and consists only of a single SQL statement.
- CrateDB’s self-organization capabilities take care of all low-level operations and the actual moving of partitions. Furthermore, we showed that a multi-staged approach to data retention policies can be achieved by first reallocating and eventually deleting partitions permanently.
+ Building on the earlier data retention policy implementation, reallocating partitions integrates seamlessly and consists of a single SQL statement.
+ CrateDB’s self-organization handles the low-level operations and the actual movement of partitions. A multi-stage policy is straightforward: first reallocate, then delete.
- In the DAG's main method, we can now make use of Airflows' [dynamic task mapping](https://airflow.apache.org/docs/apache-airflow/2.3.0/concepts/dynamic-task-mapping.html) which allows executing the same task several times, with different parameters:
+ In the DAG’s main method, use Airflow’s [dynamic task mapping](https://airflow.apache.org/docs/apache-airflow/2.3.0/concepts/dynamic-task-mapping.html) to execute the same task several times with different parameters:
```python
SQLExecuteQueryOperator.partial(
@@ -118,7 +118,7 @@ def map_policy(policy):
@task
def get_policies(ds=None):
- """Retrieve all partitions effected by a policy"""
+ """Retrieve all partitions affected by a policy"""
In general, to export data to a file one can use the `COPY TO` statement in CrateDB. This command exports the content of a table to one or more JSON files in a given directory. JSON files have unique names and they are formatted to contain one table row per line. The `TO` clause specifies the URI string of the output location. CrateDB supports two URI schemes: `file` and `s3`. We use the `s3` scheme to access the bucket on Amazon S3. Further information on different clauses of the `COPY TO` statement can be found in the official {ref}`CrateDB documentation <crate-reference:sql-copy-to>`.
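As an illustration of how such an export can be wrapped into an Airflow task, here is a hedged sketch; the connection id, bucket, credentials, and target path are placeholders, not the tutorial's exact statement:

```python
import pendulum
from airflow.decorators import dag
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator


@dag(start_date=pendulum.datetime(2021, 11, 11, tz="UTC"), schedule="@daily", catchup=False)
def metrics_export_s3():
    # Placeholder credentials and bucket; {{ ds }} adds the logical date to the target path.
    SQLExecuteQueryOperator(
        task_id="export_metrics_to_s3",
        conn_id="cratedb_connection",
        sql="""
            COPY doc.metrics
            TO DIRECTORY 's3://my-access-key:my-secret-key@my-bucket/metrics/{{ ds }}';
        """,
    )


metrics_export_s3()
```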
@@ -34,7 +34,7 @@ To export data from the `metrics` table to S3, we need a statement such as:
## DAG implementation
- In order to build a generic DAG that is not specific to one single table configuration, we first create a file `include/table_exports.py`, containing a list of dictionaries (key/value pairs) for each table to export:
+ To keep the DAG generic, create `include/table_exports.py` with one dictionary per table to export:
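A hedged sketch of what such a file can contain; the keys and values are illustrative and may differ from the tutorial's actual configuration:

```python
# include/table_exports.py (illustrative content)
TABLE_EXPORTS = [
    {
        "table": "doc.metrics",          # table to export
        "s3_bucket": "my-bucket",        # placeholder bucket name
        "s3_prefix": "exports/metrics",  # target path inside the bucket
    },
    # One dictionary per additional table to export.
]
```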
docs/integrate/airflow/import-parquet.md (9 additions, 7 deletions)
@@ -4,7 +4,7 @@
## Introduction
Using Airflow to import the NYC Taxi and Limousine dataset in Parquet format.
- Currently, Parquet imports using COPY FROM are not supported by CrateDB, it only supports CSV and JSON files instead. Because of that, we implemented a different approach from simply changing the previous implementation from CSV to Parquet.
+ CrateDB does not support `COPY FROM` for Parquet. It supports CSV and JSON. Therefore, this tutorial uses an alternative approach rather than switching the previous CSV workflow to Parquet.
Keep in mind that we have already covered importing Parquet files into CrateDB in a previous tutorial, using a different approach from the one introduced here; feel free to have a look at {ref}`arrow-import-parquet` and explore the different possibilities out there.
@@ -80,20 +80,19 @@ Ok! So, once the tools are already set up with the corresponding tables created,
(Figure: the Airflow DAG for the Parquet import workflow.)
The DAG pictured above represents a routine that will run every month to retrieve the latest released file by NYC TLC based on the execution date of that particular instance. Since it is configured to catch up with previous months when enabled, it will generate one instance for each previous month since January 2009 and each instance will download and process the corresponding month, based on the logical execution date.
- The Airflow DAG used in this tutorial contains 6 tasks which are described below:
+ The Airflow DAG used in this tutorial contains 7 tasks (a condensed sketch of the first two follows the list):
* **format_file_name:** according to the NYC Taxi and Limousine Commission (TLC) [documentation](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), the files are named after the month they correspond to, for example:
The file path above corresponds to the data from March 2022. To retrieve a specific file, the task formats the execution date to compose the file name. Note that the data is released with a two-month delay, which is taken into consideration.
- * **process_parquet:** afterward, the name is used to download the file to local storage and then transform it from the Parquet format into CSV using CLI tools of [Apache Arrow](https://github.com/apache/arrow), as follows:
+ * **process_parquet:** afterward, the name is used to download the file to local storage and then transform it from Parquet to CSV using `parquet-tools` (the Apache Parquet CLI; see [Apache Arrow](https://github.com/apache/arrow)).
* **copy_csv_to_s3:** once the newly transformed file is available, it is uploaded to an S3 bucket and then used in the {ref}`crate-reference:sql-copy-from` SQL statement.
* **copy_csv_staging:** copy the CSV file stored in S3 to the staging table described previously.
- * **copy_stating_to_trips:** finally, copy the data from the staging table to the trips table, casting the columns that are not in the right type yet.
+ * **copy_staging_to_trips:** finally, copy the data from the staging table to the trips table, casting the columns that are not yet of the right type.
* **delete_staging:** after everything is processed, clean up the staging table by deleting all rows, preparing it for the next file.
* **delete_local_parquet_csv:** delete the files (Parquet and CSV) from local storage.
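For illustration, a hedged sketch of the first two tasks referenced above. The schedule and start date follow the description, while the file-name pattern, download URL, local paths, and the use of `pyarrow` instead of the `parquet-tools` CLI are assumptions:

```python
import urllib.request

import pendulum
import pyarrow.csv as pacsv
import pyarrow.parquet as pq
from airflow.decorators import dag, task


@dag(schedule="@monthly", start_date=pendulum.datetime(2009, 1, 1, tz="UTC"), catchup=True)
def nyc_taxi_parquet_import():

    @task
    def format_file_name(logical_date=None):
        """Compose the month-based file name, accounting for the two-month release delay."""
        month = logical_date.subtract(months=2)
        return f"yellow_tripdata_{month.format('YYYY-MM')}.parquet"  # assumed naming pattern

    @task
    def process_parquet(file_name: str) -> str:
        """Download the Parquet file and convert it to CSV."""
        url = f"https://example-tlc-host/{file_name}"  # placeholder download URL
        local_parquet = f"/tmp/{file_name}"
        local_csv = local_parquet.replace(".parquet", ".csv")
        urllib.request.urlretrieve(url, local_parquet)
        pacsv.write_csv(pq.read_table(local_parquet), local_csv)
        return local_csv

    process_parquet(format_file_name())


nyc_taxi_parquet_import()
```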
@@ -114,3 +113,6 @@ are other approaches out there, we encourage you to try them out.
If you want to continue to explore how CrateDB can be used with Airflow, you can
check other tutorials related to that topic {ref}`here <airflow-tutorials>`.
docs/integrate/airflow/import-stock-market-data.md (8 additions, 8 deletions)
@@ -34,7 +34,7 @@ Let’s now create a table to store your financial data. I'm particularly intere
CREATE TABLE sp500 (
closing_date TIMESTAMP,
ticker TEXT,
- adjusted_close double,
+ adjusted_close DOUBLE PRECISION,
primary key (closing_date, ticker)
);
```
@@ -105,22 +105,24 @@ Create a `.py` file for your DAG in your `astro-project/dags` folder; I will cal
### Import operators and modules
- Let’s start by importing the necessary operator to connect to CrateDB, the `PostgresOperator`, and the decorator to define the DAG and its tasks. You will also import the `datetime`, `pendulum` modules to set up your schedule and the `yfinance`, `pandas`, and `json` modules to download and manipulate the financial data later.
+ Import the operator used in this tutorial, `SQLExecuteQueryOperator`, and the decorator to define the DAG and its tasks. You will also import the `datetime`, `pendulum` modules to set up your schedule and the `yfinance`, `pandas`, and `json` modules to download and manipulate the financial data later.
```python
import datetime
import math
import json
import logging
import pendulum
- import requests
- from bs4 import BeautifulSoup
import yfinance as yf
import pandas as pd
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from airflow.decorators import dag, task
```
Don’t forget to add these modules to the `requirements.txt` file inside your project like so:
-```
+```text
apache-airflow-providers-postgres>=5.3.1
apache-airflow-providers-common-sql>=1.3.1
apache-airflow[pandas]
@@ -133,7 +135,7 @@ The next step is to declare the necessary tasks to download, prepare and insert
#### Download task
Let's first write a function to download data from `yfinance`; I will call it `download_yfinance_data`.
- You can use ds for today’s date or get yesterday’s date with `airflow.macros.ds_add(ds, -1)`. You start by listing tickers from stocks of interest into a `tickers` variable. You then pass this list and the start date as arguments to the `yf.download` function and store the result in a `data `variable. `data` is a pandas data frame with various values for each stock, such as high/low, volume, dividends, and so on. Today, I will focus on the adjusted close value, so I filter data using the `Adj Close` key. Moreover, I return the data as a JSON object (instead of a data frame) because it works better with XCom, which is Airflow's mechanism to talk between tasks. Finally, you set this function as an Airflow task using the `@task` decorator and give it an execution timeout.
+ You can use `ds` for today’s date or get yesterday’s date with `airflow.macros.ds_add(ds, -1)`. You start by listing tickers from stocks of interest into a `tickers` variable. You then pass this list and the start date as arguments to the `yf.download` function and store the result in a `data` variable. `data` is a pandas data frame with various values for each stock, such as high/low, volume, dividends, and so on. Today, I will focus on the adjusted close value, so I filter data using the `Adj Close` key. Moreover, I return the data as a JSON object (instead of a data frame) because it works better with XCom, which is Airflow's mechanism to talk between tasks. Finally, you set this function as an Airflow task using the `@task` decorator and give it an execution timeout.
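A hedged sketch of what `download_yfinance_data` can look like; the ticker list, timeout, and date handling are illustrative:

```python
import datetime
import json

import yfinance as yf
from airflow.decorators import task  # same imports as in the block above


@task(execution_timeout=datetime.timedelta(minutes=5))
def download_yfinance_data(ds=None):
    """Download adjusted close prices and return them JSON-serializable for XCom."""
    tickers = ["AAPL", "AMZN", "GOOG", "MSFT"]  # illustrative subset of the S&P 500
    # ds is injected by Airflow; use airflow.macros.ds_add(ds, -1) for yesterday instead.
    data = yf.download(tickers, start=ds)["Adj Close"]
    return json.loads(data.to_json())
```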
docs/integrate/arrow/import-parquet.md (6 additions, 6 deletions)
@@ -9,7 +9,7 @@ Apache Parquet is a free and open-source column-oriented data storage format. It
## Prerequisites
- The libraries needed are **crate**, **sqlalchemy** and **pyarrow**, so you should install them. To do so, you can use the following `pip install` command. To check the latest version supported by CrateDB, have a look at the {ref}`sqlalchemy-cratedb:index`.
+ Install the required libraries: **sqlalchemy-cratedb** and **pyarrow**.
Now, make sure to set up the SQLAlchemy engine and session as seen below. If you are not using localhost, remember to replace the URI string with your own.
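The corresponding code is not part of this excerpt; a minimal sketch, assuming a CrateDB instance on localhost:

```python
import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker

# Replace the URI if CrateDB does not run on localhost.
engine = sa.create_engine("crate://localhost:4200")
Session = sessionmaker(bind=engine)
session = Session()
```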
- Before processing the newly imported file, the corresponding Model must be created, this is the representation of the final table, if you are using a different dataset, adapt the model to your data. Remember that the attribute name is case sensitive, so in our example **vendorID** will have the same name in CrateDB's table.
+ Before processing the newly imported file, the corresponding model must be created. It is the representation of the final table; if you are using a different dataset, adapt the model to your data. Attribute names are case-sensitive. In this example, **VendorID** in the model maps to "VendorID" in CrateDB.
- It will create the table in CrateDB **if it does not exist**, if you change the schema after creating the table, it might fail, in this case you will need to rename the table in CrateDB and adapt the model, if the data doesn't matter you can delete the table and re-create it.
+ It creates the table in CrateDB if it does not exist. If you later change the schema, `create_all()` may fail. In that case, rename or drop the existing table, adjust the model, and recreate it.
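A condensed sketch of the model and the `create_all()` call; apart from `VendorID`, the table and column names are illustrative:

```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base

engine = sa.create_engine("crate://localhost:4200")  # as set up above
Base = declarative_base()


class TaxiTrip(Base):
    """Trimmed-down model; adapt the columns to your dataset."""

    __tablename__ = "taxi_trips"  # illustrative table name

    # Attribute names are case-sensitive: "VendorID" here becomes "VendorID" in CrateDB.
    VendorID = sa.Column(sa.Integer, primary_key=True)
    passenger_count = sa.Column(sa.Integer)
    fare_amount = sa.Column(sa.Float)


# Creates the table only if it does not exist yet.
Base.metadata.create_all(engine)
```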
For further details on how to use SQLAlchemy with CrateDB, refer to the {ref}`sqlalchemy-cratedb:index`.