
Commit 961bff7

Airflow: Implement suggestions by CodeRabbit, part 4
1 parent eee3766 commit 961bff7

5 files changed: +76 −51 lines changed


docs/integrate/airflow/data-retention-policy.md

Lines changed: 5 additions & 3 deletions
@@ -100,11 +100,13 @@ In the DAG’s main method, use Airflow’s [dynamic task mapping](https://airfl
SQLExecuteQueryOperator.partial(
task_id="delete_partition",
conn_id="cratedb_connection",
- sql="DELETE FROM {{params.table_fqn}} WHERE {{params.column}} = {{params.value}};",
+ sql="DELETE FROM {{ params.table_fqn }} WHERE {{ params.column }} = {{ params.value }};",
).expand(params=get_policies().map(map_policy))
```

- `get_policies` returns a set of policies. On each policy, the `map_policy` is applied. The return value of `map_policy` is finally passed as `params` to the `SQLExecuteQueryOperator`.
+ `get_policies` returns a set of policies. On each policy, the `map_policy` is
+ applied. The return value of `map_policy` is finally passed as `params` to the
+ `SQLExecuteQueryOperator`.

This leads us already to the final version of the DAG:
```python
@@ -136,7 +138,7 @@ def data_retention_delete():
SQLExecuteQueryOperator.partial(
task_id="delete_partition",
conn_id="cratedb_connection",
- sql="DELETE FROM {{params.table_fqn}} WHERE {{params.column}} = {{params.value}};",
+ sql="DELETE FROM {{ params.table_fqn }} WHERE {{ params.column }} = {{ params.value }};",
).expand(params=get_policies().map(map_policy))
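For context on the hunk above: the templated SQL receives its `params` from two helpers that the tutorial defines elsewhere in this file. A minimal sketch of plausible shapes for those helpers, assuming each policy row arrives as a `[schema, table, column, value]` list; the row shape, table names, and values here are illustrative and not taken from the commit:

```python
from airflow.decorators import task


@task
def get_policies():
    # Hypothetical placeholder rows; in the tutorial these come from a
    # retention-policy query against CrateDB.
    return [
        ["doc", "raw_metrics", "ts_day", 1672531200000],
    ]


def map_policy(policy):
    # Convert one policy row into the dict consumed by the templated SQL:
    # {{ params.table_fqn }}, {{ params.column }}, and {{ params.value }}.
    schema, table, column, value = policy
    return {
        "table_fqn": f'"{schema}"."{table}"',
        "column": column,
        "value": value,
    }
```

With helpers of this shape, `get_policies().map(map_policy)` yields one `params` dict per policy, and `.expand()` creates one `delete_partition` task instance for each.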

docs/integrate/airflow/getting-started.md

Lines changed: 19 additions & 12 deletions
@@ -59,15 +59,15 @@ Example output:
For other operating systems, follow the [official documentation](https://www.astronomer.io/docs/astro/cli/install-cli).
After installing the Astronomer CLI, initialize a new project:

- * Create a project directory:
+ - Create a project directory:
```bash
mkdir astro-project && cd astro-project
```
- * Initialize the project with the following command:
+ - Initialize the project with the following command:
```bash
astro dev init
```
- * This will create a skeleton project directory as follows:
+ - This will create a skeleton project directory as follows:
```text
├── Dockerfile
├── README.md
@@ -81,14 +81,16 @@ After installing the Astronomer CLI, initialize a new project:
```

The astronomer project consists of four Docker containers:
- * PostgreSQL server (for configuration/runtime data)
- * Airflow scheduler
- * Web server for rendering Airflow UI
- * Triggerer (running an event loop for deferrable tasks)
+ - PostgreSQL server (for configuration/runtime data)
+ - Airflow scheduler
+ - Web server for rendering Airflow UI
+ - Triggerer (running an event loop for deferrable tasks)

- The PostgreSQL server is configured to listen on port 5432. The web server is listening on port 8080 and can be accessed via http://localhost:8080/ with `admin` for both username and password.
+ The PostgreSQL server listens on port 5432. The web server listens on port 8080
+ and is available at <http://localhost:8080/> with `admin`/`admin`.

- If these ports are already in use, change them in `.astro/config.yaml`. For example, set the webserver to 8081 and PostgreSQL to 5435:
+ If these ports are already in use, change them in `.astro/config.yaml`. For
+ example, set the webserver to 8081 and PostgreSQL to 5435:
```yaml
project:
  name: astro-project
@@ -98,7 +100,8 @@ postgres:
  port: 5435
```

- Start the project with `astro dev start`. After the containers start, access the Airflow UI at `http://localhost:8081`:
+ Start the project with `astro dev start`. After the containers start, access
+ the Airflow UI at <http://localhost:8081>:

![Airflow UI landing page](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/f298a4c609312133e388555a9eba51733bfd5645.png)

@@ -127,13 +130,17 @@ The initialized `astro-project` now has a home on GitHub.

## Add database credentials

- To configure the connection to CrateDB we need to set up a corresponding environment variable. On Astronomer the environment variable can be set up via the Astronomer UI, via `Dockerfile`, or via a `.env` file which is automatically generated during project initialization.
+ To configure the CrateDB connection, set an environment variable. On
+ Astronomer, set it via the UI, `Dockerfile`, or the `.env` file
+ (generated during initialization).

In this tutorial, we will set up the necessary environment variables via a `.env` file. To learn about alternative ways, please check the [Astronomer documentation](https://docs.astronomer.io/astro/environment-variables). The first variable we set is one for the CrateDB connection, as follows:

`AIRFLOW_CONN_CRATEDB_CONNECTION=postgresql://<user>:<password>@<host>/doc?sslmode=disable`

- In case a TLS connection is required, change `sslmode=require`. To confirm that a new variable is applied, first, start the Airflow project and then create a bash session in the scheduler container by running `docker exec -it <scheduler_container_name> /bin/bash`.
+ For TLS, set `sslmode=require`. To confirm that the variable is applied, start
+ the project and open a bash session in the scheduler container:
+ `docker exec -it <scheduler_container_name> /bin/bash`.

Run `env` to list the applied environment variables.
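As a quick cross-check of the connection setup covered in the last hunk, a minimal smoke-test DAG, assuming the `AIRFLOW_CONN_CRATEDB_CONNECTION` variable from the docs is in place; the DAG id and schedule are illustrative only and this sketch is not part of the commit:

```python
import pendulum

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="cratedb_connection_smoke_test",  # illustrative name only
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
):
    # The conn_id is derived from the AIRFLOW_CONN_CRATEDB_CONNECTION
    # environment variable (the suffix after AIRFLOW_CONN_, lower-cased).
    SQLExecuteQueryOperator(
        task_id="select_one",
        conn_id="cratedb_connection",
        sql="SELECT 1;",
    )
```

Trigger it once from the UI; a green `select_one` task confirms that Airflow can reach CrateDB through the configured connection.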

docs/integrate/airflow/import-parquet.md

Lines changed: 26 additions & 13 deletions
@@ -2,19 +2,26 @@
# Automating the import of Parquet files with Apache Airflow

## Introduction
- Using Airflow to import the NYC Taxi and Limousine dataset in Parquet format.

- CrateDB does not support `COPY FROM` for Parquet. It supports CSV and JSON. Therefore, this tutorial uses an alternative approach rather than switching the previous CSV workflow to Parquet.
+ Use Airflow to import the NYC Taxi and Limousine dataset provided in Parquet format.

- First and foremost, keep in mind the strategy presented here for importing Parquet files into CrateDB, we have already covered this topic in a previous tutorial using a different approach from the one introduced in this tutorial, so feel free to have a look at the tutorial about {ref}`arrow-import-parquet` and explore with the different possibilities out there.
+ CrateDB supports `COPY FROM` for CSV and JSON, not Parquet. This tutorial converts
+ Parquet to CSV before loading.
+
+ For an alternative Parquet ingestion approach, see {ref}`arrow-import-parquet`.

## Prerequisites

- Before getting started, you need to have some knowledge of Airflow and an instance of Airflow already running. Besides that, a CrateDB instance should already be set up before moving on with this tutorial. This SQL is also available in the setup folder in our [GitHub repository](https://github.com/crate/crate-airflow-tutorial).
+ Before you start, have Airflow and CrateDB running. The SQL shown below also
+ resides in the setup folder of the
+ [GitHub repository](https://github.com/crate/crate-airflow-tutorial).

- We start by creating the two tables in CrateDB: A temporary staging table (`nyc_taxi.load_trips_staging`) and the final destination table (`nyc_taxi.trips`).
+ Create two tables in CrateDB: a temporary staging table
+ (`nyc_taxi.load_trips_staging`) and the final table (`nyc_taxi.trips`).

- In this case, the staging table is a primary insertion point, which was later used to cast data to their final types. For example, the `passenger_count` column is defined as `REAL` in the staging table, while it is defined as `INTEGER` in the `nyc_taxi.trips` table.
+ Insert into the staging table first, then cast values into their final
+ types when inserting into `nyc_taxi.trips`. For example, `passenger_count`
+ is `REAL` in staging and `INTEGER` in `nyc_taxi.trips`.

```sql
CREATE TABLE IF NOT EXISTS "nyc_taxi"."load_trips_staging" (
@@ -74,7 +81,7 @@ PARTITIONED BY ("pickup_year");
To better understand how Airflow works and its applications, you can check other
tutorials related to that topic {ref}`here <airflow-tutorials>`.

- Ok! So, once the tools are already set up with the corresponding tables created, we should be good to go.
+ With the tools set up and tables created, proceed to the DAG.

## The Airflow DAG
![Airflow DAG workflow|690x76](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/29502f83c13d29d90ab703a399f58c6daeee6fe6.png)
@@ -86,10 +93,12 @@ The Airflow DAG used in this tutorial contains 7 tasks:
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet
```
The file path above corresponds to the data from March 2022. So, to retrieve a specific file, the task gets the date and formats it to compose the name of the specific file. Important to mention that the data is released with 2 months of delay, so it had to be taken into consideration.
- * **process_parquet:** afterward, the name is used to download the file to local storage and then transform it from Parquet to CSV using [`parquet-tools`] (Apache Parquet CLI, see [Apache Arrow])
- * `curl -o "<LOCAL-PARQUET-FILE-PATH>" "<REMOTE-PARQUET-FILE>"`
- * `parquet-tools csv <LOCAL-PARQUET-FILE-PATH> > <CSV-FILE-PATH>`
- Both tasks are executed within one Bash Operator.
+ * **process_parquet:** afterward, use the name to download the file to local storage and convert it from Parquet to CSV using `parquet-tools` (Apache Parquet CLI; see [Apache Arrow]).
+
+ * `curl -o "<LOCAL-PARQUET-FILE-PATH>" "<REMOTE-PARQUET-FILE>"`
+ * `parquet-tools csv <LOCAL-PARQUET-FILE-PATH> > <CSV-FILE-PATH>`
+
+ Both commands run within one `BashOperator`.
* **copy_csv_to_s3:** Once the newly transformed file is available, it gets uploaded to an S3 Bucket to then, be used in the {ref}`crate-reference:sql-copy-from` SQL statement.
* **copy_csv_staging:** copy the CSV file stored in S3 to the staging table described previously.
* **copy_staging_to_trips:** finally, copy the data from the staging table to the trips table, casting the columns that are not in the right type yet.
@@ -101,9 +110,13 @@ The DAG was configured based on the characteristics of the data in use. In this
* How often does the data get updated
* When was the first file made available

- In this case, according to the NYC TLC website “Trip data is published monthly (with two months delay)”. So, the DAG is set up to run monthly, and given the first file was made available in January 2009, the start date was set to March 2009. But why March and not January? As previously mentioned, the files are made available with 2 months of delay, so the first DAG instance, which has a logical execution date equal to "March 2009" will retrieve March as the current month minus 2, corresponding to January 2009, the very first file ever published.
+ The NYC TLC publishes trip data monthly with a two-month delay. Set the DAG to
+ run monthly with a start date of March 2009. The first run (logical date March
+ 2009) downloads the file for January 2009 (logical date minus two months),
+ which is the first available dataset.

- You may find the full code for the DAG described above available in our [GitHub repository](https://github.com/crate/crate-airflow-tutorial/blob/main/dags/nyc_taxi_dag.py).
+ You may find the full code for the DAG described above available in our
+ [GitHub repository](https://github.com/crate/crate-airflow-tutorial/blob/main/dags/nyc_taxi_dag.py).

## Wrap up
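To make the **process_parquet** step described above concrete, here is a minimal sketch of the download-and-convert task as a single `BashOperator`, assuming hard-coded paths and the March 2022 file; the real DAG (`dags/nyc_taxi_dag.py` in the linked repository) derives the file name from the logical date minus two months:

```python
import pendulum

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nyc_taxi_parquet_sketch",  # illustrative name only
    start_date=pendulum.datetime(2022, 3, 1, tz="UTC"),
    schedule=None,
    catchup=False,
):
    # Download the Parquet file, then convert it to CSV, in one Bash task.
    # Assumes curl and parquet-tools are available in the worker environment.
    BashOperator(
        task_id="process_parquet",
        bash_command=(
            "curl -o '{{ params.local_parquet }}' '{{ params.remote_parquet }}' && "
            "parquet-tools csv '{{ params.local_parquet }}' > '{{ params.csv_path }}'"
        ),
        params={
            "remote_parquet": "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet",
            "local_parquet": "/tmp/yellow_tripdata_2022-03.parquet",
            "csv_path": "/tmp/yellow_tripdata_2022-03.csv",
        },
    )
```

The CSV produced here is what the subsequent S3 upload and `COPY FROM` steps operate on.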

docs/integrate/airflow/import-stock-market-data.md

Lines changed: 25 additions & 22 deletions
@@ -1,35 +1,36 @@
(airflow-import-stock-market-data)=
# Updating stock market data automatically with CrateDB and Apache Airflow

- Watch this tutorial on Youtube: https://www.youtube.com/watch?v=YTTUzeaYUgQ&t=685s
+ Watch this tutorial on YouTube: [Automating stock data with Airflow and CrateDB](https://www.youtube.com/watch?v=YTTUzeaYUgQ&t=685s).

If you are struggling with keeping your stock market data up to date, this tutorial walks you through exactly what you need to do so you can automate data collection and storage from SP500 companies.
- ![Picture by StockSnap on Pixabay](upload://tXDu25ajd6zX201Ju43lENW1uQ1.jpeg)

+ ## Quick overview

- ## Quick Overview
Let's have a quick overview of what you'll do:

- You have a goal: regularly update stock market data.
- To achieve your goal, you can divide it into tasks: download, prepare, and store data. You want to turn these tasks into a workflow, run it and observe the results; in other words, you want to orchestrate your workflow, and Airflow is the tool for that.
-
- So the first thing to do is to start CrateDB and set up a table to store your data. Then, to orchestrate the process of regular data updates, you will create an Airflow project and establish the connection to CrateDB. Once you set up your Airflow project, you will write your tasks in Python as an Airflow DAG workflow (more details later). Finally, you will set a schedule for your workflow, and it's done!
+ :Goal: Update stock market data regularly.
+ :Approach: Define tasks to download, prepare, and store data; orchestrate them with Airflow.
+ :Steps: Start CrateDB and create a table; create an Airflow project and set the CrateDB connection; implement the DAG in Python; schedule it.

## Setup
+
Let's get right to the setup on a Mac machine.
You want to make sure you have Homebrew installed and Docker Desktop running.

### Run CrateDB and create a table to store data

- The first to do is to run CrateDB with Docker. It's easy: once you have Docker Desktop running, copy the Docker command from the CrateDB installation page and run it in your terminal.
-
+ First, run CrateDB with Docker. With Docker Desktop running, copy the command from the CrateDB installation page and run it:
```bash
docker run --publish=4200:4200 --publish=5432:5432 --env CRATE_HEAP_SIZE=1g crate:latest
```

- With CrateDB running, you can now access the CrateDB Admin UI by going to your browser and typing *localhost:4200*.
+ With CrateDB running, you can now access the CrateDB Admin UI by going to
+ your browser and typing *localhost:4200*.

- Let’s now create a table to store your financial data. I'm particularly interested in the "adjusted-close" value for the stocks, so I will create a table that stores the date, the stock ticker, and the adjusted-close value. I will set the `closing_date` and `ticker` as primary keys. The final statement looks like this:
+ Create a table to store financial data. Focus on the adjusted close value
+ (“adjusted_close”) per ticker per day. Use a composite primary key on
+ (`closing_date`, `ticker`):
```sql
CREATE TABLE sp500 (
closing_date TIMESTAMP,
@@ -75,20 +76,22 @@ Some information about the default settings: the PostgreSQL server is set up to
There are now three things you have to adjust before running Airflow:

* Add your CrateDB credentials to the `.env` file. Open the file in a text editor, and add the following line, which takes the default credentials for CrateDB, with user = crate, and password = null. (note: my internal port for running CrateDB in Docker is 5433, which I use here. If using the standard Docker command with 5432, here it should also be 5432).
- ```bash
- AIRFLOW_CONN_CRATEDB_CONNECTION=postgresql://crate:[email protected]:5433/doc?sslmode=disable
- ```
+   ```bash
+   # For local development only; do not commit real credentials
+   AIRFLOW_CONN_CRATEDB_CONNECTION=postgresql://crate:[email protected]:5433/doc?sslmode=disable
+   ```
* If the default ports are unavailable, you can change them to free ports. Just open the `.astro/config.yaml` file in a text editor and update the web server port to 8081 (instead of default 8080) and Postgres port to 5435 (instead of the default 5432), like so:
- ```yaml
- project:
-   name: astro-project
- webserver:
-   port: 8081
- postgres:
-   port: 5435
- ```
+   ```yaml
+   project:
+     name: astro-project
+   webserver:
+     port: 8081
+   postgres:
+     port: 5435
+   ```

### Start Airflow
+
Now you are done with the last adjustments, head back to your terminal and run this command to start Airflow: `astro dev start`
You can now access Airflow in your browser at `http://localhost:8081`.
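Once the connection above is in place, the DAG's storage step boils down to writing one row per ticker per closing date. A hedged sketch of such an upsert through the `cratedb_connection` connection id; the key columns follow the `CREATE TABLE` hunk above, while the `adjusted_close` column name is assumed from the prose and the DAG id, schedule, and sample values are purely illustrative:

```python
import pendulum

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="sp500_upsert_sketch",  # illustrative name only
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
):
    # Upsert keyed on the composite primary key (closing_date, ticker).
    SQLExecuteQueryOperator(
        task_id="upsert_adjusted_close",
        conn_id="cratedb_connection",
        sql="""
            INSERT INTO sp500 (closing_date, ticker, adjusted_close)
            VALUES (%(closing_date)s, %(ticker)s, %(adjusted_close)s)
            ON CONFLICT (closing_date, ticker)
            DO UPDATE SET adjusted_close = excluded.adjusted_close;
        """,
        parameters={
            # Sample values for illustration; the tutorial's DAG fetches
            # real quotes and passes them in dynamically.
            "closing_date": "2024-01-02",
            "ticker": "AAPL",
            "adjusted_close": 185.64,
        },
    )
```

Re-running the task for the same date and ticker then updates the stored value instead of failing on the primary key.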

docs/integrate/airflow/index.md

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ journey. Spend time where it counts.
:columns: 12
:link: airflow-getting-started
:link-type: ref
- Define an Airflow DAG that downloads, processes, and stores stock market data in CrateDB.
+ Define an Airflow DAG that downloads, processes, and stores data in CrateDB.
:::

:::{grid-item-card} Tutorial: Import Parquet files
