Airflow: Implement suggestions by CodeRabbit, part 2

amotl · amotl · commit b39fe56b5e57 · 2025-09-16T08:50:10.000+02:00
diff --git a/docs/integrate/airflow/getting-started.md b/docs/integrate/airflow/getting-started.md
@@ -7,46 +7,59 @@ Automate CrateDB queries with Apache Airflow.
 
 ## Introduction
 
-This first article shows how to use [Apache Airflow] with CrateDB to automate recurring queries.
+This guide shows how to use [Apache Airflow] with CrateDB to automate recurring queries.
 
-Then, we cover [Astronomer], the managed Apache Airflow provider, followed
-by instructions on how to set up the project with [Astronomer CLI].
-Finally, we illustrate with relatively simple examples how to schedule and
-execute recurring queries.
+You will:
+- understand [Astronomer], a managed Apache Airflow platform,
+- set up a local project with the [Astronomer CLI], and
+- schedule and execute recurring queries with simple examples.
 
 :::{rubric} Apache Airflow
 :::
-Apache Airflow is a platform for programmatically creating, scheduling, and monitoring workflows \[[Official documentation](https://airflow.apache.org/docs/)\]. Workflows are defined as directed acyclic graphs (DAGs) where each node in DAG represents an execution task. It is worth mentioning that each task is executed independently of other tasks and the purpose of DAG is to track the relationships between tasks. DAGs are designed to run on demand and in data intervals (e.g., twice a week).
+Apache Airflow programmatically creates, schedules, and monitors workflows
+\[[Official documentation](https://airflow.apache.org/docs/)\]. A workflow
+is a directed acyclic graph (DAG) where each node represents a task. Each
+task runs independently; the DAG tracks dependencies. Run DAGs on demand
+or on schedules (for example, twice a week).
 
 :::{rubric} CrateDB
 :::
-CrateDB is an open-source distributed database that makes storage and analysis of massive amounts of data simple and efficient. CrateDB offers a high degree of scalability, flexibility, and availability. It supports dynamic schemas, queryable objects, time-series data support, and real-time full-text search over millions of documents in just a few seconds.
+CrateDB is an open-source, distributed database for storing and analyzing
+large volumes of data. It offers high scalability, flexibility, and
+availability, supports dynamic schemas and queryable objects, and provides
+time series features and real-time full-text search over millions of
+documents in seconds.
 
-As CrateDB is designed to store and analyze massive amounts of data, continuous use of such data is a crucial task in many production applications of CrateDB. Needless to say, Apache Airflow is one of the most heavily used tools for the automation of big data pipelines. It has a very resilient architecture and scalable design. This makes Airflow an excellent tool for the automation of recurring tasks that run on CrateDB.
+Because CrateDB powers large-scale data workloads, many deployments automate
+recurring tasks. Apache Airflow’s resilient, scalable architecture makes it
+a strong choice for orchestrating those tasks on CrateDB.
 
 :::{rubric} Astronomer
 :::
-Since its inception in 2014, the complexity of Apache Airflow and its features has grown significantly. To run Airflow in production, it is no longer sufficient to know only Airflow, but also the underlying infrastructure used for Airflow deployment.
+Since 2014, Apache Airflow and its ecosystem have grown significantly. To run Airflow in production, you need to understand both Airflow and the underlying deployment infrastructure.
 
-To help maintain complex environments, one can use managed Apache Airflow providers such as Astronomer. Astronomer is one of the main managed providers that allows users to easily run and monitor Apache Airflow deployments. It runs on Kubernetes, abstracts all underlying infrastructure details, and provides a clean interface for constructing and managing different workflows.
+To simplify operations, use a managed Apache Airflow provider such as Astronomer. Astronomer runs on Kubernetes, abstracts infrastructure details, and provides a clean interface for building and operating workflows.
 
-## Setting up an Airflow project
-We set up a new Airflow project on an 8-core machine with 30GB RAM running Ubuntu 22.04 LTS. To initialize the project we use Astronomer CLI. The installation process requires [Docker](https://www.docker.com/) version 18.09 or higher. To install the latest version of the Astronomer CLI on Ubuntu, run:
+## Set up a local Airflow project
 
-`curl -sSL install.astronomer.io | sudo bash -s`
-
-To make sure that you installed Astronomer CLI on your machine, run:
+The examples use an 8‑core machine with 30 GB RAM on Ubuntu 22.04 LTS. Install the Astronomer CLI (requires [Docker](https://www.docker.com/) 18.09+). On Ubuntu:
+```shell
+curl -sSL install.astronomer.io | sudo bash -s
+```
 
-`astro version`
+Verify the installation:
+```shell
+astro version
+```
 
-If the installation was successful, you will see the output similar to:
+Example output:
 
 `Astro CLI Version: 1.14.1`
 
-To install Astronomer CLI on another operating system, follow the [official documentation](https://www.astronomer.io/docs/astro/cli/install-cli).
-After the successful installation of Astronomer CLI, create and initialize the new project as follows:
+For other operating systems, follow the [official documentation](https://www.astronomer.io/docs/astro/cli/install-cli).
+After installing the Astronomer CLI, initialize a new project:
 
-* Create project directory:
+* Create a project directory:
   ```bash
   mkdir astro-project && cd astro-project
   ```
@@ -75,7 +88,7 @@ The astronomer project consists of four Docker containers:
 
 The PostgreSQL server is configured to listen on port 5432. The web server is listening on port 8080 and can be accessed via http://localhost:8080/ with `admin` for both username and password.
 
-In case these ports are already occupied you can change them in the file `.astro/config.yaml` inside the project folder. In our case we changed the web server port to 8081 and `postgres` port to 5435:
+If these ports are already in use, change them in `.astro/config.yaml` inside the project folder. For example, set the web server port to 8081 and the PostgreSQL port to 5435:
 ```yaml
 project:
   name: astro-project
@@ -85,9 +98,9 @@ postgres:
   port: 5435
 ```
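
To find out whether the default ports are already taken before editing `.astro/config.yaml`, a quick check along the following lines may help. This is a minimal sketch, not part of the Astronomer tooling: the helper name `port_is_free` is hypothetical, and it assumes the services would listen on the local loopback interface.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts TCP connections on host:port.

    Hypothetical helper for illustration; it only probes the given host.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when a listener accepted the connection.
        return sock.connect_ex((host, port)) != 0


if __name__ == "__main__":
    # Default ports mentioned in the text: web server 8080, PostgreSQL 5432.
    for name, port in {"webserver": 8080, "postgres": 5432}.items():
        state = "free" if port_is_free(port) else "in use"
        print(f"{name} port {port}: {state}")
```

If a port is reported as in use, pick another value for it in `.astro/config.yaml`.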
 
-To start the project, run `astro dev start`. After Docker containers are spun up, access the Airflow UI at `http://localhost:8081` as illustrated:
+Start the project with `astro dev start`. After the containers start, access the Airflow UI at `http://localhost:8081`:
 
-![Screenshot 2021-11-10 at 14.05.15|690x242](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/f298a4c609312133e388555a9eba51733bfd5645.png)
+![Airflow UI landing page](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/f298a4c609312133e388555a9eba51733bfd5645.png)
 
 The landing page of the Apache Airflow UI shows the list of all DAGs, their status, the time of the next and last run, and metadata such as the owner and schedule. From the UI, you can manually trigger a DAG with the button in the Actions section, pause or unpause DAGs with the toggle button near the DAG name, and filter DAGs by tag. Clicking a specific DAG shows its graph of tasks and the dependencies between them.
 
@@ -109,11 +122,11 @@ To configure the connection to CrateDB we need to set up a corresponding environ
 
 In this tutorial, we will set up the necessary environment variables via a `.env` file. To learn about alternative ways, please check the [Astronomer documentation](https://docs.astronomer.io/astro/environment-variables). The first variable we set is one for the CrateDB connection, as follows:
 
-`AIRFLOW_CONN_CRATEDB_CONNECTION=postgresql://<CrateDB user name>:<CrateDB user password>@<CrateDB host>/doc?sslmode=disable`
+`AIRFLOW_CONN_CRATEDB_CONNECTION=postgresql://<user>:<password>@<host>/doc?sslmode=disable`
 
 If a TLS connection is required, change the parameter to `sslmode=require`. To confirm that the new variable is applied, first start the Airflow project, then open a bash session in the scheduler container by running `docker exec -it <scheduler_container_name> /bin/bash`.
 
-To check all environment variables that are applied, run `env`.
+Run `env` to list the applied environment variables.
 
 This will output some variables set by Astronomer by default including the variable for the CrateDB connection.
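
Credentials that contain characters such as `@` or `/` need to be URL-encoded before they go into the connection URI. The following is a minimal sketch of assembling the `.env` line shown above. The function name `cratedb_conn_uri` and the sample credentials are hypothetical, and the defaults (`doc` schema, `sslmode=disable`) simply mirror the example in the text.

```python
from urllib.parse import quote


def cratedb_conn_uri(
    user: str,
    password: str,
    host: str,
    schema: str = "doc",
    sslmode: str = "disable",
) -> str:
    """Build a PostgreSQL-style connection URI for CrateDB.

    Hypothetical helper: user and password are percent-encoded so that
    reserved characters do not break URI parsing.
    """
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}/{schema}?sslmode={sslmode}"
    )


if __name__ == "__main__":
    # Sample values are placeholders, not real credentials.
    uri = cratedb_conn_uri("crate", "p@ss/word", "localhost:5432")
    # Print the line in the form expected in the `.env` file.
    print(f"AIRFLOW_CONN_CRATEDB_CONNECTION={uri}")
```

For a TLS connection, pass `sslmode="require"` instead of relying on the default.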
 
diff --git a/docs/integrate/arrow/index.md b/docs/integrate/arrow/index.md
@@ -2,11 +2,18 @@
 (apache-arrow)=
 # Arrow
 
-[Apache Arrow] defines a language-independent columnar memory format for flat
-and nested data, organized for efficient analytic operations on modern
-hardware like CPUs and GPUs. The Arrow memory format also supports
-zero-copy reads for lightning-fast data access without serialization overhead.
+```{div} .float-right
+[![Apache Arrow logo](https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png){height=60px loading=lazy}][Apache Arrow]
+```
+```{div} .clearfix
+```
 
+:::{rubric} About
+:::
+
+[Apache Arrow] defines a language-independent, columnar memory format for flat
+and nested data. It enables efficient analytics on modern CPUs and GPUs and
+supports zero-copy reads, avoiding serialization overhead.
 
 :::{rubric} Learn
 :::
@@ -16,7 +23,7 @@ zero-copy reads for lightning-fast data access without serialization overhead.
 :::{grid-item-card} Tutorial: Import Parquet files
 :link: arrow-import-parquet
 :link-type: ref
-Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy.
+Import Parquet files into CrateDB with Apache Arrow and SQLAlchemy.
 :::
 
 ::::