
Commit b39fe56

Airflow: Implement suggestions by CodeRabbit, part 2
1 parent 25cf79b commit b39fe56

2 files changed (+50 -30 lines)

docs/integrate/airflow/getting-started.md

Lines changed: 38 additions & 25 deletions
@@ -7,46 +7,59 @@ Automate CrateDB queries with Apache Airflow.
 
 ## Introduction
 
-This first article shows how to use [Apache Airflow] with CrateDB to automate recurring queries.
+This guide shows how to use [Apache Airflow] with CrateDB to automate recurring queries.
 
-Then, we cover [Astronomer], the managed Apache Airflow provider, followed
-by instructions on how to set up the project with [Astronomer CLI].
-Finally, we illustrate with relatively simple examples how to schedule and
-execute recurring queries.
+You will:
+- understand [Astronomer], a managed Apache Airflow platform,
+- set up a local project with the [Astronomer CLI], and
+- schedule and execute recurring queries with simple examples.
 
 :::{rubric} Apache Airflow
 :::
-Apache Airflow is a platform for programmatically creating, scheduling, and monitoring workflows \[[Official documentation](https://airflow.apache.org/docs/)\]. Workflows are defined as directed acyclic graphs (DAGs) where each node in DAG represents an execution task. It is worth mentioning that each task is executed independently of other tasks and the purpose of DAG is to track the relationships between tasks. DAGs are designed to run on demand and in data intervals (e.g., twice a week).
+Apache Airflow programmatically creates, schedules, and monitors workflows
+\[[Official documentation](https://airflow.apache.org/docs/)\]. A workflow
+is a directed acyclic graph (DAG) where each node represents a task. Each
+task runs independently; the DAG tracks dependencies. Run DAGs on demand
+or on schedules (for example, twice a week).
 
 :::{rubric} CrateDB
 :::
-CrateDB is an open-source distributed database that makes storage and analysis of massive amounts of data simple and efficient. CrateDB offers a high degree of scalability, flexibility, and availability. It supports dynamic schemas, queryable objects, time-series data support, and real-time full-text search over millions of documents in just a few seconds.
+CrateDB is an open-source, distributed database for storing and analyzing
+large volumes of data. It offers high scalability, flexibility, and
+availability, supports dynamic schemas and queryable objects, and provides
+time series features and real-time full-text search over millions of
+documents in seconds.
 
-As CrateDB is designed to store and analyze massive amounts of data, continuous use of such data is a crucial task in many production applications of CrateDB. Needless to say, Apache Airflow is one of the most heavily used tools for the automation of big data pipelines. It has a very resilient architecture and scalable design. This makes Airflow an excellent tool for the automation of recurring tasks that run on CrateDB.
+Because CrateDB powers large-scale data workloads, many deployments automate
+recurring tasks. Apache Airflow’s resilient, scalable architecture makes it
+a strong choice for orchestrating those tasks on CrateDB.
 
 :::{rubric} Astronomer
 :::
-Since its inception in 2014, the complexity of Apache Airflow and its features has grown significantly. To run Airflow in production, it is no longer sufficient to know only Airflow, but also the underlying infrastructure used for Airflow deployment.
+Since 2014, Apache Airflow and its ecosystem have grown significantly. To run Airflow in production, you need to understand both Airflow and the underlying deployment infrastructure.
 
-To help maintain complex environments, one can use managed Apache Airflow providers such as Astronomer. Astronomer is one of the main managed providers that allows users to easily run and monitor Apache Airflow deployments. It runs on Kubernetes, abstracts all underlying infrastructure details, and provides a clean interface for constructing and managing different workflows.
+To simplify operations, use a managed Apache Airflow provider such as Astronomer. Astronomer runs on Kubernetes, abstracts infrastructure details, and provides a clean interface for building and operating workflows.
 
-## Setting up an Airflow project
-We set up a new Airflow project on an 8-core machine with 30GB RAM running Ubuntu 22.04 LTS. To initialize the project we use Astronomer CLI. The installation process requires [Docker](https://www.docker.com/) version 18.09 or higher. To install the latest version of the Astronomer CLI on Ubuntu, run:
+## Set up a local Airflow project
 
-`curl -sSL install.astronomer.io | sudo bash -s`
-
-To make sure that you installed Astronomer CLI on your machine, run:
+The examples use an 8‑core machine with 30 GB RAM on Ubuntu 22.04 LTS. Install the Astronomer CLI (requires [Docker](https://www.docker.com/) 18.09+). On Ubuntu:
+```shell
+curl -sSL install.astronomer.io | sudo bash -s
+```
 
-`astro version`
+Verify the installation:
+```shell
+astro version
+```
 
-If the installation was successful, you will see the output similar to:
+Example output:
 
 `Astro CLI Version: 1.14.1`
 
-To install Astronomer CLI on another operating system, follow the [official documentation](https://www.astronomer.io/docs/astro/cli/install-cli).
-After the successful installation of Astronomer CLI, create and initialize the new project as follows:
+For other operating systems, follow the [official documentation](https://www.astronomer.io/docs/astro/cli/install-cli).
+After installing the Astronomer CLI, initialize a new project:
 
-* Create project directory:
+* Create a project directory:
 ```bash
 mkdir astro-project && cd astro-project
 ```
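
To make the DAG model from the introduction concrete, here is a minimal TaskFlow sketch: each decorated function becomes a task, and the call chain defines the dependencies the DAG tracks. The DAG id, schedule, and task bodies are illustrative assumptions, and the `@dag`/`@task` decorators assume Airflow 2.4 or later.

```python
from datetime import datetime

from airflow.decorators import dag, task


# Hypothetical DAG: the id, schedule, and task bodies are illustrative only.
@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_recurring_queries():

    @task
    def build_statement() -> str:
        # Upstream task: produce the SQL to run for this interval.
        return "SELECT 1"

    @task
    def execute(statement: str) -> None:
        # Downstream task: depends on build_statement() through its return value.
        print(f"Would execute: {statement}")

    execute(build_statement())


example_recurring_queries()
```

Dropping a file like this into the project's `dags/` folder is enough for the scheduler to pick it up.
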
@@ -75,7 +88,7 @@ The astronomer project consists of four Docker containers:
 
 The PostgreSQL server is configured to listen on port 5432. The web server is listening on port 8080 and can be accessed via http://localhost:8080/ with `admin` for both username and password.
 
-In case these ports are already occupied you can change them in the file `.astro/config.yaml` inside the project folder. In our case we changed the web server port to 8081 and `postgres` port to 5435:
+If these ports are already in use, change them in `.astro/config.yaml`. For example, set the webserver to 8081 and PostgreSQL to 5435:
 ```yaml
 project:
   name: astro-project
@@ -85,9 +98,9 @@ postgres:
   port: 5435
 ```
 
-To start the project, run `astro dev start`. After Docker containers are spun up, access the Airflow UI at `http://localhost:8081` as illustrated:
+Start the project with `astro dev start`. After the containers start, access the Airflow UI at `http://localhost:8081`:
 
-![Screenshot 2021-11-10 at 14.05.15|690x242](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/f298a4c609312133e388555a9eba51733bfd5645.png)
+![Airflow UI landing page](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/f298a4c609312133e388555a9eba51733bfd5645.png)
 
 The landing page of Apache Airflow UI shows the list of all DAGs, their status, the time of the next and last run, and the metadata such as the owner and schedule. From the UI, you can manually trigger the DAG with the button in the Actions section, manually pause/unpause DAGs with the toggle button near the DAG name, and filter DAGs by tag. If you click on a specific DAG it will show the graph with tasks and dependencies between each task.
 
@@ -109,11 +122,11 @@ To configure the connection to CrateDB we need to set up a corresponding environ
 
 In this tutorial, we will set up the necessary environment variables via a `.env` file. To learn about alternative ways, please check the [Astronomer documentation](https://docs.astronomer.io/astro/environment-variables). The first variable we set is one for the CrateDB connection, as follows:
 
-`AIRFLOW_CONN_CRATEDB_CONNECTION=postgresql://<CrateDB user name>:<CrateDB user password>@<CrateDB host>/doc?sslmode=disable`
+`AIRFLOW_CONN_CRATEDB_CONNECTION=postgresql://<user>:<password>@<host>/doc?sslmode=disable`
 
 In case a TLS connection is required, change `sslmode=require`. To confirm that a new variable is applied, first, start the Airflow project and then create a bash session in the scheduler container by running `docker exec -it <scheduler_container_name> /bin/bash`.
 
-To check all environment variables that are applied, run `env`.
+Run `env` to list the applied environment variables.
 
 This will output some variables set by Astronomer by default including the variable for the CrateDB connection.
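
As a sketch of how a DAG can use this connection: the example below assumes Airflow 2.4+ with the Common SQL provider installed, uses `cratedb_connection` as the connection id implied by `AIRFLOW_CONN_CRATEDB_CONNECTION`, and runs a retention query against a hypothetical `doc.raw_metrics` table.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# Hypothetical housekeeping DAG: deletes old rows from an assumed table once a
# day, via the connection configured by AIRFLOW_CONN_CRATEDB_CONNECTION.
with DAG(
    dag_id="cratedb_cleanup",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    delete_old_rows = SQLExecuteQueryOperator(
        task_id="delete_old_rows",
        conn_id="cratedb_connection",
        sql="DELETE FROM doc.raw_metrics WHERE ts < NOW() - INTERVAL '30 days'",
    )
```
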

docs/integrate/arrow/index.md

Lines changed: 12 additions & 5 deletions
@@ -2,11 +2,18 @@
 (apache-arrow)=
 # Arrow
 
-[Apache Arrow] defines a language-independent columnar memory format for flat
-and nested data, organized for efficient analytic operations on modern
-hardware like CPUs and GPUs. The Arrow memory format also supports
-zero-copy reads for lightning-fast data access without serialization overhead.
+```{div} .float-right
+[![Apache Arrow logo](https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png){height=60px loading=lazy}][Apache Arrow]
+```
+```{div} .clearfix
+```
 
+:::{rubric} About
+:::
+
+[Apache Arrow] defines a language-independent, columnar memory format for flat
+and nested data. It enables efficient analytics on modern CPUs and GPUs and
+supports zero-copy reads, avoiding serialization overhead.
 
 :::{rubric} Learn
 :::
@@ -16,7 +23,7 @@ zero-copy reads for lightning-fast data access without serialization overhead.
 :::{grid-item-card} Tutorial: Import Parquet files
 :link: arrow-import-parquet
 :link-type: ref
-Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy.
+Import Parquet files into CrateDB with Apache Arrow and SQLAlchemy.
 :::
 
 ::::
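
A rough sketch of the Parquet import flow that the tutorial card points to, assuming the `pyarrow` and `sqlalchemy-cratedb` packages, a local CrateDB on port 4200, and file and table names invented for illustration:

```python
import pyarrow.parquet as pq
import sqlalchemy as sa

# Read a Parquet file into an Arrow table (columnar, zero-copy friendly).
table = pq.read_table("observations.parquet")

# Hand the data to CrateDB through SQLAlchemy; the `crate://` URL assumes the
# sqlalchemy-cratedb dialect and a local CrateDB node listening on port 4200.
engine = sa.create_engine("crate://localhost:4200")
table.to_pandas().to_sql("observations", engine, if_exists="append", index=False)
```
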
