docs/integrate/airflow/getting-started.md
Lines changed: 38 additions & 25 deletions
@@ -7,46 +7,59 @@ Automate CrateDB queries with Apache Airflow.
7
7
8
8
## Introduction
9
9
10
-
This first article shows how to use [Apache Airflow] with CrateDB to automate recurring queries.
10
+
This guide shows how to use [Apache Airflow] with CrateDB to automate recurring queries.
11
11
12
-
Then, we cover [Astronomer], the managed Apache Airflow provider, followed
13
-
by instructions on how to set up the project with [Astronomer CLI].
14
-
Finally, we illustrate with relatively simple examples how to schedule and
15
-
execute recurring queries.
12
+
You will:
13
+
- understand [Astronomer], a managed Apache Airflow platform,
14
+
- set up a local project with the [Astronomer CLI], and
15
+
- schedule and execute recurring queries with simple examples.
16
16
17
17
:::{rubric} Apache Airflow
18
18
:::
19
-
Apache Airflow is a platform for programmatically creating, scheduling, and monitoring workflows \[[Official documentation](https://airflow.apache.org/docs/)\]. Workflows are defined as directed acyclic graphs (DAGs) where each node in DAG represents an execution task. It is worth mentioning that each task is executed independently of other tasks and the purpose of DAG is to track the relationships between tasks. DAGs are designed to run on demand and in data intervals (e.g., twice a week).
19
+
Apache Airflow programmatically creates, schedules, and monitors workflows
20
+
\[[Official documentation](https://airflow.apache.org/docs/)\]. A workflow
21
+
is a directed acyclic graph (DAG) where each node represents a task. Each
22
+
task runs independently; the DAG tracks dependencies. Run DAGs on demand
23
+
or on schedules (for example, twice a week).
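The idea that a DAG tracks dependencies while each task runs on its own can be illustrated without Airflow. The following is a minimal sketch using only the Python standard library; it is not Airflow's API, and the task names are hypothetical. It only shows how a valid execution order falls out of the dependency graph.

```python
from graphlib import TopologicalSorter

# Hypothetical task names (not from the guide); each key maps to the set of
# tasks that must finish before it can start.
dependencies = {
    "create_table": set(),
    "insert_rows": {"create_table"},
    "delete_old_rows": {"create_table"},
    "report": {"insert_rows", "delete_old_rows"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the graph is acyclic, such an order always exists; `TopologicalSorter` raises `graphlib.CycleError` if the dependencies form a cycle.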
20
24
21
25
:::{rubric} CrateDB
22
26
:::
23
-
CrateDB is an open-source distributed database that makes storage and analysis of massive amounts of data simple and efficient. CrateDB offers a high degree of scalability, flexibility, and availability. It supports dynamic schemas, queryable objects, time-series data support, and real-time full-text search over millions of documents in just a few seconds.
27
+
CrateDB is an open-source, distributed database for storing and analyzing
28
+
large volumes of data. It offers high scalability, flexibility, and
29
+
availability, supports dynamic schemas and queryable objects, and provides
30
+
time series features and real-time full-text search over millions of
31
+
documents in seconds.
24
32
25
-
As CrateDB is designed to store and analyze massive amounts of data, continuous use of such data is a crucial task in many production applications of CrateDB. Needless to say, Apache Airflow is one of the most heavily used tools for the automation of big data pipelines. It has a very resilient architecture and scalable design. This makes Airflow an excellent tool for the automation of recurring tasks that run on CrateDB.
33
+
Because CrateDB powers large-scale data workloads, many deployments automate
34
+
recurring tasks. Apache Airflow’s resilient, scalable architecture makes it
35
+
a strong choice for orchestrating those tasks on CrateDB.
26
36
27
37
:::{rubric} Astronomer
28
38
:::
29
-
Since its inception in 2014, the complexity of Apache Airflow and its features has grown significantly. To run Airflow in production, it is no longer sufficient to know only Airflow, but also the underlying infrastructure used for Airflow deployment.
39
+
Since 2014, Apache Airflow and its ecosystem have grown significantly. To run Airflow in production, you need to understand both Airflow and the underlying deployment infrastructure.
30
40
31
-
To help maintain complex environments, one can use managed Apache Airflow providers such as Astronomer. Astronomer is one of the main managed providers that allows users to easily run and monitor Apache Airflow deployments. It runs on Kubernetes, abstracts all underlying infrastructure details, and provides a clean interface for constructing and managing different workflows.
41
+
To simplify operations, use a managed Apache Airflow provider such as Astronomer. Astronomer runs on Kubernetes, abstracts infrastructure details, and provides a clean interface for building and operating workflows.
32
42
33
-
## Setting up an Airflow project
34
-
We set up a new Airflow project on an 8-core machine with 30GB RAM running Ubuntu 22.04 LTS. To initialize the project we use Astronomer CLI. The installation process requires [Docker](https://www.docker.com/) version 18.09 or higher. To install the latest version of the Astronomer CLI on Ubuntu, run:
43
+
## Set up a local Airflow project
35
44
36
-
`curl -sSL install.astronomer.io | sudo bash -s`
37
-
38
-
To make sure that you installed Astronomer CLI on your machine, run:
45
+
The examples use an 8‑core machine with 30 GB RAM on Ubuntu 22.04 LTS. Install the Astronomer CLI (requires [Docker](https://www.docker.com/) 18.09+). On Ubuntu:
46
+
```shell
47
+
curl -sSL install.astronomer.io | sudo bash -s
48
+
```
39
49
40
-
`astro version`
50
+
Verify the installation:
51
+
```shell
52
+
astro version
53
+
```
41
54
42
-
If the installation was successful, you will see the output similar to:
55
+
Example output:
43
56
44
57
`Astro CLI Version: 1.14.1`
45
58
46
-
To install Astronomer CLI on another operating system, follow the [official documentation](https://www.astronomer.io/docs/astro/cli/install-cli).
47
-
After the successful installation of Astronomer CLI, create and initialize the new project as follows:
59
+
For other operating systems, follow the [official documentation](https://www.astronomer.io/docs/astro/cli/install-cli).
60
+
After installing the Astronomer CLI, initialize a new project:
48
61
49
-
* Create project directory:
62
+
* Create a project directory:
50
63
```bash
51
64
mkdir astro-project && cd astro-project
52
65
```
@@ -75,7 +88,7 @@ The astronomer project consists of four Docker containers:
75
88
76
89
The PostgreSQL server is configured to listen on port 5432. The web server is listening on port 8080 and can be accessed via http://localhost:8080/ with `admin` for both username and password.
77
90
78
-
In case these ports are already occupied you can change them in the file `.astro/config.yaml` inside the project folder. In our case we changed the web server port to 8081 and `postgres` port to 5435:
91
+
If these ports are already in use, change them in `.astro/config.yaml`. For example, set the webserver to 8081 and PostgreSQL to 5435:
79
92
```yaml
80
93
project:
81
94
name: astro-project
@@ -85,9 +98,9 @@ postgres:
85
98
port: 5435
86
99
```
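Before editing the port settings, it can help to check whether the default ports are actually taken. The sketch below is one possible check using only the Python standard library; the function name `port_in_use` is hypothetical, and it assumes the services would listen on the local loopback address.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8080 and 5432 are the default web server and PostgreSQL ports named above.
    for port in (8080, 5432):
        state = "in use" if port_in_use(port) else "free"
        print(f"port {port}: {state}")
```

A port reported as in use is a candidate for remapping in `.astro/config.yaml`.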
87
100
88
-
To start the project, run `astro dev start`. After Docker containers are spun up, access the Airflow UI at `http://localhost:8081` as illustrated:
101
+
Start the project with `astro dev start`. After the containers start, access the Airflow UI at `http://localhost:8081`:
89
102
90
-

The landing page of Apache Airflow UI shows the list of all DAGs, their status, the time of the next and last run, and the metadata such as the owner and schedule. From the UI, you can manually trigger the DAG with the button in the Actions section, manually pause/unpause DAGs with the toggle button near the DAG name, and filter DAGs by tag. If you click on a specific DAG it will show the graph with tasks and dependencies between each task.
93
106
@@ -109,11 +122,11 @@ To configure the connection to CrateDB we need to set up a corresponding environ
109
122
110
123
In this tutorial, we will set up the necessary environment variables via a `.env` file. To learn about alternative ways, please check the [Astronomer documentation](https://docs.astronomer.io/astro/environment-variables). The first variable we set is one for the CrateDB connection, as follows:
111
124
112
-
`AIRFLOW_CONN_CRATEDB_CONNECTION=postgresql://<CrateDB user name>:<CrateDB user password>@<CrateDB host>/doc?sslmode=disable`
If a TLS connection is required, set `sslmode=require`. To confirm that the new variable is applied, first start the Airflow project, then open a bash session in the scheduler container by running `docker exec -it <scheduler_container_name> /bin/bash`.
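Credentials that contain characters such as `@` or `/` can break a connection URI unless they are percent-encoded. The following is a minimal sketch for producing the `.env` line, assuming the URI shape shown above; the helper name `cratedb_conn_uri` and the example user, password, and host are hypothetical placeholders.

```python
from urllib.parse import quote


def cratedb_conn_uri(user, password, host, schema="doc", sslmode="disable"):
    """Build a postgresql:// URI in the shape of the connection variable."""
    # Percent-encode credentials so characters such as '@' or '/' cannot be
    # mistaken for URI delimiters.
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    return f"postgresql://{user_enc}:{password_enc}@{host}/{schema}?sslmode={sslmode}"


# Hypothetical example values; replace them with your own CrateDB settings.
uri = cratedb_conn_uri("crate", "p@ss/word", "localhost:5432", sslmode="require")
print("AIRFLOW_CONN_CRATEDB_CONNECTION=" + uri)
```

The printed line can be pasted into the `.env` file of the project.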
115
128
116
-
To check all environment variables that are applied, run `env`.
129
+
Run `env` to list the applied environment variables.
117
130
118
131
This will output some variables set by Astronomer by default including the variable for the CrateDB connection.
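As an alternative to scanning the `env` output by eye, a short script run inside the scheduler container can read the variable and show its parts without printing the password. This is only a sketch using the Python standard library; the helper name `describe_conn` is hypothetical.

```python
import os
from urllib.parse import parse_qs, unquote, urlsplit


def describe_conn(uri):
    """Split a connection URI into its parts, hiding the password itself."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "user": unquote(parts.username or ""),
        "password_set": parts.password is not None,
        "host": parts.hostname,
        "port": parts.port,
        "schema": parts.path.lstrip("/"),
        "sslmode": parse_qs(parts.query).get("sslmode", [None])[0],
    }


conn_uri = os.environ.get("AIRFLOW_CONN_CRATEDB_CONNECTION")
if conn_uri is None:
    print("AIRFLOW_CONN_CRATEDB_CONNECTION is not set")
else:
    print(describe_conn(conn_uri))
```

If the variable is reported as not set, check that the `.env` file is in the project folder and restart the project.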