Development Guide

This guide covers setting up your development environment, running migrations, and common development workflows for the ETL project.

Prerequisites

Before starting, ensure you have the following installed:

Required Tools

  • Rust (latest stable): Install Rust
  • PostgreSQL client (psql): Required for database operations
  • Docker Compose: For running PostgreSQL and other services
  • kubectl: For Kubernetes operations
  • SQLx CLI: For database migrations

Install SQLx CLI:

cargo install --version 0.9.0-alpha.1 sqlx-cli --no-default-features --features rustls,postgres --locked

Optional Tools

  • OrbStack: Recommended for local Kubernetes development (alternative to Docker Desktop)

Task Runner

Common development tasks are available through cargo x, a shorthand alias for cargo xtask. Run cargo x --help to see all available commands.

cargo x fmt              # format code with nightly rustfmt
cargo x fmt --check      # check formatting without changes
cargo x check            # pre-PR gate: fmt, sort, clippy
cargo x fix              # auto-fix: clippy --fix, fmt, sort
cargo x msrv             # verify MSRV consistency
cargo x init             # set up local dev environment
cargo x migrate          # run database migrations
cargo x deploy-local     # deploy replicator to local OrbStack k8s
cargo x test-clickhouse  # run ClickHouse integration tests
cargo x vendor-duckdb    # download and vendor DuckDB extensions

Formatting

The workspace stays on the stable toolchain pinned in rust-toolchain.toml for builds, tests, and linting. Formatting is the only workflow that uses nightly Rust, because the repository relies on nightly-only rustfmt options for import grouping and layout.

cargo x fmt
cargo x fmt --check

Both default to nightly-2026-04-15. You can temporarily override the formatter toolchain with RUSTFMT_NIGHTLY_TOOLCHAIN, but CI and the repository defaults should stay pinned so formatting does not drift.

Quick Start

The fastest way to get started:

# From the project root
cargo x init

This command will:

  1. Start PostgreSQL, ClickHouse, and the local Iceberg dependencies via Docker Compose.
  2. Run etl-api migrations.
  3. Seed the default replicator image.
  4. Configure the Kubernetes environment (OrbStack).

Database Setup

Using the Setup Script

cargo x init provides a complete development environment setup:

# Use default settings (Postgres on port 5430)
cargo x init

# Customize database settings
POSTGRES_PORT=5432 POSTGRES_DB=mydb cargo x init

# Skip Docker if you already have Postgres running
SKIP_DOCKER=1 cargo x init

# Use persistent storage
POSTGRES_DATA_VOLUME=/path/to/data cargo x init

Environment Variables:

| Variable | Default | Description |
| --- | --- | --- |
| POSTGRES_USER | postgres | Database user |
| POSTGRES_PASSWORD | postgres | Database password |
| POSTGRES_DB | postgres | Database name |
| POSTGRES_PORT | 5430 | Database port |
| POSTGRES_HOST | localhost | Database host |
| CLICKHOUSE_HTTP_PORT | 8123 | ClickHouse HTTP port |
| CLICKHOUSE_NATIVE_PORT | 9000 | ClickHouse native TCP port |
| CLICKHOUSE_USER | etl | ClickHouse user for the local Docker Compose setup |
| CLICKHOUSE_PASSWORD | etl | ClickHouse password for the local Docker Compose setup |
| SKIP_DOCKER | (empty) | Skip Docker Compose if set |
| POSTGRES_DATA_VOLUME | (empty) | Path for PostgreSQL persistent storage |
| CLICKHOUSE_DATA_VOLUME | (empty) | Path for ClickHouse persistent storage |
| REPLICATOR_IMAGE | ramsup/etl-replicator:latest | Default replicator image |
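
The POSTGRES_* defaults combine into a connection URL of the postgres://USER:PASSWORD@HOST:PORT/DB form used elsewhere in this guide. The following is a minimal sketch only: local_database_url is a hypothetical helper name, not part of cargo x, and it assumes the user and password contain no characters that need URL encoding.

```shell
# Hypothetical helper: build a connection URL from the POSTGRES_* variables,
# falling back to the defaults documented for cargo x init.
local_database_url() {
  printf 'postgres://%s:%s@%s:%s/%s\n' \
    "${POSTGRES_USER:-postgres}" \
    "${POSTGRES_PASSWORD:-postgres}" \
    "${POSTGRES_HOST:-localhost}" \
    "${POSTGRES_PORT:-5430}" \
    "${POSTGRES_DB:-postgres}"
}
```

One possible use is export DATABASE_URL="$(local_database_url)" before running cargo x migrate against the local Docker Compose database.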

PostgreSQL 18+ containers store data under /var/lib/postgresql/<major>/data, so the Docker Compose setup mounts the parent /var/lib/postgresql directory to keep upgrades compatible.

The source PostgreSQL container started by cargo x init or cargo xtask postgres start supports TLS by default. The task runner generates a local test CA and server certificate under target/postgres-tls/, then copies the server certificate and key into the container. Local clients may still connect without TLS; set TESTS_DATABASE_TLS_ENABLED=true when running tests to require verified TLS using the generated root certificate.

The same Docker Compose stack also starts ClickHouse on http://localhost:8123 by default, which is enough for local destination development and ClickHouse integration tests.

Manual Setup

If you prefer manual setup or have an existing PostgreSQL instance:

Important: The etl-api migrations and ETL source/store migrations can run on separate databases. You might have:

  • The etl-api using its own dedicated Postgres instance for the control plane
  • The ETL source helpers and Postgres store tables on the database you're replicating from (source database)
  • Or both on the same database (for simpler local development setups)

Single Database Setup

If using one database for both the API and ETL source/store objects:

export DATABASE_URL=postgres://USER:PASSWORD@HOST:PORT/DB

# Run all migrations on the same database
cargo x migrate

Separate Database Setup

If using separate databases (recommended for production):

# API migrations on the control plane database
export DATABASE_URL=postgres://USER:PASSWORD@API_HOST:PORT/API_DB
cargo x migrate etl-api

# ETL migrations on the source database
export DATABASE_URL=postgres://USER:PASSWORD@SOURCE_HOST:PORT/SOURCE_DB
cargo x migrate etl

This separation allows you to:

  • Scale the control plane independently from replication workloads
  • Keep ETL source/store objects close to the source data
  • Isolate concerns between infrastructure management and data replication

Database Migrations

The project uses SQLx for database migrations. There are two sets of migrations:

ETL API Migrations

Located in crates/etl-api/migrations/, these create the control plane schema (app schema) for managing tenants, sources, destinations, and pipelines.

Running API migrations:

# From project root
cargo x migrate etl-api

# Or manually with SQLx CLI
sqlx migrate run --source crates/etl-api/migrations

Creating a new API migration:

cd crates/etl-api
sqlx migrate add <migration_name>

Reverting the most recent API migration:

cd crates/etl-api
sqlx migrate revert

Updating SQLx metadata after schema changes:

cd crates/etl-api
cargo sqlx prepare

ETL Source And Store Migrations

Located under crates/etl/migrations/, these prepare the source database:

  • crates/etl/migrations/source/: ETL source helpers required by every pipeline, such as schema snapshot functions and the DDL event trigger. Pipeline::start() runs these automatically.
  • crates/etl/migrations/postgres_store/: Postgres-backed state store tables used to persist replication state, versioned table schemas, and destination metadata. PostgresStore::new() runs these automatically.

Both migration sets write to etl._sqlx_migrations. When running them separately, always use SQLx's --ignore-missing flag so each migrator validates its own versions while ignoring versions owned by the other set.

Do not edit an already-applied migration file, including comments or whitespace. SQLx stores a SHA-384 checksum of the full migration contents, so even comment-only changes will break existing databases with a checksum mismatch.
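
To see why a comment-only edit matters, the sketch below hashes two throwaway files that differ only by a SQL comment. It illustrates the general property of a SHA-384 digest over file contents; it does not call SQLx, and the sha384_of helper and file names are made up for the example.

```shell
# Print the SHA-384 digest of a file, using whichever common tool is installed.
sha384_of() {
  if command -v sha384sum >/dev/null 2>&1; then
    sha384sum "$1" | cut -d ' ' -f 1
  elif command -v shasum >/dev/null 2>&1; then
    shasum -a 384 "$1" | cut -d ' ' -f 1
  else
    openssl dgst -sha384 -r "$1" | cut -d ' ' -f 1
  fi
}

# Two throwaway files that differ only by a leading SQL comment.
tmpdir="$(mktemp -d)"
printf '%s\n' 'create table t (id int);' > "$tmpdir/original.sql"
printf '%s\n' '-- note' 'create table t (id int);' > "$tmpdir/edited.sql"

original_sum="$(sha384_of "$tmpdir/original.sql")"
edited_sum="$(sha384_of "$tmpdir/edited.sql")"

if [ "$original_sum" != "$edited_sum" ]; then
  echo "checksums differ"
fi
rm -rf "$tmpdir"
```

Because the digest covers every byte of the file, the safe way to change an applied migration is to add a new migration rather than edit the old one.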

Running ETL migrations manually:

# From project root
cargo x migrate etl

# Or manually with SQLx CLI (requires setting search_path)
psql $DATABASE_URL -c "create schema if not exists etl;"
sqlx migrate run --source crates/etl/migrations/postgres_store --database-url "${DATABASE_URL}?options=-csearch_path%3Detl" --ignore-missing
sqlx migrate run --source crates/etl/migrations/source --database-url "${DATABASE_URL}?options=-csearch_path%3Detl" --ignore-missing
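
The commands above append ?options=... directly to DATABASE_URL, which only works when the URL has no query string yet. A small sketch of a more defensive variant follows; etl_migration_url is a hypothetical helper name, and it assumes the URL does not already carry an options parameter.

```shell
# Hypothetical helper: append the search_path option to a connection URL.
# Uses '&' when the URL already has a query string, '?' otherwise.
etl_migration_url() {
  case "$1" in
    *\?*) printf '%s&options=-csearch_path%%3Detl\n' "$1" ;;
    *)    printf '%s?options=-csearch_path%%3Detl\n' "$1" ;;
  esac
}
```

The result can then be passed as --database-url "$(etl_migration_url "$DATABASE_URL")".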

Reverting ETL migrations manually:

# Revert source migrations.
sqlx migrate revert --source crates/etl/migrations/source --database-url "${DATABASE_URL}?options=-csearch_path%3Detl" --ignore-missing

# Revert Postgres store migrations.
sqlx migrate revert --source crates/etl/migrations/postgres_store --database-url "${DATABASE_URL}?options=-csearch_path%3Detl" --ignore-missing

Use --target-version 0 to revert every migration in one migration set. Revert source and Postgres store migrations separately because ordering is scoped to the selected migration folder.
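
As a sketch of a full revert of one migration set, the hypothetical revert_all helper below assembles the same command as above with --target-version 0 added. Setting DRY_RUN=1 is a convention of this sketch, not a project feature: it prints the command instead of running it, and the printed form is for inspection only because it drops quoting.

```shell
# Hypothetical helper: revert every migration in one ETL migration set.
# Usage: revert_all source | revert_all postgres_store
revert_all() {
  set_name="$1"
  url="${DATABASE_URL}?options=-csearch_path%3Detl"
  # Rebuild the positional parameters as the command to run.
  set -- sqlx migrate revert \
    --source "crates/etl/migrations/${set_name}" \
    --database-url "$url" \
    --ignore-missing \
    --target-version 0
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"   # print only, do not touch the database
  else
    "$@"
  fi
}
```

Calling revert_all source and revert_all postgres_store as two separate steps keeps each revert scoped to its own migration folder.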

Important: Migrations are run automatically at the appropriate runtime boundary: source migrations when a pipeline starts, and Postgres store migrations when the Postgres-backed state store is initialized. However, if you integrate the etl crate directly into your own application and want to prepare the source database ahead of time, you can also run these migrations manually. This design decision ensures:

  • The standalone replicator binary works out-of-the-box
  • Library users have explicit control over when migrations run
  • CI/CD pipelines can pre-apply migrations independently

When to run migrations manually:

  • Integrating etl as a library in your own application
  • Pre-creating the replication state store schema before deployment
  • Testing migrations independently
  • CI/CD pipelines that separate migration and deployment steps

Creating a new Postgres state store migration:

cd crates/etl
sqlx migrate add -r --source migrations/postgres_store <migration_name>

Creating a new ETL source migration:

cd crates/etl
sqlx migrate add -r --source migrations/source <migration_name>

Running the Services

Both etl-api and etl-replicator binaries use hierarchical configuration loading from the configuration/ directory within each crate. Configuration is loaded in this order:

  1. Base configuration: configuration/base.yaml (always loaded)
  2. Environment-specific: configuration/{environment}.yaml (e.g., dev.yaml, prod.yaml)
  3. Environment variable overrides: Prefixed with APP_ (e.g., APP_DATABASE__URL)
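
The override naming can be read as a path into the YAML configuration. The sketch below is an illustration only and rests on an assumption suggested by the APP_DATABASE__URL example: that a double underscore separates nesting levels. The app_env_to_key helper is hypothetical and is not how the binaries parse their configuration.

```shell
# Hypothetical illustration: map an override variable name to a config path.
# Assumes "__" separates nesting levels, as APP_DATABASE__URL suggests.
app_env_to_key() {
  printf '%s\n' "${1#APP_}" | tr '[:upper:]' '[:lower:]' | sed 's/__/./g'
}
```

Under that assumption, app_env_to_key APP_DATABASE__URL prints database.url, meaning the variable would override the url key nested under database.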

Environment Selection:

The environment is determined by the APP_ENVIRONMENT variable:

  • Default: prod (if APP_ENVIRONMENT is not set)
  • Available: dev, staging, prod

# Run with dev environment
APP_ENVIRONMENT=dev cargo run

# Run with production environment (default)
cargo run

# Override specific config values
APP_ENVIRONMENT=dev APP_DATABASE__URL=postgres://localhost/mydb cargo run
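
To check which files a given invocation would consult, the sketch below lists them in load order. config_files_for is a hypothetical helper; rejecting environment names outside dev, staging, and prod is a choice made for this sketch, and the binaries themselves may handle unknown values differently.

```shell
# Hypothetical helper: list the configuration files for a crate, in load order.
config_files_for() {
  crate_dir="$1"
  environment="${APP_ENVIRONMENT:-prod}"   # prod is the documented default
  case "$environment" in
    dev|staging|prod) ;;
    *) echo "unknown environment: $environment" >&2; return 1 ;;
  esac
  printf '%s\n' \
    "$crate_dir/configuration/base.yaml" \
    "$crate_dir/configuration/$environment.yaml"
}
```

For example, APP_ENVIRONMENT=dev config_files_for crates/etl-api prints the base.yaml path followed by the dev.yaml path.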

ETL API

Running from Source

cd crates/etl-api
APP_ENVIRONMENT=dev cargo run

The API loads configuration from crates/etl-api/configuration/{environment}.yaml. See crates/etl-api/README.md for available configuration options.

Running with Docker

Docker images are available for the etl-api. You must mount the configuration files and can override settings via environment variables:

docker run \
  -v $(pwd)/crates/etl-api/configuration/base.yaml:/app/configuration/base.yaml \
  -v $(pwd)/crates/etl-api/configuration/dev.yaml:/app/configuration/dev.yaml \
  -e APP_ENVIRONMENT=dev \
  -p 8080:8080 \
  ramsup/etl-api:latest

Configuration requirements:

  • Mount both base.yaml and your environment-specific config file (e.g., dev.yaml)
  • Set APP_ENVIRONMENT to match your mounted environment file
  • Override specific values using APP_ prefixed environment variables

Kubernetes Setup (ETL API Only)

The etl-api manages replicator deployments on Kubernetes by dynamically creating StatefulSets, Secrets, and ConfigMaps. The etl-api requires Kubernetes, but the etl-replicator binary can run independently without any Kubernetes setup.

Prerequisites:

  • OrbStack with Kubernetes enabled (or another local Kubernetes cluster)
  • kubectl configured with the orbstack context
  • Pre-defined Kubernetes resources (see below)

Required Pre-Defined Resources:

The etl-api expects these resources to exist before it can deploy replicators:

  1. Namespace: etl-data-plane - Where all replicator pods and related resources are created
  2. ConfigMap: trusted-root-certs-config - Provides trusted root certificates for TLS connections

These are defined in scripts/ and should be applied before running the API:

kubectl --context orbstack apply -f scripts/etl-data-plane.yaml
kubectl --context orbstack apply -f scripts/trusted-root-certs-config.yaml

Note: For the complete list of expected Kubernetes resources and their specifications, refer to the constants and resource creation logic in crates/etl-api/src/k8s/http.rs.

ETL Replicator

The replicator can run as a standalone binary without Kubernetes.

Running from Source

cd crates/etl-replicator
APP_ENVIRONMENT=dev cargo run

The replicator loads configuration from crates/etl-replicator/configuration/{environment}.yaml.

Running with Docker

Docker images are available for the etl-replicator. You must mount the configuration files and can override settings via environment variables:

docker run \
  -v $(pwd)/crates/etl-replicator/configuration/base.yaml:/app/configuration/base.yaml \
  -v $(pwd)/crates/etl-replicator/configuration/dev.yaml:/app/configuration/dev.yaml \
  -e APP_ENVIRONMENT=dev \
  ramsup/etl-replicator:latest

Configuration requirements:

  • Mount both base.yaml and your environment-specific config file (e.g., dev.yaml)
  • Set APP_ENVIRONMENT to match your mounted environment file
  • Override specific values using APP_ prefixed environment variables

Note: While the replicator is typically deployed as a Kubernetes pod managed by the etl-api, it does not require Kubernetes to function. You can run it as a standalone process on any machine with the appropriate configuration.

Running Tests

The project includes comprehensive test suites that require a PostgreSQL database. Tests use environment variables for database configuration to ensure isolation and reproducibility.

Test Environment Variables

PostgreSQL Test Variables

All tests that interact with PostgreSQL require the following environment variables to be set:

| Variable | Required | Description |
| --- | --- | --- |
| TESTS_DATABASE_HOST | Yes | PostgreSQL server hostname (e.g., localhost) |
| TESTS_DATABASE_PORT | Yes | PostgreSQL server port (e.g., 5430) |
| TESTS_DATABASE_USERNAME | Yes | Database user (e.g., postgres) |
| TESTS_DATABASE_PASSWORD | No | Database password (optional) |
| TESTS_DATABASE_TLS_ENABLED | No | Require verified TLS for Postgres test clients when set to true |
| TESTS_DATABASE_TLS_ROOT_CERT | No | Path to the trusted root certificate; defaults to target/postgres-tls/root.crt |

Note: Each test creates a unique database with a UUID-based name to ensure test isolation. The test databases are automatically cleaned up after tests complete.
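
A quick preflight check can catch a missing required variable before a long test run starts. The sketch below is one way to do it; missing_test_vars is a hypothetical helper, not something the test suite provides, and it inspects only exported variables because those are the ones cargo test inherits.

```shell
# Hypothetical preflight check: print each required variable that is
# not exported or is empty.
missing_test_vars() {
  for name in TESTS_DATABASE_HOST TESTS_DATABASE_PORT TESTS_DATABASE_USERNAME; do
    # env lists only exported variables, which is what cargo test will see.
    env | grep "^${name}=." >/dev/null || printf '%s\n' "$name"
  done
}
```

An empty result means all three required variables are visible to child processes; any printed name still needs to be exported or set inline.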

BigQuery Test Variables

BigQuery destination tests require Google Cloud credentials:

| Variable | Required | Description |
| --- | --- | --- |
| TESTS_BIGQUERY_PROJECT_ID | Yes | GCP project ID for BigQuery |
| TESTS_BIGQUERY_SA_KEY_PATH | Yes | Path to service account JSON key file |

Note: BigQuery tests are only run when the bigquery and test-utils features are enabled. Each test creates a unique dataset with a UUID-based name for isolation.

Iceberg Test Variables

Iceberg destination tests use local MinIO and Lakekeeper instances. The following services must be running:

  • Lakekeeper: http://localhost:8182 (REST catalog)
  • MinIO: http://localhost:9010 (S3-compatible storage)
    • Username: minio-admin
    • Password: minio-admin-password

Note: Iceberg tests are only run when the iceberg and test-utils features are enabled. These use hardcoded local URLs and do not require environment variables.

ClickHouse Test Variables

ClickHouse destination tests require a reachable ClickHouse HTTP endpoint:

| Variable | Required | Description |
| --- | --- | --- |
| TESTS_CLICKHOUSE_URL | Yes | ClickHouse HTTP URL (for example, http://localhost:8123) |
| TESTS_CLICKHOUSE_USER | Yes | ClickHouse user name (for the local Docker Compose setup, use etl) |
| TESTS_CLICKHOUSE_PASSWORD | No | ClickHouse password; for the local Docker Compose setup, use etl |

Note: ClickHouse tests are only run when the clickhouse and test-utils features are enabled. Each test creates a unique database in ClickHouse and drops it automatically when the test finishes. The Docker Compose setup started by cargo x init is sufficient for these tests.

Test Output and Logging

| Variable | Description |
| --- | --- |
| ENABLE_TRACING=1 | Enable tracing output during test execution (useful for debugging) |
| RUST_LOG | Control log level (e.g., debug, info, warn, error) |

Example:

# Run tests with debug output
ENABLE_TRACING=1 RUST_LOG=debug cargo test test_name -- --nocapture

Setting Up Test Environment

Option 1: Inline Environment Variables (Recommended)

The most reliable way is to set environment variables directly in the test command:

TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api

Option 2: Export in Current Shell Session

Export variables in your current shell session, then run tests:

# PostgreSQL test configuration
export TESTS_DATABASE_HOST=localhost
export TESTS_DATABASE_PORT=5430
export TESTS_DATABASE_USERNAME=postgres
export TESTS_DATABASE_PASSWORD=postgres
# Optional when using the local Docker Compose Postgres from cargo x init.
export TESTS_DATABASE_TLS_ENABLED=true

# BigQuery test configuration (optional - only needed for BigQuery tests)
export TESTS_BIGQUERY_PROJECT_ID=your-gcp-project-id
export TESTS_BIGQUERY_SA_KEY_PATH=/path/to/service-account-key.json

# ClickHouse test configuration (optional - only needed for ClickHouse tests)
export TESTS_CLICKHOUSE_URL=http://localhost:8123
export TESTS_CLICKHOUSE_USER=etl
export TESTS_CLICKHOUSE_PASSWORD=etl

# Enable test output (optional)
export ENABLE_TRACING=1
export RUST_LOG=info

# Now run tests
cargo test -p etl-api

Option 3: Use a .env File

Create a .env.test file and source it:

# .env.test

# PostgreSQL (required for most tests)
TESTS_DATABASE_HOST=localhost
TESTS_DATABASE_PORT=5430
TESTS_DATABASE_USERNAME=postgres
TESTS_DATABASE_PASSWORD=postgres

# BigQuery (optional - only for BigQuery tests)
TESTS_BIGQUERY_PROJECT_ID=your-gcp-project-id
TESTS_BIGQUERY_SA_KEY_PATH=/path/to/service-account-key.json

# ClickHouse (optional - only for ClickHouse tests)
TESTS_CLICKHOUSE_URL=http://localhost:8123
TESTS_CLICKHOUSE_USER=etl
TESTS_CLICKHOUSE_PASSWORD=etl

# Test output (optional)
ENABLE_TRACING=1
RUST_LOG=info

# Source the file with allexport enabled so cargo test inherits the variables
set -a
source .env.test
set +a
cargo test -p etl-api
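
Variables read from a KEY=value file are plain shell variables unless they are exported, and child processes such as cargo test only see exported ones. The sketch below demonstrates the difference with a throwaway file and a child sh process standing in for cargo test; the temporary file is an assumption of the example, not part of the project.

```shell
# Throwaway file in the same KEY=value format as .env.test.
env_file="$(mktemp)"
printf 'TESTS_DATABASE_PORT=5430\n' > "$env_file"
unset TESTS_DATABASE_PORT

# Plain source: the variable exists in this shell but is not exported.
. "$env_file"
plain="$(sh -c 'printf "%s" "${TESTS_DATABASE_PORT:-unset}"')"

# With allexport enabled, every assignment made while sourcing is exported.
unset TESTS_DATABASE_PORT
set -a
. "$env_file"
set +a
exported="$(sh -c 'printf "%s" "${TESTS_DATABASE_PORT:-unset}"')"

echo "plain source: $plain, with set -a: $exported"
rm -f "$env_file"
```

Wrapping the source command in set -a and set +a, or prefixing each line of the file with export, makes the values reach the test process.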

Running Tests

Important: Environment variables must be set in the same command as cargo test, or exported in your current shell session before running tests.

# Run all tests (requires env variables)
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test

# Run tests for a specific package
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api

# Run tests for packages with test-utils feature (etl, etl-postgres, etl-destinations)
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl --features test-utils

# Run a specific test
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api --test tenants tenant_can_be_created

# Run tests with tracing output for debugging
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres ENABLE_TRACING=1 RUST_LOG=info cargo test -p etl-api --test tenants tenant_can_be_created -- --nocapture

# Run the ClickHouse destination integration test against the local Docker Compose service
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres TESTS_CLICKHOUSE_URL=http://localhost:8123 TESTS_CLICKHOUSE_USER=etl TESTS_CLICKHOUSE_PASSWORD=etl cargo test -p etl-destinations --features clickhouse,test-utils clickhouse_pipeline -- --nocapture

Packages requiring --features test-utils:

  • etl
  • etl-postgres
  • etl-destinations

Packages that don't require feature flags:

  • etl-api
  • etl-config
  • etl-telemetry
  • etl-replicator
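
The two lists can be folded into a small lookup so that the right flags are chosen per package. This is a sketch with a hypothetical helper name, cargo_test_args; it covers only the test-utils feature, so destination suites still need their own feature (for example clickhouse) added by hand.

```shell
# Hypothetical helper: print the cargo test arguments for a workspace package.
cargo_test_args() {
  case "$1" in
    etl|etl-postgres|etl-destinations)
      printf '%s\n' "-p $1 --features test-utils" ;;
    etl-api|etl-config|etl-telemetry|etl-replicator)
      printf '%s\n' "-p $1" ;;
    *)
      echo "unknown package: $1" >&2; return 1 ;;
  esac
}
```

A possible invocation is cargo test $(cargo_test_args etl), left unquoted on purpose so the shell splits the output into separate arguments.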

Note: Ensure PostgreSQL is running and accessible at the configured host and port before running tests. The test suite will fail if it cannot connect to the database or if the required environment variables are not set.

Troubleshooting

Database Connection Issues

If you encounter connection issues:

  1. Verify PostgreSQL is running:

    docker-compose -f scripts/docker-compose.yaml ps
  2. Check the connection:

    psql $DATABASE_URL -c "SELECT 1;"
  3. Ensure the correct port is used (default: 5430)
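
For the port check in step 3, the sketch below pulls the port out of a connection URL so it can be compared with the expected value. db_url_port is a hypothetical helper; it assumes a postgres:// URL with URL-encoded credentials and a hostname or IPv4 address rather than a bracketed IPv6 literal.

```shell
# Hypothetical helper: extract the port from a postgres:// URL.
# Prints nothing when the URL has no explicit port.
db_url_port() {
  rest="${1#*://}"        # drop the scheme
  rest="${rest%%[/?]*}"   # drop the path and query string
  rest="${rest##*@}"      # drop user:password@
  case "$rest" in
    *:*) printf '%s\n' "${rest##*:}" ;;
  esac
}
```

Running db_url_port "$DATABASE_URL" and comparing the result with 5430 (or your POSTGRES_PORT override) quickly shows whether the URL points at the Docker Compose database.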

Migration Issues

If migrations fail:

  1. Check if the database exists:

    psql $DATABASE_URL -c "\l"
  2. Verify SQLx CLI is installed:

    sqlx --version
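  3. Check migration history. The ETL source and Postgres store migrations are recorded in etl._sqlx_migrations, so query that table when debugging them:

    psql $DATABASE_URL -c "SELECT * FROM _sqlx_migrations;"
    psql $DATABASE_URL -c "SELECT * FROM etl._sqlx_migrations;"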

Kubernetes Issues

If Kubernetes resources aren't deploying:

  1. Verify context:

    kubectl config current-context
  2. Check cluster status:

    kubectl cluster-info
  3. View events:

    kubectl get events -n etl-data-plane --sort-by='.lastTimestamp'