This guide covers setting up your development environment, running migrations, and common development workflows for the ETL project.
- Task Runner
- Prerequisites
- Quick Start
- Database Setup
- Database Migrations
- Running the Services
- Kubernetes Setup
- Common Development Tasks
Before starting, ensure you have the following installed:
- Rust (latest stable): Install Rust
- PostgreSQL client (psql): Required for database operations
- Docker Compose: For running PostgreSQL and other services
- kubectl: For Kubernetes operations
- SQLx CLI: For database migrations
Install SQLx CLI:
cargo install --version 0.9.0-alpha.1 sqlx-cli --no-default-features --features rustls,postgres --locked
- OrbStack: Recommended for local Kubernetes development (alternative to Docker Desktop)
  - Install OrbStack
  - Enable Kubernetes in OrbStack settings
Common development tasks are available through cargo x, a shorthand alias for cargo xtask.
Run cargo x --help to see all available commands.
cargo x fmt # format code with nightly rustfmt
cargo x fmt --check # check formatting without changes
cargo x check # pre-PR gate: fmt, sort, clippy
cargo x fix # auto-fix: clippy --fix, fmt, sort
cargo x msrv # verify MSRV consistency
cargo x init # set up local dev environment
cargo x migrate # run database migrations
cargo x deploy-local # deploy replicator to local OrbStack k8s
cargo x test-clickhouse # run ClickHouse integration tests
cargo x vendor-duckdb # download and vendor DuckDB extensions
The workspace stays on the stable toolchain pinned in rust-toolchain.toml for builds, tests, and linting.
Formatting is the only workflow that uses nightly Rust, because the repository relies on nightly-only
rustfmt options for import grouping and layout.
cargo x fmt
cargo x fmt --check
Both default to nightly-2026-04-15. You can temporarily override the formatter toolchain with
RUSTFMT_NIGHTLY_TOOLCHAIN, but CI and the repository defaults should stay pinned so formatting does not drift.
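As a rough illustration of the override described above, the sketch below resolves the formatter toolchain the way the text describes it: use RUSTFMT_NIGHTLY_TOOLCHAIN when it is set, otherwise fall back to the pinned default. The helper name fmt_toolchain is hypothetical and not part of the task runner, and the real resolution logic inside cargo x may differ.

```shell
# Hypothetical helper: prints the toolchain that formatting would use,
# assuming the override variable simply replaces the pinned default.
fmt_toolchain() {
  printf '%s\n' "${RUSTFMT_NIGHTLY_TOOLCHAIN:-nightly-2026-04-15}"
}

# One-off override for a single invocation (nothing is persisted):
#   RUSTFMT_NIGHTLY_TOOLCHAIN=nightly-YYYY-MM-DD cargo x fmt --check
fmt_toolchain
```

Setting the variable inline for one command, rather than exporting it, keeps later runs in the same shell on the pinned default.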
The fastest way to get started:
# From the project root
cargo x init
This script will:
- Start PostgreSQL, ClickHouse, and the local Iceberg dependencies via Docker Compose.
- Run etl-api migrations.
- Seed the default replicator image.
- Configure the Kubernetes environment (OrbStack).
cargo x init provides a complete development environment setup:
# Use default settings (Postgres on port 5430)
cargo x init
# Customize database settings
POSTGRES_PORT=5432 POSTGRES_DB=mydb cargo x init
# Skip Docker if you already have Postgres running
SKIP_DOCKER=1 cargo x init
# Use persistent storage
POSTGRES_DATA_VOLUME=/path/to/data cargo x init
Environment Variables:
| Variable | Default | Description |
|---|---|---|
| POSTGRES_USER | postgres | Database user |
| POSTGRES_PASSWORD | postgres | Database password |
| POSTGRES_DB | postgres | Database name |
| POSTGRES_PORT | 5430 | Database port |
| POSTGRES_HOST | localhost | Database host |
| CLICKHOUSE_HTTP_PORT | 8123 | ClickHouse HTTP port |
| CLICKHOUSE_NATIVE_PORT | 9000 | ClickHouse native TCP port |
| CLICKHOUSE_USER | etl | ClickHouse user for the local Docker Compose setup |
| CLICKHOUSE_PASSWORD | etl | ClickHouse password for the local Docker Compose setup |
| SKIP_DOCKER | (empty) | Skip Docker Compose if set |
| POSTGRES_DATA_VOLUME | (empty) | Path for PostgreSQL persistent storage |
| CLICKHOUSE_DATA_VOLUME | (empty) | Path for ClickHouse persistent storage |
| REPLICATOR_IMAGE | ramsup/etl-replicator:latest | Default replicator image |
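The Postgres variables above combine into a connection URL. The following is a minimal sketch, assuming the documented defaults and a hypothetical helper name local_database_url; it does not URL-encode special characters in the password.

```shell
# Hypothetical helper: builds a Postgres URL from the variables that
# cargo x init reads, falling back to the documented defaults.
# Note: values are inserted verbatim, so a password containing characters
# such as "@" or "/" would need URL encoding first.
local_database_url() {
  printf 'postgres://%s:%s@%s:%s/%s\n' \
    "${POSTGRES_USER:-postgres}" \
    "${POSTGRES_PASSWORD:-postgres}" \
    "${POSTGRES_HOST:-localhost}" \
    "${POSTGRES_PORT:-5430}" \
    "${POSTGRES_DB:-postgres}"
}

# Example: export DATABASE_URL="$(local_database_url)"
local_database_url
```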
PostgreSQL 18+ containers store data under /var/lib/postgresql/<major>/data, so the Docker Compose setup mounts the parent /var/lib/postgresql directory to keep upgrades compatible.
The source PostgreSQL container started by cargo x init or cargo xtask postgres start supports TLS by default. The task runner generates a local test CA and server certificate under target/postgres-tls/, then copies the server certificate and key into the container. Local clients may still connect without TLS; set TESTS_DATABASE_TLS_ENABLED=true when running tests to require verified TLS using the generated root certificate.
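For manual connections that should verify the server certificate, the generated root certificate can be passed through standard libpq URL parameters. This is a sketch under assumptions: the helper name tls_database_url is hypothetical, the URL has no existing query string, the command runs from the project root, and the generated server certificate is valid for the host name in the URL (otherwise sslmode=verify-ca may be the better fit).

```shell
# Hypothetical helper: appends libpq TLS parameters to a connection URL.
# The root certificate path defaults to the one generated by the task runner.
tls_database_url() (
  root_cert=${TESTS_DATABASE_TLS_ROOT_CERT:-target/postgres-tls/root.crt}
  printf '%s?sslmode=verify-full&sslrootcert=%s\n' "$1" "$root_cert"
)

# Example:
#   psql "$(tls_database_url postgres://postgres:postgres@localhost:5430/postgres)" -c 'select 1'
tls_database_url postgres://postgres:postgres@localhost:5430/postgres
```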
The same Docker Compose stack also starts ClickHouse on http://localhost:8123 by default, which is enough for local destination development and ClickHouse integration tests.
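A quick way to confirm the ClickHouse HTTP endpoint is reachable is its built-in /ping route. The sketch below only builds the URL; the helper name clickhouse_ping_url is hypothetical and it assumes the service is published on localhost.

```shell
# Hypothetical helper: prints the ClickHouse health-check URL for the
# local Docker Compose setup, honoring CLICKHOUSE_HTTP_PORT if set.
clickhouse_ping_url() {
  printf 'http://localhost:%s/ping\n' "${CLICKHOUSE_HTTP_PORT:-8123}"
}

# Example (a healthy server answers "Ok."):
#   curl -fsS "$(clickhouse_ping_url)"
clickhouse_ping_url
```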
If you prefer manual setup or have an existing PostgreSQL instance:
Important: The etl-api migrations and ETL source/store migrations can run on separate databases. You might have:
- The etl-api using its own dedicated Postgres instance for the control plane
- The ETL source helpers and Postgres store tables on the database you're replicating from (source database)
- Or both on the same database (for simpler local development setups)
If using one database for both the API and ETL source/store objects:
export DATABASE_URL=postgres://USER:PASSWORD@HOST:PORT/DB
# Run all migrations on the same database
cargo x migrate
If using separate databases (recommended for production):
# API migrations on the control plane database
export DATABASE_URL=postgres://USER:PASSWORD@API_HOST:PORT/API_DB
cargo x migrate etl-api
# ETL migrations on the source database
export DATABASE_URL=postgres://USER:PASSWORD@SOURCE_HOST:PORT/SOURCE_DB
cargo x migrate etl
This separation allows you to:
- Scale the control plane independently from replication workloads
- Keep ETL source/store objects close to the source data
- Isolate concerns between infrastructure management and data replication
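When the two databases differ, it can help to print the planned commands before running anything. The following dry-run sketch uses a hypothetical helper, plan_split_migrations, that takes the control plane URL and the source URL as arguments; it only prints text and does not invoke cargo.

```shell
# Hypothetical dry-run helper: prints the migration commands for a split
# setup. $1 is the control plane database URL, $2 is the source database URL.
# Each command carries its own DATABASE_URL so the two targets cannot be
# mixed up by a stale export in the shell.
plan_split_migrations() {
  printf 'DATABASE_URL=%s cargo x migrate etl-api\n' "$1"
  printf 'DATABASE_URL=%s cargo x migrate etl\n' "$2"
}

plan_split_migrations \
  "postgres://USER:PASSWORD@API_HOST:PORT/API_DB" \
  "postgres://USER:PASSWORD@SOURCE_HOST:PORT/SOURCE_DB"
```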
The project uses SQLx for database migrations. There are two sets of migrations:
Located in crates/etl-api/migrations/, these create the control plane schema (app schema) for managing tenants, sources, destinations, and pipelines.
Running API migrations:
# From project root
cargo x migrate etl-api
# Or manually with SQLx CLI
sqlx migrate run --source crates/etl-api/migrations
Creating a new API migration:
cd crates/etl-api
sqlx migrate add <migration_name>
Resetting the API database:
cd crates/etl-api
sqlx migrate revert
Updating SQLx metadata after schema changes:
cd crates/etl-api
cargo sqlx prepare
Located under crates/etl/migrations/, these prepare the source database:
- crates/etl/migrations/source/: ETL source helpers required by every pipeline, such as schema snapshot functions and the DDL event trigger. Pipeline::start() runs these automatically.
- crates/etl/migrations/postgres_store/: Postgres-backed state store tables used to persist replication state, versioned table schemas, and destination metadata. PostgresStore::new() runs these automatically.
Both migration sets write to etl._sqlx_migrations. When running them
separately, always use SQLx's --ignore-missing flag so each migrator validates
its own versions while ignoring versions owned by the other set.
Do not edit an already-applied migration file, including comments or whitespace. SQLx stores a SHA-384 checksum of the full migration contents, so even comment-only changes will break existing databases with a checksum mismatch.
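To see why even a comment changes the stored checksum, the sketch below hashes two throwaway files that differ only by a trailing SQL comment. It illustrates the general property of SHA-384 over file contents and is not a reproduction of how SQLx reads or stores the value; the helper name sha384_of and the file names are hypothetical.

```shell
# Hypothetical helper: prints the SHA-384 hex digest of a file using
# whichever common tool is available (GNU coreutils or Perl's shasum).
sha384_of() {
  if command -v sha384sum >/dev/null 2>&1; then
    sha384sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 384 "$1" | cut -d ' ' -f 1
  fi
}

# Two throwaway files that differ only by a trailing SQL comment.
tmp_dir=$(mktemp -d)
printf 'create table t (id int);\n' > "$tmp_dir/original.sql"
printf 'create table t (id int); -- note\n' > "$tmp_dir/edited.sql"

original_sum=$(sha384_of "$tmp_dir/original.sql")
edited_sum=$(sha384_of "$tmp_dir/edited.sql")
rm -rf "$tmp_dir"

if [ "$original_sum" != "$edited_sum" ]; then
  echo "checksums differ"
fi
```

If a change to an applied migration is unavoidable, adding a new migration file keeps existing databases valid.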
Running ETL migrations manually:
# From project root
cargo x migrate etl
# Or manually with SQLx CLI (requires setting search_path)
psql $DATABASE_URL -c "create schema if not exists etl;"
sqlx migrate run --source crates/etl/migrations/postgres_store --database-url "${DATABASE_URL}?options=-csearch_path%3Detl" --ignore-missing
sqlx migrate run --source crates/etl/migrations/source --database-url "${DATABASE_URL}?options=-csearch_path%3Detl" --ignore-missing
Reverting ETL migrations manually:
# Revert source migrations.
sqlx migrate revert --source crates/etl/migrations/source --database-url "${DATABASE_URL}?options=-csearch_path%3Detl" --ignore-missing
# Revert Postgres store migrations.
sqlx migrate revert --source crates/etl/migrations/postgres_store --database-url "${DATABASE_URL}?options=-csearch_path%3Detl" --ignore-missing
Use --target-version 0 to revert every migration in one migration set. Revert
source and Postgres store migrations separately because ordering is scoped to
the selected migration folder.
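The manual commands above append ?options=... directly to DATABASE_URL, which only works when the URL has no query string yet. The sketch below, with a hypothetical helper named with_search_path, picks the right separator so a URL that already carries parameters (for example sslmode) stays valid.

```shell
# Hypothetical helper: appends the search_path option to a Postgres URL.
# $1 is the URL, $2 is the schema. Uses "&" when the URL already has a
# query string and "?" otherwise. "%3D" is the URL-encoded "=".
with_search_path() (
  case $1 in
    *\?*) sep='&' ;;
    *)    sep='?' ;;
  esac
  printf '%s%soptions=-csearch_path%%3D%s\n' "$1" "$sep" "$2"
)

# Example:
#   sqlx migrate run --source crates/etl/migrations/source \
#     --database-url "$(with_search_path "$DATABASE_URL" etl)" --ignore-missing
with_search_path "postgres://USER:PASSWORD@HOST:5430/DB" etl
```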
Important: Migrations are run automatically at the appropriate runtime
boundary: source migrations when a pipeline starts, and Postgres store
migrations when the Postgres-backed state store is initialized. However, if you
integrate the etl crate directly into your own application and want to prepare
the source database ahead of time, you can also run these migrations manually.
This design decision ensures:
- The standalone replicator binary works out-of-the-box
- Library users have explicit control over when migrations run
- CI/CD pipelines can pre-apply migrations independently
When to run migrations manually:
- Integrating etl as a library in your own application
- Pre-creating the replication state store schema before deployment
- Testing migrations independently
- CI/CD pipelines that separate migration and deployment steps
Creating a new Postgres state store migration:
cd crates/etl
sqlx migrate add -r --source migrations/postgres_store <migration_name>
Creating a new ETL source migration:
cd crates/etl
sqlx migrate add -r --source migrations/source <migration_name>
Both etl-api and etl-replicator binaries use hierarchical configuration loading from the configuration/ directory within each crate. Configuration is loaded in this order:
- Base configuration: configuration/base.yaml (always loaded)
- Environment-specific: configuration/{environment}.yaml (e.g., dev.yaml, prod.yaml)
- Environment variable overrides: Prefixed with APP_ (e.g., APP_DATABASE__URL)
Environment Selection:
The environment is determined by the APP_ENVIRONMENT variable:
- Default: prod (if APP_ENVIRONMENT is not set)
- Available: dev, staging, prod
# Run with dev environment
APP_ENVIRONMENT=dev cargo run
# Run with production environment (default)
cargo run
# Override specific config values
APP_ENVIRONMENT=dev APP_DATABASE__URL=postgres://localhost/mydb cargo run
cd crates/etl-api
APP_ENVIRONMENT=dev cargo run
The API loads configuration from crates/etl-api/configuration/{environment}.yaml. See crates/etl-api/README.md for available configuration options.
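The load order and the override naming can be sketched as two small helpers. Both names (config_files, env_to_config_key) are hypothetical. The key mapping is an assumption inferred from the APP_DATABASE__URL example: strip the APP_ prefix, treat a double underscore as a nesting separator, and lowercase the rest. The actual loader may apply different rules.

```shell
# Hypothetical helper: lists the files loaded for a crate directory ($1),
# in load order. Later entries override earlier ones.
config_files() {
  printf '%s/configuration/base.yaml\n' "$1"
  printf '%s/configuration/%s.yaml\n' "$1" "${APP_ENVIRONMENT:-prod}"
}

# Hypothetical helper: maps an override variable name to the config key it
# is assumed to target (APP_ prefix removed, "__" as nesting, lowercase).
env_to_config_key() {
  printf '%s\n' "$1" | sed -e 's/^APP_//' -e 's/__/./g' | tr '[:upper:]' '[:lower:]'
}

(APP_ENVIRONMENT=dev; config_files crates/etl-api)
env_to_config_key APP_DATABASE__URL
```

Running the block prints the two configuration paths for the dev environment followed by database.url.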
Docker images are available for the etl-api. You must mount the configuration files and can override settings via environment variables:
docker run \
-v $(pwd)/crates/etl-api/configuration/base.yaml:/app/configuration/base.yaml \
-v $(pwd)/crates/etl-api/configuration/dev.yaml:/app/configuration/dev.yaml \
-e APP_ENVIRONMENT=dev \
-p 8080:8080 \
ramsup/etl-api:latest
Configuration requirements:
- Mount both base.yaml and your environment-specific config file (e.g., dev.yaml)
- Set APP_ENVIRONMENT to match your mounted environment file
- Override specific values using APP_ prefixed environment variables
The etl-api manages replicator deployments on Kubernetes by dynamically creating StatefulSets, Secrets, and ConfigMaps. The etl-api requires Kubernetes, but the etl-replicator binary can run independently without any Kubernetes setup.
Prerequisites:
- OrbStack with Kubernetes enabled (or another local Kubernetes cluster)
- kubectl configured with the orbstack context
- Pre-defined Kubernetes resources (see below)
Required Pre-Defined Resources:
The etl-api expects these resources to exist before it can deploy replicators:
- Namespace: etl-data-plane - Where all replicator pods and related resources are created
- ConfigMap: trusted-root-certs-config - Provides trusted root certificates for TLS connections
These are defined in scripts/ and should be applied before running the API:
kubectl --context orbstack apply -f scripts/etl-data-plane.yaml
kubectl --context orbstack apply -f scripts/trusted-root-certs-config.yaml
Note: For the complete list of expected Kubernetes resources and their specifications, refer to the constants and resource creation logic in crates/etl-api/src/k8s/http.rs.
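After applying the manifests, a quick check can confirm both resources exist. This is a sketch with a hypothetical helper, check_k8s_prereqs. It assumes the ConfigMap lives in the etl-data-plane namespace, which the text above does not state explicitly, so adjust the namespace if the manifest says otherwise. Setting KUBECTL=echo turns it into a dry run that prints the arguments instead of contacting a cluster.

```shell
# Hypothetical helper: verifies the pre-defined resources exist.
# KUBECTL overrides the binary (for example KUBECTL=echo for a dry run).
# Assumption: the ConfigMap is created in the etl-data-plane namespace.
check_k8s_prereqs() (
  kubectl_bin=${KUBECTL:-kubectl}
  "$kubectl_bin" --context orbstack get namespace etl-data-plane &&
    "$kubectl_bin" --context orbstack get configmap trusted-root-certs-config \
      --namespace etl-data-plane
)

# Dry run: prints the two kubectl argument lists without running kubectl.
(KUBECTL=echo; check_k8s_prereqs)
```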
The replicator can run as a standalone binary without Kubernetes.
cd crates/etl-replicator
APP_ENVIRONMENT=dev cargo run
The replicator loads configuration from crates/etl-replicator/configuration/{environment}.yaml.
Docker images are available for the etl-replicator. You must mount the configuration files and can override settings via environment variables:
docker run \
-v $(pwd)/crates/etl-replicator/configuration/base.yaml:/app/configuration/base.yaml \
-v $(pwd)/crates/etl-replicator/configuration/dev.yaml:/app/configuration/dev.yaml \
-e APP_ENVIRONMENT=dev \
etl-replicator:latest
Configuration requirements:
- Mount both base.yaml and your environment-specific config file (e.g., dev.yaml)
- Set APP_ENVIRONMENT to match your mounted environment file
- Override specific values using APP_ prefixed environment variables
Note: While the replicator is typically deployed as a Kubernetes pod managed by the etl-api, it does not require Kubernetes to function. You can run it as a standalone process on any machine with the appropriate configuration.
The project includes comprehensive test suites that require a PostgreSQL database. Tests use environment variables for database configuration to ensure isolation and reproducibility.
All tests that interact with PostgreSQL require the following environment variables to be set:
| Variable | Required | Description |
|---|---|---|
| TESTS_DATABASE_HOST | Yes | PostgreSQL server hostname (e.g., localhost) |
| TESTS_DATABASE_PORT | Yes | PostgreSQL server port (e.g., 5430) |
| TESTS_DATABASE_USERNAME | Yes | Database user (e.g., postgres) |
| TESTS_DATABASE_PASSWORD | No | Database password (optional) |
| TESTS_DATABASE_TLS_ENABLED | No | Require verified TLS for Postgres test clients when set to true |
| TESTS_DATABASE_TLS_ROOT_CERT | No | Path to the trusted root certificate; defaults to target/postgres-tls/root.crt |
Note: Each test creates a unique database with a UUID-based name to ensure test isolation. The test databases are automatically cleaned up after tests complete.
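Because missing variables only surface once a test tries to connect, a small preflight check can fail earlier with a clearer message. The helper name check_test_env is hypothetical; it covers only the three variables marked as required above.

```shell
# Hypothetical helper: reports required test variables that are unset or
# empty. Returns 0 when all are present and 1 otherwise.
check_test_env() (
  missing=""
  for name in TESTS_DATABASE_HOST TESTS_DATABASE_PORT TESTS_DATABASE_USERNAME; do
    # Indirect lookup of the variable named in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing required variables:$missing" >&2
    return 1
  fi
)

# Example: check_test_env && cargo test -p etl-api
```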
BigQuery destination tests require Google Cloud credentials:
| Variable | Required | Description |
|---|---|---|
| TESTS_BIGQUERY_PROJECT_ID | Yes | GCP project ID for BigQuery |
| TESTS_BIGQUERY_SA_KEY_PATH | Yes | Path to service account JSON key file |
Note: BigQuery tests are only run when the bigquery and test-utils features are enabled. Each test creates a unique dataset with a UUID-based name for isolation.
Iceberg destination tests use local MinIO and Lakekeeper instances. The following services must be running:
- Lakekeeper: http://localhost:8182 (REST catalog)
- MinIO: http://localhost:9010 (S3-compatible storage)
  - Username: minio-admin
  - Password: minio-admin-password
Note: Iceberg tests are only run when the iceberg and test-utils features are enabled. These use hardcoded local URLs and do not require environment variables.
ClickHouse destination tests require a reachable ClickHouse HTTP endpoint:
| Variable | Required | Description |
|---|---|---|
| TESTS_CLICKHOUSE_URL | Yes | ClickHouse HTTP URL (for example, http://localhost:8123) |
| TESTS_CLICKHOUSE_USER | Yes | ClickHouse user name (for the local Docker Compose setup, use etl) |
| TESTS_CLICKHOUSE_PASSWORD | No | ClickHouse password; for the local Docker Compose setup, use etl |
Note: ClickHouse tests are only run when the clickhouse and test-utils features are enabled. Each test creates a unique database in ClickHouse and drops it automatically when the test finishes. The Docker Compose setup started by cargo x init is sufficient for these tests.
| Variable | Description |
|---|---|
| ENABLE_TRACING=1 | Enable tracing output during test execution (useful for debugging) |
| RUST_LOG | Control log level (e.g., debug, info, warn, error) |
Example:
# Run tests with debug output
ENABLE_TRACING=1 RUST_LOG=debug cargo test test_name -- --nocapture
The most reliable way is to set environment variables directly in the test command:
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api
Export variables in your current shell session, then run tests:
# PostgreSQL test configuration
export TESTS_DATABASE_HOST=localhost
export TESTS_DATABASE_PORT=5430
export TESTS_DATABASE_USERNAME=postgres
export TESTS_DATABASE_PASSWORD=postgres
# Optional when using the local Docker Compose Postgres from cargo x init.
export TESTS_DATABASE_TLS_ENABLED=true
# BigQuery test configuration (optional - only needed for BigQuery tests)
export TESTS_BIGQUERY_PROJECT_ID=your-gcp-project-id
export TESTS_BIGQUERY_SA_KEY_PATH=/path/to/service-account-key.json
# ClickHouse test configuration (optional - only needed for ClickHouse tests)
export TESTS_CLICKHOUSE_URL=http://localhost:8123
export TESTS_CLICKHOUSE_USER=etl
export TESTS_CLICKHOUSE_PASSWORD=etl
# Enable test output (optional)
export ENABLE_TRACING=1
export RUST_LOG=info
# Now run tests
cargo test -p etl-api
Create a .env.test file and source it:
# .env.test
# PostgreSQL (required for most tests)
TESTS_DATABASE_HOST=localhost
TESTS_DATABASE_PORT=5430
TESTS_DATABASE_USERNAME=postgres
TESTS_DATABASE_PASSWORD=postgres
# BigQuery (optional - only for BigQuery tests)
TESTS_BIGQUERY_PROJECT_ID=your-gcp-project-id
TESTS_BIGQUERY_SA_KEY_PATH=/path/to/service-account-key.json
# ClickHouse (optional - only for ClickHouse tests)
TESTS_CLICKHOUSE_URL=http://localhost:8123
TESTS_CLICKHOUSE_USER=etl
TESTS_CLICKHOUSE_PASSWORD=etl
# Test output (optional)
ENABLE_TRACING=1
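RUST_LOG=info
# Source the file with auto-export enabled so cargo test inherits the variables
set -a
source .env.test
set +a
cargo test -p etl-api
Important: Environment variables must be set in the same command as cargo test, or exported in your current shell session before running tests. Plain KEY=VALUE lines in a sourced file are not exported on their own, which is why the example wraps the source step in set -a and set +a.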
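The difference between a variable that is merely set and one that is exported can be checked directly. The sketch below writes a throwaway env file, sources it once without and once with set -a, and asks a child process what it sees each time.

```shell
# Throwaway env file containing a plain KEY=VALUE line, as in .env.test.
env_file=$(mktemp)
printf 'TESTS_DATABASE_PORT=5430\n' > "$env_file"

# Without auto-export: the variable exists in this shell only.
unset TESTS_DATABASE_PORT
. "$env_file"
without_export=$(sh -c 'printf "%s" "${TESTS_DATABASE_PORT:-unset}"')

# With auto-export: every assignment made while set -a is active is
# exported, so child processes such as cargo test inherit it.
set -a
. "$env_file"
set +a
with_export=$(sh -c 'printf "%s" "${TESTS_DATABASE_PORT:-unset}"')

rm -f "$env_file"
echo "without set -a: $without_export, with set -a: $with_export"
```

The final line prints "without set -a: unset, with set -a: 5430".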
# Run all tests (requires env variables)
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test
# Run tests for a specific package
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api
# Run tests for packages with test-utils feature (etl, etl-postgres, etl-destinations)
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl --features test-utils
# Run a specific test
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api --test tenants tenant_can_be_created
# Run tests with tracing output for debugging
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres ENABLE_TRACING=1 RUST_LOG=info cargo test -p etl-api --test tenants tenant_can_be_created -- --nocapture
# Run the ClickHouse destination integration test against the local Docker Compose service
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres TESTS_CLICKHOUSE_URL=http://localhost:8123 TESTS_CLICKHOUSE_USER=etl TESTS_CLICKHOUSE_PASSWORD=etl cargo test -p etl-destinations --features clickhouse,test-utils clickhouse_pipeline -- --nocapture
Packages requiring --features test-utils:
- etl
- etl-postgres
- etl-destinations
Packages that don't require feature flags:
- etl-api
- etl-config
- etl-telemetry
- etl-replicator
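The package split above can be captured in a small wrapper so the feature flag is not forgotten. The helper name cargo_test_args is hypothetical, and destination-specific features such as clickhouse or bigquery still need to be added by hand.

```shell
# Hypothetical helper: prints the cargo test arguments for a package ($1),
# adding --features test-utils only for the packages that need it.
cargo_test_args() {
  case $1 in
    etl|etl-postgres|etl-destinations)
      printf '%s %s --features test-utils\n' -p "$1" ;;
    *)
      printf '%s %s\n' -p "$1" ;;
  esac
}

# Example (left unquoted on purpose so the arguments split into words):
#   cargo test $(cargo_test_args etl)
cargo_test_args etl
cargo_test_args etl-api
```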
Note: Ensure PostgreSQL is running and accessible at the configured host and port before running tests. The test suite will fail if it cannot connect to the database or if the required environment variables are not set.
If you encounter connection issues:
- Verify PostgreSQL is running:
  docker-compose -f scripts/docker-compose.yaml ps
- Check the connection:
  psql $DATABASE_URL -c "SELECT 1;"
- Ensure the correct port is used (default: 5430)
If migrations fail:
- Check if the database exists:
  psql $DATABASE_URL -c "\l"
- Verify SQLx CLI is installed:
  sqlx --version
- Check migration history (for the ETL migration sets, query etl._sqlx_migrations instead):
  psql $DATABASE_URL -c "SELECT * FROM _sqlx_migrations;"
If Kubernetes resources aren't deploying:
- Verify context:
  kubectl config current-context
- Check cluster status:
  kubectl cluster-info
- View events:
  kubectl get events -n etl-data-plane --sort-by='.lastTimestamp'