A comprehensive Infrastructure as Code (IaC) template for managing Airbyte data pipelines using Terraform. This template provides a scalable, multi-environment data ingestion solution with support for various data sources and destinations.
- Overview
- Architecture
- Features
- Prerequisites
- Quick Start
- Configuration
- Data Sources
- Data Destinations
- Environment Management
- Usage Examples
- Troubleshooting
- Contributing
This template automates the deployment and management of Airbyte data ingestion pipelines using Terraform. It supports multiple data sources, destinations, and environments, making it ideal for organizations looking to implement scalable data integration solutions.
- Infrastructure as Code: Version-controlled, reproducible infrastructure
- Multi-Environment Support: Separate dev, staging, and production environments
- Modular Design: Reusable modules for sources, destinations, and connections
- Scalable: Easily add new data sources and destinations
- Secure: Proper secret management and authentication
```mermaid
graph TB
    subgraph "Data Sources"
        S3C[S3: Comeet Data]
        S3D[S3: Tikal Datalake]
    end
    subgraph "Airbyte Platform"
        AB[Airbyte Server]
        WS1[Dev Workspace]
        WS2[Stage Workspace]
        WS3[Prod Workspace]
    end
    subgraph "Data Destinations"
        BQ[BigQuery]
    end
    subgraph "Infrastructure"
        TF[Terraform]
        K8S[Kubernetes Backend - Playground]
    end
    S3C --> AB
    S3D --> AB
    AB --> WS1
    AB --> WS2
    AB --> WS3
    WS1 --> BQ
    WS2 --> BQ
    WS3 --> BQ
    TF --> K8S
    TF --> AB
```
| Component | Purpose | Technology |
|---|---|---|
| Sources | Data ingestion from various systems | S3 (AWS) |
| Destinations | Data storage and processing | BigQuery |
| Orchestration | Pipeline management | Airbyte |
| Infrastructure | Resource provisioning | Terraform + Kubernetes |
| State Management | Infrastructure state | Kubernetes Secret Backend |
- AWS S3: Comeet recruiting data (CSV format) and tikal-datalake documents (unstructured format)
- BigQuery: Data warehouse for analytics and reporting
- Multi-environment workspace management
- Kubernetes-based state backend
- Modular Terraform architecture
- Automated resource naming and tagging
- Secret management integration
- Terraform >= 1.0
- kubectl configured with cluster access
- Access to Airbyte server instance
- Cloud provider credentials (GCP, AWS)
- Kubernetes cluster with "playground" context configured
- Airbyte server with API access
- Google Cloud Platform project with BigQuery API enabled
- AWS account with S3 access for Comeet and tikal-datalake buckets
Clone the repository and create your variables file:

```bash
git clone <repository-url>
cd cne-airbyte-template
cp terraform.tfvars.example terraform.tfvars
```

Edit `terraform.tfvars` with your configuration:

```hcl
WORKSPACE_ID         = "your-airbyte-workspace-id"
USERNAME             = "your-airbyte-username"
PASSWORD             = "your-airbyte-password"
SERVER_URL           = "https://your-airbyte-server.com"
SERVICE_ACCOUNT_INFO = "your-gcp-service-account-json"
BIGQUERY_PROJECT_ID  = "your-bigquery-project-id"
# ... additional variables
```
Initialize Terraform with the Kubernetes state backend:

```bash
terraform init -backend-config=backend-config/config.k8s.tfbackend
```
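The contents of `backend-config/config.k8s.tfbackend` are not shown here; as a rough sketch (the `secret_suffix` and `namespace` values are illustrative assumptions), a Kubernetes backend configuration typically looks like:

```hcl
# backend-config/config.k8s.tfbackend (sketch) — adjust the values to your cluster
secret_suffix  = "cne-airbyte-template"   # suffix of the Secret that stores the state
namespace      = "default"                # namespace where the state Secret lives
config_path    = "~/.kube/config"         # kubeconfig containing the "playground" context
config_context = "playground"
```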
Select the environment workspace:

```bash
# For development
terraform workspace select cne-airbyte-template-dev

# For staging
terraform workspace select cne-airbyte-template-stage

# For production
terraform workspace select cne-airbyte-template-prod
```

Review and apply the changes:

```bash
terraform plan -var-file=terraform.tfvars
terraform apply -var-file=terraform.tfvars
```
| Variable | Description | Required |
|---|---|---|
| `WORKSPACE_ID` | Airbyte workspace identifier | ✅ |
| `USERNAME` | Airbyte username | ✅ |
| `PASSWORD` | Airbyte password | ✅ |
| `SERVER_URL` | Airbyte server URL | ✅ |
| `SERVICE_ACCOUNT_INFO` | GCP service account JSON | ✅ |
| `BIGQUERY_PROJECT_ID` | BigQuery project ID | ✅ |
| `AWS_ACCESS_KEY_ID` | AWS access key | ✅ |
| `AWS_SECRET_ACCESS_KEY` | AWS secret key | ✅ |
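These map to Terraform input variables; a minimal sketch of how the credential variables might be declared (the template's actual `variables.tf` may differ):

```hcl
# Sketch only — marking credentials as sensitive keeps them out of plan output
variable "PASSWORD" {
  type      = string
  sensitive = true
}

variable "AWS_SECRET_ACCESS_KEY" {
  type      = string
  sensitive = true
}
```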
The template uses Terraform workspaces to manage environments:
```bash
# Create new workspace
terraform workspace new cne-airbyte-template-dev

# List workspaces
terraform workspace list

# Switch workspace
terraform workspace select cne-airbyte-template-prod
```
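The workspace name carries the environment suffix, which the configuration can reuse for environment-specific naming. One common pattern (an assumption, not necessarily this template's exact implementation) is:

```hcl
locals {
  # Assumed pattern: take the trailing segment of the workspace name,
  # e.g. "cne-airbyte-template-dev" -> "dev", and reuse it in resource names.
  environment = regex("[^-]+$", terraform.workspace)
}
```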
Currently configured S3 sources:
- `comeet_all_candidate`: Candidate information from the Comeet recruiting platform with a detailed CSV schema
- `comeet_all_candidate_steps`: Candidate workflow steps and process data
- `tikal-datalake-dev`: Unstructured documents (see the stream sketch below), including:
  - Employee markdown files from the GitLab sync
  - Engineering playbooks in HTML format
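A stream entry for an unstructured source such as `tikal-datalake-dev` might look roughly like the following; the stream name, glob paths, and format block are assumptions, since the real definitions live in the template's `locals.tf`:

```hcl
# Hypothetical stream entry for an unstructured document source — values are illustrative
{
  name              = "employee_docs"
  globs             = ["employees/**/*.md", "playbooks/**/*.html"]
  schemaless        = true
  validation_policy = "Emit Record"
  format = {
    # The unstructured/document format block goes here; the exact key depends on the
    # Airbyte provider schema, so it is omitted in this sketch.
  }
}
```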
Primary data warehouse with:
- Custom dataset organization using namespace formatting from `source_table_names.json`
- Environment-specific dataset naming
- Standard insert loading method with a 15MB buffer size
- Located in the `europe-central2` region
- Supports both structured CSV and unstructured document data
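Taken together, the destination module call presumably looks something like the sketch below; the module path and input names are assumptions based on the conventions in this README, not the template's exact interface:

```hcl
# Hypothetical wiring of the BigQuery destination module — input names are illustrative
module "bigquery_destination" {
  source = "./destinations/bigquery"

  project_id           = var.BIGQUERY_PROJECT_ID
  service_account_info = var.SERVICE_ACCOUNT_INFO
  dataset_location     = "europe-central2"
  workspace_id         = local.workspace_id
}
```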
Each environment has its own:
- Terraform workspace (`cne-airbyte-template-{env}`)
- Airbyte workspace
- BigQuery datasets
- Resource naming conventions

Connections are defined per environment in `locals.{env}.tf`:
```hcl
# Example connection configuration
"comeet_all_candidate → ${local.BIGQUERY_NAME_DEV}" = {
  source_id      = module.s3_source["comeet_all_candidate"].source_id
  destination_id = module.bigquery_destination.destination_id
  status         = "active"

  non_breaking_schema_updates_behavior = "ignore"
  namespace_definition                 = "custom_format"
  namespace_format                     = local.namespace_formats["sources_comeet"]

  schedule = {
    schedule_type   = "manual"
    cron_expression = ""
  }

  streams = [
    {
      sync_mode = "full_refresh_overwrite"
      name      = "all_candidates"
      selected  = true
    }
  ]
}
```
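The example above syncs manually; for automated syncs, the same `schedule` block can carry a cron schedule instead. The expression below is illustrative and uses the Quartz-style format Airbyte expects:

```hcl
schedule = {
  schedule_type   = "cron"
  cron_expression = "0 0 2 * * ?" # every day at 02:00 (illustrative)
}
```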
- Add the configuration to `locals.tf`:

```hcl
"new_s3_source" = {
  configuration = {
    aws_access_key_id     = var.AWS_ACCESS_KEY_ID
    aws_secret_access_key = var.AWS_SECRET_ACCESS_KEY
    bucket                = "your-s3-bucket"

    streams = [
      {
        name                            = "your_data"
        days_to_sync_if_history_is_full = 3
        schemaless                      = false
        globs                           = ["path/to/your/files/*.csv"]
        input_schema                    = "{\"field1\": \"string\", \"field2\": \"number\"}"
        validation_policy               = "Emit Record"
        format = {
          "csv_format" = {
            # CSV format configuration
          }
        }
      }
    ]
  }
  workspace_id = local.workspace_id
}
```
- Add the connection in `locals.dev.tf` (and the other environment files):

```hcl
"new_s3_source → ${local.BIGQUERY_NAME_DEV}" = {
  source_id      = module.s3_source["new_s3_source"].source_id
  destination_id = module.bigquery_destination.destination_id
  # ... other configuration
}
```
- Add the namespace format in `source_table_names.json`:

```json
{
  "sources_new_s3": "sources_new_s3"
}
```
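How this file feeds the `namespace_format` values is not spelled out above; a plausible wiring (an assumption about the template's internals) would be:

```hcl
locals {
  # Assumed wiring: read the mapping file into the local referenced as
  # namespace_format = local.namespace_formats["..."] in the connection definitions.
  namespace_formats = jsondecode(file("${path.module}/source_table_names.json"))
}
```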
Individual components can be targeted during an apply:

```bash
# Deploy only the BigQuery destination
terraform apply -target=module.bigquery_destination

# Deploy a specific S3 source (quoted so the shell passes the brackets and quotes through)
terraform apply -target='module.s3_source["comeet_all_candidate"]'

# Deploy all connections
terraform apply -target=module.connections
```
```bash
# View state
terraform state list

# Import existing resource
terraform import module.bigquery_destination.airbyte_destination_bigquery.destination <resource-id>

# Remove resource from state
terraform state rm module.old_source.airbyte_source_s3.source
```
Error: `failed to create Airbyte client: authentication failed`

Solution: Verify `USERNAME`, `PASSWORD`, and `SERVER_URL` in `terraform.tfvars`.

Error: `workspace "cne-airbyte-template-dev" does not exist`

Solution: Create the workspace first: `terraform workspace new cne-airbyte-template-dev`

Error: `googleapi: Error 403: Access Denied`

Solution: Ensure the service account has the BigQuery Data Editor and BigQuery Job User roles.

Error: `state lock acquired by another process`

Solution: Force-unlock the state (use with caution): `terraform force-unlock <lock-id>`
Enable detailed logging:

```bash
export TF_LOG=DEBUG
terraform plan -var-file=terraform.tfvars
```

Validate the configuration before applying:

```bash
terraform validate
terraform fmt -check
```
- Fork the repository
- Create a feature branch: `git checkout -b feature/new-source`
- Make changes: Add new modules or modify existing ones
- Test changes: Validate with `terraform plan`
- Submit a pull request
- Use consistent naming conventions
- Add appropriate comments and documentation
- Follow Terraform best practices
- Test in development environment first
When adding new source or destination modules:
- Create a module directory: `sources/new-source/` or `destinations/new-destination/`
- Add `main.tf`, `variables.tf`, and `outputs.tf` (a minimal skeleton is sketched below)
- Update the root `main.tf` to include the new module
- Add environment-specific configurations in `locals.{env}.tf`
- Update documentation
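As a rough skeleton for the new-module steps above: the resource and attribute names follow how the root module references them elsewhere in this README (`airbyte_source_s3.source`, `source_id`), but treat them as assumptions rather than the template's exact code:

```hcl
# sources/new-source/variables.tf (sketch)
variable "name" {
  type = string
}

variable "workspace_id" {
  type = string
}

variable "configuration" {
  type = any
}

# sources/new-source/main.tf (sketch) — argument names are assumptions
resource "airbyte_source_s3" "source" {
  name          = var.name
  workspace_id  = var.workspace_id
  configuration = var.configuration
}

# sources/new-source/outputs.tf (sketch)
output "source_id" {
  # Assumes the provider exposes source_id on the resource, as the root module's
  # module.s3_source["..."].source_id references suggest.
  value = airbyte_source_s3.source.source_id
}
```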
This project is licensed under the MIT License - see the LICENSE file for details.
For support and questions:
- Create an issue in this repository
- Check the CLAUDE.md file for technical details
- Refer to Airbyte documentation
- Consult Terraform documentation
Happy Data Engineering! 🎉