A comprehensive End-to-End Data Engineering Project demonstrating a scalable, real-time Change Data Capture (CDC) pipeline.
This system simulates a high-velocity retail environment in Italy, capturing transactions from sharded operational databases (PostgreSQL & MySQL), streaming them via Kafka, processing aggregations using Apache Spark Structured Streaming, and serving real-time analytics via Redis.
- Data Generation Layer: A custom Python script (using Faker) generates realistic Italian retail transactions and inserts them directly into operational databases.
- Operational Layer: PostgreSQL and MySQL act as the source-of-truth databases, simulating a sharded environment.
- Ingestion Layer (CDC): Debezium connectors monitor the database logs (WAL & Binlog) and stream row-level changes to Kafka.
- Messaging Layer: Apache Kafka decouples the ingestion from processing, buffering events in specific topics.
- Processing Layer: Apache Spark Structured Streaming consumes the streams, parses complex JSON payloads, and performs stateful aggregations (Sales per Clerk) in real-time.
- Serving Layer: Redis stores the aggregated metrics for low-latency access (Speed Layer).
- Visualization Layer: Grafana (connected to Redis) and Kafka UI provide monitoring dashboards.
├── build/
│ ├── generator/
│ │ ├── Dockerfile # Container setup for Python Data Generator
│ │ └── generateItems.py # Logic for generating Italian retail data
│ └── spark/
│ ├── Dockerfile # Container setup for Spark
│ └── redisSink.py # Spark Structured Streaming Job & Redis logic
├── docker-compose.yaml # Orchestration for 11 microservices
├── start-connectors.sh # Script to initialize Debezium connectors via API
├── .env.example # Template for environment variables (Safe to share)
├── .env # Secrets & Credentials (Ignored by Git)
└── README.md # Project Documentation
| Component | Technology | Description |
|---|---|---|
| Orchestration | Docker Compose | Manages the lifecycle of the entire stack (Network, Volumes). |
| Data Gen | Python 3.9 | Simulates realistic transactions using the Faker library. |
| Databases | Postgres & MySQL | Simulates a polyglot persistence layer (OLTP). |
| CDC | Debezium | Captures database changes without polling. |
| Broker | Kafka & Zookeeper | Handles high-throughput event streaming. |
| Processing | Spark Streaming | Performs stateful aggregations (Count/Sum). |
| Storage | Redis | In-memory NoSQL store for real-time dashboards. |
| Monitoring | Kafka UI | Web UI for managing Kafka clusters and topics. |
Follow these instructions to deploy the pipeline on your local machine or server.
- Docker Engine & Docker Compose installed.
- 16GB RAM recommended for optimal performance (minimum 8GB).
Copy the example environment file to create your local secrets file. This file contains database passwords and configuration.
cp .env.example .env
Note: You can modify
.envto change passwords or ports if necessary.
Build the custom Docker images and start the services in detached mode:
docker compose up --build -d
Wait approximately 1-2 minutes for all containers (especially Kafka and Connect) to fully initialize.
Once the containers are running, register the Debezium connectors to start the CDC process:
./start-connectors.sh
Expected Output: HTTP/1.1 201 Created
Check if the Spark job is successfully writing aggregated data to Redis:
docker exec -it redis redis-cli keys "*"
Sample Output:
Plaintext
1) "Mysql:Alessandro Del Piero"
2) "Postgresql:Giulia Bianchi"
3) "Postgresql:Francesco Totti"
...
Ensure the Python script is inserting data correctly:
docker logs -f generator
- Kafka UI:
http://localhost:9090- Monitor Topics & Consumers. - Grafana:
http://localhost:3000- Visualization (Login with credentials from.env).
Contributions are welcome! Please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License.
