This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage from data ingestion to processing and finally to storage, utilizing a robust tech stack that includes Python, Apache Kafka, Apache Zookeeper, and Apache Spark. Everything is containerized using Docker for ease of deployment and scalability.
The project is designed with the following components:
- Data Source: We use the OpenWeather API to fetch weather data for our pipeline (see the producer sketch below this list).
- Apache Kafka and Zookeeper: Used for streaming data to the processing engine.
- Control Center and Schema Registry: Help with monitoring and schema management of our Kafka streams.
- Apache Spark: For data processing with its master and worker nodes.
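As a rough sketch of the ingestion step (not the exact code in the notebook), the snippet below polls the OpenWeather current-weather endpoint and publishes each response to a Kafka topic using the kafka-python client. The topic name `weather`, the broker address `localhost:9092`, and the city list are illustrative assumptions; the actual values live in the project's notebooks.

```python
import json
import os
import time

import requests
from kafka import KafkaProducer  # pip install kafka-python

API_KEY = os.environ["OPENWEATHER_API_KEY"]  # your OpenWeather API key
URL = "https://api.openweathermap.org/data/2.5/weather"
CITIES = ["London", "Hanoi", "New York"]     # assumed city list for illustration

# Assumed broker address and topic name; match these to docker-compose.yml.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    for city in CITIES:
        # Fetch the current weather for one city in metric units.
        resp = requests.get(URL, params={"q": city, "appid": API_KEY, "units": "metric"})
        resp.raise_for_status()
        producer.send("weather", resp.json())
    producer.flush()   # make sure the batch reaches the broker
    time.sleep(60)     # poll once a minute
```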
Technologies used:

- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Docker
- Run Docker Compose to spin up the services:

  ```bash
  docker-compose up
  ```
- Run the data pipeline via the Jupyter notebook `OpenWeatherAPI_Kafka.ipynb`.
- Two alternative data pipelines, built with Spark and Pandas, are provided in `OpenWeatherAPI_Spark.ipynb` and `OpenWeatherAPI_Pandas.ipynb`; a Spark consumer sketch follows this list.
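For orientation, here is a minimal sketch of what the Spark side of the pipeline might look like: a Structured Streaming job that subscribes to the Kafka topic and parses the JSON payload. The topic name `weather`, the broker address, the connector package version, and the two example fields are assumptions; the actual schema and processing logic are in `OpenWeatherAPI_Spark.ipynb`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# The Kafka connector is pulled in via the spark-sql-kafka package;
# the version below assumes Spark 3.x with Scala 2.12.
spark = (
    SparkSession.builder
    .appName("OpenWeatherKafkaConsumer")
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
    .getOrCreate()
)

# Assumed subset of the OpenWeather payload; extend to match the real JSON.
schema = StructType([
    StructField("name", StringType()),              # city name
    StructField("main", StructType([
        StructField("temp", DoubleType()),          # temperature
        StructField("humidity", DoubleType()),      # humidity %
    ])),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "weather")                       # assumed topic
    .load()
)

# Kafka values arrive as bytes; decode and flatten the JSON.
weather = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("w"))
       .select("w.name", "w.main.temp", "w.main.humidity")
)

# Write the parsed stream to the console for a quick smoke test.
query = weather.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```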