
DataPipeline-Kafka-OpenWeatherAPI

Realtime Data Streaming Data Engineering Project | OpenWeather API | Kafka + Spark

Table of Contents

  • Introduction
  • System Architecture
  • Technologies
  • Getting Started

Introduction

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage from data ingestion to processing and finally to storage, utilizing a robust tech stack that includes Python, Apache Kafka, Apache Zookeeper, and Apache Spark. Everything is containerized using Docker for ease of deployment and scalability.

System Architecture

(Architecture diagram: OpenWeatherAPI-kafka)

The project is designed with the following components:

  • Data Source: The OpenWeather API supplies the weather data that feeds the pipeline.
  • Apache Kafka and Zookeeper: Used for streaming data to the processing engine.
  • Control Center and Schema Registry: Helps in monitoring and schema management of our Kafka streams.
  • Apache Spark: For data processing with its master and worker nodes.
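The ingestion step described above — pulling weather data from the OpenWeather API and streaming it into Kafka — can be sketched roughly as follows. The field names come from OpenWeather's documented current-weather response; the topic name `weather`, the broker address, and the `kafka-python` package are assumptions, not taken from this repository:

```python
import json

def flatten_weather(raw: dict) -> dict:
    """Flatten an OpenWeather current-weather JSON payload into a
    Kafka-ready record (field names follow OpenWeather's response format)."""
    return {
        "city": raw["name"],
        "description": raw["weather"][0]["description"],
        "temperature": raw["main"]["temp"],   # Kelvin by default
        "humidity": raw["main"]["humidity"],
        "timestamp": raw["dt"],
    }

# Producing to Kafka would then look roughly like this (assumes the
# kafka-python package and a broker on localhost:9092, as in a typical
# docker-compose setup -- adjust to your environment):
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(
#       bootstrap_servers="localhost:9092",
#       value_serializer=lambda v: json.dumps(v).encode("utf-8"),
#   )
#   producer.send("weather", flatten_weather(response.json()))

if __name__ == "__main__":
    sample = {
        "name": "London",
        "weather": [{"description": "light rain"}],
        "main": {"temp": 283.15, "humidity": 81},
        "dt": 1700000000,
    }
    print(json.dumps(flatten_weather(sample)))
```

Flattening before producing keeps each Kafka message a small, schema-friendly record, which also makes Schema Registry integration straightforward.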

Technologies

  • Python
  • Apache Kafka
  • Apache Zookeeper
  • Apache Spark
  • Docker

Getting Started

  1. Run Docker Compose to spin up the services:

    docker-compose up
  2. Run the main data pipeline via the Jupyter notebook OpenWeatherAPI_Kafka.ipynb.

  3. Two alternative pipelines, built with Spark and Pandas respectively, are in OpenWeatherAPI_Spark.ipynb and OpenWeatherAPI_Pandas.ipynb.
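Downstream of Kafka, the processing notebooks consume the streamed records and aggregate them. A minimal consumer-side sketch of that kind of aggregation — the record fields and topic name here are illustrative, not taken from the notebooks:

```python
from statistics import mean

def summarize(records: list[dict]) -> dict:
    """Average temperature per city over a batch of flattened weather
    records, the kind of aggregation a Spark or Pandas pipeline might
    run over messages from the weather topic."""
    by_city: dict[str, list[float]] = {}
    for rec in records:
        by_city.setdefault(rec["city"], []).append(rec["temperature"])
    return {city: mean(temps) for city, temps in by_city.items()}

# Reading the batch from Kafka would look roughly like this
# (kafka-python assumed):
#
#   import json
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer("weather", bootstrap_servers="localhost:9092",
#                            value_deserializer=lambda v: json.loads(v))
#   batch = [msg.value for msg in consumer]

if __name__ == "__main__":
    batch = [
        {"city": "London", "temperature": 283.0},
        {"city": "London", "temperature": 285.0},
        {"city": "Paris", "temperature": 290.0},
    ]
    print(summarize(batch))  # {'London': 284.0, 'Paris': 290.0}
```

In the Spark notebook the same grouping would map onto `groupBy("city").avg("temperature")` over a DataFrame; the sketch above shows the logic with the standard library only.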