The IRCTC Real-Time Data Pipeline is a cloud-based data processing system that ingests, transforms, and stores real-time streaming data modeled on IRCTC (Indian Railway Catering and Tourism Corporation) passenger records. The project uses Google Cloud Platform (GCP) services, namely Pub/Sub, Dataflow (Apache Beam), BigQuery, and Cloud Storage, to process, transform, and analyze the data.
- Data Ingestion: Simulated (mock) IRCTC data is published to Google Cloud Pub/Sub.
- Data Processing: A Dataflow pipeline (Apache Beam) reads messages from Pub/Sub and applies Python UDFs that transform the data and handle malformed records for fault tolerance.
- Data Storage: The transformed data is stored in Google BigQuery for analytics.
- UDF Registration: User-defined functions (transform_UDF.py) are registered from Google Cloud Storage to BigQuery.
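
The processing step above can be sketched as a plain Python function. This is a minimal illustration, not the contents of `transform_UDF.py`: the function names, the loyalty tier thresholds, and the `("ok" | "error", ...)` return convention are all assumptions made for the example. It derives `loyalty_status` and `account_age_days` from an incoming record and returns bad input as an error result instead of raising, which is one common way to keep a streaming job running when a message is malformed.

```python
import json
from datetime import date, datetime, timezone

# Hypothetical tier thresholds; the README does not state the real rules.
LOYALTY_TIERS = [(1000, "Gold"), (500, "Silver"), (0, "Bronze")]


def loyalty_status(points):
    """Map loyalty points to a tier name using the assumed thresholds."""
    for minimum, tier in LOYALTY_TIERS:
        if points >= minimum:
            return tier
    return "Bronze"  # negative points fall back to the lowest tier


def transform_message(payload, today=None):
    """Parse one Pub/Sub payload (bytes) and return ("ok", row) or ("error", details).

    Malformed input is returned as an "error" result rather than raised,
    so a single bad message cannot stop the stream.
    """
    today = today or datetime.now(timezone.utc).date()
    try:
        record = json.loads(payload.decode("utf-8"))
        join_date = date.fromisoformat(record["join_date"])
        row = dict(record)
        row["loyalty_status"] = loyalty_status(int(record["loyalty_points"]))
        row["account_age_days"] = (today - join_date).days
        row["inserted_at"] = datetime.now(timezone.utc).isoformat()
        return "ok", row
    except (UnicodeDecodeError, ValueError, KeyError, TypeError) as exc:
        # Keep the raw text and the reason so the record can be inspected later.
        raw = payload.decode("utf-8", errors="replace")
        return "error", {"raw": raw, "error": repr(exc)}
```

In an Apache Beam pipeline, a function like this would typically be applied with `beam.Map` after reading from Pub/Sub, with the `"error"` results routed to a separate dead-letter destination rather than to the main BigQuery table.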
- Google Cloud Pub/Sub → Real-time message streaming
- Google Dataflow (Apache Beam) → Data processing and transformation
- Google BigQuery → Data warehouse for analytics
- Google Cloud Storage → Stores UDF files
- Python → Apache Beam pipeline & UDF implementation
- SQL → Data transformation & querying in BigQuery
- Terraform (Optional) → Infrastructure as Code (IaC) for GCP setup
✔️ Real-time data ingestion using Pub/Sub
✔️ Serverless & scalable processing via Dataflow
✔️ Custom transformations using Python UDFs
✔️ Fault tolerance & error handling
✔️ Data warehousing for analytics using BigQuery
✔️ Optimized SQL queries for analysis and reporting
Column Name | Data Type | Description |
---|---|---|
row_key | STRING | Unique identifier for each record |
name | STRING | Passenger's name |
age | INT64 | Passenger's age |
email | STRING | Passenger's email address |
join_date | DATE | Date when the passenger joined |
last_login | TIMESTAMP | Last login timestamp |
loyalty_points | INT64 | Loyalty points earned |
account_balance | FLOAT64 | Account balance in INR |
is_active | BOOL | Indicates if the account is active |
inserted_at | TIMESTAMP | Timestamp when the record was inserted |
updated_at | TIMESTAMP | Last updated timestamp |
loyalty_status | STRING | Loyalty membership status |
account_age_days | INT64 | Total days since account creation |
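
To make the schema concrete, here is a hedged sketch of a mock-record generator whose output matches the source columns above. The function names and value ranges are hypothetical, and the example assumes that `inserted_at`, `updated_at`, `loyalty_status`, and `account_age_days` are added later by the pipeline rather than sent by the publisher; the README does not confirm that split.

```python
import json
import random
from datetime import date, datetime, timedelta, timezone


def make_mock_record(rng, today=None):
    """Build one fake passenger record with the assumed source fields."""
    today = today or date.today()
    join_date = today - timedelta(days=rng.randint(0, 3650))
    last_login = datetime.now(timezone.utc) - timedelta(hours=rng.randint(0, 720))
    user_id = rng.randint(1, 10**6)
    return {
        "row_key": f"user-{user_id:07d}",
        "name": f"Passenger {user_id}",
        "age": rng.randint(18, 90),
        "email": f"passenger{user_id}@example.com",
        "join_date": join_date.isoformat(),           # DATE
        "last_login": last_login.isoformat(),         # TIMESTAMP
        "loyalty_points": rng.randint(0, 2000),
        "account_balance": round(rng.uniform(0, 50000), 2),  # INR
        "is_active": rng.random() < 0.8,
    }


def encode_for_pubsub(record):
    """Serialize a record to UTF-8 JSON bytes; Pub/Sub message data must be bytes."""
    return json.dumps(record).encode("utf-8")


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    print(encode_for_pubsub(make_mock_record(rng)))
```

The resulting bytes could then be handed to the `publish` method of a `google-cloud-pubsub` `PublisherClient`; the topic name and project ID would come from your own GCP setup.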
- 📊 Passenger Behavior Analysis: Using real-time & historical data to understand customer trends.
- 🎁 Loyalty Program Management: Enhancing customer engagement through data-driven rewards.
- 🔍 Operational Monitoring: Identifying active/inactive users for improved service efficiency.
- 📈 Trend Analysis: Leveraging BigQuery for actionable business insights.
✅ Integrate Cloud Functions for event-driven triggers.
✅ Implement Dataflow Streaming Mode for real-time analytics.
✅ Optimize BigQuery Queries to enhance cost efficiency and performance.
Sujit Mahapatra
Contributions are welcome! If you’d like to improve the project, feel free to fork the repository and submit a pull request.