
[DMP 2025]: Scalable, Decoupled Data Ingestion Framework for Real-Time and Batch Integration #544

@tanishk2907


Ticket Contents

Description

Project Context
MedPlat processes high volumes of data originating from a variety of sources—mobile apps, web portals, legacy systems, and third-party APIs. These data streams come in both real-time and batch formats, leading to growing challenges in integration, consistency, and timely availability for analytics and decision-making.

Currently, MedPlat uses tightly coupled, manually managed data integrations. This creates:

  • High maintenance overhead
  • Delays in data availability for dashboards and alerts
  • Inflexibility when onboarding new systems or supporting real-time use cases

Objective
Design and implement a scalable, loosely coupled data ingestion framework that can efficiently integrate heterogeneous sources and deliver consistent, timely data to storage and analytics pipelines.

The solution should be resilient, modular, and extensible, with support for:

  • Real-time ingestion via Kafka (or alternatives)
  • Batch ingestion via APIs or file uploads
  • A pluggable, contract-driven interface for onboarding sources (see the sketch after this list)
  • Observability and fault tolerance
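
As a rough illustration of the pluggable, contract-driven interface above, a minimal connector abstraction in Python (one of the candidate languages under Implementation Details) might look like this; `SourceConnector`, `Record`, and `CsvFileSource` are hypothetical names, not existing MedPlat code:

```python
# Hypothetical sketch of a pluggable source interface; all names are illustrative.
import csv
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Record:
    """One ingested datum plus the contract version it claims to conform to."""
    source: str
    schema_version: str
    payload: dict


class SourceConnector(ABC):
    """Contract every source (real-time or batch) implements to be onboarded."""

    @abstractmethod
    def fetch(self) -> Iterator[Record]:
        """Yield records; a streaming source blocks, a batch source is finite."""


class CsvFileSource(SourceConnector):
    """Example batch connector: reads rows from an uploaded CSV file."""

    def __init__(self, path: str):
        self.path = path

    def fetch(self) -> Iterator[Record]:
        with open(self.path, newline="") as f:
            for row in csv.DictReader(f):
                yield Record(source="csv-upload", schema_version="1.0", payload=row)
```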

Goals & Mid-Point Milestone

Goals

  • Provide a unified data ingestion layer capable of handling both streaming and batch data
  • Reduce system complexity and eliminate hardcoded integrations via a plug-and-play architecture
  • Centralize data flow through a message queue or dispatcher service (see the sketch after this list)
  • Enable clean data routing to downstream storage, analytics engines, and alerting systems
  • Design system with fault tolerance, retry logic, and monitoring
  • Create documentation for extending the pipeline to new modules or partner systems
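
A minimal sketch of the dispatcher-with-retries idea from the goals above, assuming Python and in-process sinks; the class name, retry count, and backoff policy are placeholder choices, not a committed design:

```python
# Illustrative dispatcher with simple retry; all names are assumptions, not MedPlat code.
import logging
import time
from typing import Callable, Dict, List

logger = logging.getLogger("ingest.dispatcher")

Sink = Callable[[dict], None]  # a sink is any callable that persists one record


class Dispatcher:
    """Routes each record to every sink registered for its source, with retries."""

    def __init__(self, max_retries: int = 3, backoff_s: float = 1.0):
        self.routes: Dict[str, List[Sink]] = {}
        self.max_retries = max_retries
        self.backoff_s = backoff_s

    def register(self, source: str, sink: Sink) -> None:
        self.routes.setdefault(source, []).append(sink)

    def dispatch(self, source: str, record: dict) -> None:
        for sink in self.routes.get(source, []):
            for attempt in range(1, self.max_retries + 1):
                try:
                    sink(record)
                    break
                except Exception:
                    logger.warning("sink failed (attempt %d/%d)", attempt, self.max_retries)
                    time.sleep(self.backoff_s * attempt)  # linear backoff
            else:
                logger.error("record from %s dropped after all retries", source)
```

In practice the retry and dead-letter behaviour would likely be pushed into the broker itself (e.g., consumer offsets and dead-letter topics in Kafka) rather than application code; the sketch only shows the routing shape.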

Setup/Installation

No response

Expected Outcome

  • A centralized, scalable ingestion system that improves reliability, onboarding speed, and downstream analytics
  • Near real-time or scheduled availability of data for dashboards, alerts, and models
  • Reduced manual effort in maintaining integrations
  • Well-documented architecture and data contracts to guide future expansion

Acceptance Criteria

✅ Ingestion layer supports at least 2 batch and 2 real-time sources
✅ Plug-and-play design allows new sources/sinks to be configured easily (example configuration shown below)
✅ Messages/data can be routed to multiple downstream systems (e.g., PostgreSQL, ElasticSearch, S3)
✅ Retry mechanisms, logging, and status monitoring are in place
✅ Architecture diagrams, setup scripts, and integration guides are included in the repo
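
One possible shape for the plug-and-play routing configuration, written as a Python dict purely for illustration; the sink names come from the criteria above, while the source names and structure are assumptions:

```python
# Hypothetical routing table; source names and structure are invented for the example.
ROUTING = {
    "mobile-app-events": {      # real-time source (e.g., a Kafka topic)
        "mode": "stream",
        "sinks": ["postgresql", "elasticsearch"],
    },
    "partner-csv-upload": {     # batch source (file upload)
        "mode": "batch",
        "sinks": ["s3", "postgresql"],
    },
}
```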

Implementation Details

Event Broker (Real-time): Apache Kafka / Apache Pulsar / RabbitMQ
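
If Kafka is chosen, a minimal real-time consumer sketch using the kafka-python client could look like the following; the topic name, broker address, and group id are placeholders:

```python
# Minimal consumer sketch (pip install kafka-python); all identifiers are placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "medplat.ingest.events",                 # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="ingestion-framework",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # hand off to contract validation and the dispatcher here
    print(record)
```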

Batch Ingestion: Scheduled pull via API, FTP, or manual CSV ingest via CLI
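
The manual CSV ingest via CLI could start as small as this standard-library sketch; the flag names and source-naming convention are illustrative:

```python
# CSV ingest CLI sketch using only the standard library; flags are illustrative.
import argparse
import csv


def main() -> None:
    parser = argparse.ArgumentParser(description="Ingest a CSV file as one batch")
    parser.add_argument("path", help="CSV file to ingest")
    parser.add_argument("--source", default="manual-upload", help="logical source name")
    args = parser.parse_args()

    with open(args.path, newline="") as f:
        rows = list(csv.DictReader(f))
    # in the real framework each row would pass through contract validation
    # and the dispatcher; here we only report the batch size
    print(f"read {len(rows)} rows from {args.path} for source {args.source}")


if __name__ == "__main__":
    main()
```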

Language: Python / Node.js / Golang

Data Schema: JSON, Avro, or Parquet with versioned contract registry
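
A versioned contract registry could begin as a simple mapping from (source, version) to required fields before graduating to a full schema registry; the `vitals` contract below is entirely hypothetical:

```python
# Illustrative in-memory contract registry; sources, versions, and fields are invented.
CONTRACTS = {
    ("vitals", "1.0"): {"required": ["patient_id", "recorded_at", "heart_rate"]},
    ("vitals", "1.1"): {"required": ["patient_id", "recorded_at", "heart_rate", "spo2"]},
}


def validate(source: str, version: str, payload: dict) -> None:
    """Reject payloads missing fields required by their declared contract version."""
    contract = CONTRACTS.get((source, version))
    if contract is None:
        raise KeyError(f"no contract registered for {source} v{version}")
    missing = [f for f in contract["required"] if f not in payload]
    if missing:
        raise ValueError(f"payload missing required fields: {missing}")
```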

Observability: Logs via ELK/FluentD, Prometheus metrics for system health
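
On the Prometheus side, instrumenting the pipeline with the official prometheus_client package might look like this; the metric names and scrape port are placeholders:

```python
# Metrics sketch (pip install prometheus-client); names and port are placeholders.
from prometheus_client import Counter, Histogram, start_http_server

RECORDS_IN = Counter("ingest_records_total", "Records received", ["source"])
FAILURES = Counter("ingest_failures_total", "Records failing all retries", ["source"])
LATENCY = Histogram("ingest_dispatch_seconds", "Time to route one record")

start_http_server(8000)  # expose /metrics for Prometheus to scrape

with LATENCY.time():                                   # time one dispatch
    RECORDS_IN.labels(source="mobile-app-events").inc()
```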

Scalability: Modular microservices or container-based deployment (Docker + Kubernetes optional)

Mockups/Wireframes

Flow diagrams and data contracts to be created collaboratively with MedPlat data engineering team.

Product Name

Unified Data Ingestion Framework

Organisation Name

Bandhu

Domain

Healthcare

Tech Skills Needed

Angular

Mentor(s)

@mvadodariya

Category

Backend
