
Distributed Job Scheduler

A production-ready distributed task scheduler built in Go, featuring Raft-based leader election, persistent task queues, worker pools, and comprehensive monitoring.

🌟 Features

Core Capabilities

  • Leader Election: Raft consensus algorithm ensures only one scheduler distributes tasks
  • Distributed Task Queue: Persistent, fault-tolerant task storage with priority support
  • Worker Pool Management: Dynamic worker registration, health monitoring, and load balancing
  • Task Retry Logic: Exponential backoff with configurable retry policies (see the backoff sketch after this list)
  • Cron Scheduling: Support for recurring tasks with cron expressions
  • Dead Letter Queue: Failed tasks moved to DLQ for analysis
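
A minimal sketch of the exponential-backoff retry policy mentioned above. The RetryPolicy type and its fields are illustrative assumptions, not the repository's actual API.

// Hypothetical retry policy: the delay doubles per attempt and is capped at MaxDelay.
package main

import (
    "fmt"
    "math"
    "time"
)

type RetryPolicy struct {
    MaxRetries int
    BaseDelay  time.Duration
    MaxDelay   time.Duration
}

// NextDelay returns the wait before the given attempt (1-based).
func (p RetryPolicy) NextDelay(attempt int) time.Duration {
    d := time.Duration(float64(p.BaseDelay) * math.Pow(2, float64(attempt-1)))
    if d > p.MaxDelay {
        d = p.MaxDelay
    }
    return d
}

func main() {
    p := RetryPolicy{MaxRetries: 4, BaseDelay: time.Second, MaxDelay: 30 * time.Second}
    for attempt := 1; attempt <= p.MaxRetries; attempt++ {
        fmt.Printf("attempt %d: wait %s\n", attempt, p.NextDelay(attempt))
    }
}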

Production Ready

  • High Availability: Multi-node cluster with automatic failover
  • Horizontal Scalability: Add schedulers and workers dynamically
  • Persistent Storage: BadgerDB for local state, Raft log for consensus
  • Monitoring: Prometheus metrics, Grafana dashboards, health endpoints, structured logging
  • Graceful Shutdown: Clean resource cleanup and task handoff (see the shutdown sketch after this list)
  • RESTful API: Complete API for task management
  • Web Dashboard: Real-time monitoring UI with live updates
  • Authentication: JWT and API key support with RBAC
  • Webhooks: Event-driven HTTP callbacks for task lifecycle events
  • Circuit Breakers: Automatic failure detection and recovery for external dependencies
  • CLI Tool: Command-line interface for task management
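
The graceful-shutdown item above generally comes down to trapping SIGINT/SIGTERM and draining in-flight work. Below is a minimal standard-library sketch, with a bare HTTP server standing in for the scheduler's real components.

// Stop accepting requests on SIGINT/SIGTERM, then give in-flight work up to 10s to finish.
package main

import (
    "context"
    "log"
    "net/http"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":8001"}
    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Block until an interrupt or termination signal arrives.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
    defer stop()
    <-ctx.Done()

    // Drain in-flight requests before exiting; task handoff would happen here.
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if err := srv.Shutdown(shutdownCtx); err != nil {
        log.Printf("shutdown: %v", err)
    }
}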

Advanced Features

  • Workflow Engine: DAG-based task workflows with dependencies and visual representation
  • Multi-Language SDKs: Python and TypeScript/JavaScript clients
  • Role-Based Access Control: Fine-grained permissions (admin, operator, viewer)
  • Rate Limiting: Per-user/namespace request throttling
  • Audit Logging: Complete audit trail of all operations

Task Features

  • Priority-based scheduling (1-10), sketched after this list
  • Task dependencies (DAG support)
  • Timeout enforcement
  • Task cancellation
  • Rate limiting per task type
  • Multi-tenancy with namespaces
  • Custom task metadata and tags
  • Cron-style scheduled tasks
  • Task templates and composition
  • Event-driven task triggers
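
The priority-based scheduling above maps naturally onto Go's container/heap. The Task struct below is a simplified stand-in for the repository's model, with Less ordering higher priorities first.

// Pop always returns the highest-priority task.
package main

import (
    "container/heap"
    "fmt"
)

type Task struct {
    Name     string
    Priority int // 1 (lowest) to 10 (highest)
}

type TaskQueue []Task

func (q TaskQueue) Len() int           { return len(q) }
func (q TaskQueue) Less(i, j int) bool { return q[i].Priority > q[j].Priority }
func (q TaskQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *TaskQueue) Push(x any)        { *q = append(*q, x.(Task)) }
func (q *TaskQueue) Pop() any {
    old := *q
    n := len(old)
    t := old[n-1]
    *q = old[:n-1]
    return t
}

func main() {
    q := &TaskQueue{{Name: "low", Priority: 2}, {Name: "high", Priority: 9}, {Name: "mid", Priority: 5}}
    heap.Init(q)
    for q.Len() > 0 {
        fmt.Println(heap.Pop(q).(Task).Name) // high, mid, low
    }
}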

πŸ“‹ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Client Applications                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚ HTTP REST API
                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Scheduler Cluster (Raft)                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
β”‚  β”‚Scheduler1β”‚  β”‚Scheduler2β”‚  β”‚Scheduler3β”‚              β”‚
β”‚  β”‚ (Leader) β”‚  β”‚(Follower)β”‚  β”‚(Follower)β”‚              β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚
β”‚       β”‚ Raft Consensus + Task Distribution              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚ gRPC
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Worker Pool                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”        β”‚
β”‚  β”‚Worker 1β”‚  β”‚Worker 2β”‚  β”‚Worker 3β”‚  β”‚Worker Nβ”‚        β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Monitoring & Storage                        β”‚
β”‚  [Prometheus] [BadgerDB] [Raft Logs]                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸš€ Quick Start

Prerequisites

  • Go 1.21+
  • Docker & Docker Compose (optional)

Run Locally

# Start a 3-node scheduler cluster
go run cmd/scheduler/main.go --node-id=node1 --http-addr=:8001 --raft-addr=:9001 --grpc-addr=:7001

go run cmd/scheduler/main.go --node-id=node2 --http-addr=:8002 --raft-addr=:9002 --grpc-addr=:7002 --join=localhost:9001

go run cmd/scheduler/main.go --node-id=node3 --http-addr=:8003 --raft-addr=:9003 --grpc-addr=:7003 --join=localhost:9001

# Start workers
go run cmd/worker/main.go --worker-id=worker1 --scheduler=localhost:7001
go run cmd/worker/main.go --worker-id=worker2 --scheduler=localhost:7001

Run with Docker Compose

docker-compose up -d

This starts:

  • 3 scheduler nodes (ports 8001-8003)
  • 5 worker nodes
  • Prometheus (port 9090)
  • Web Dashboard (port 3000)

πŸ“ Usage Examples

Submit a Task

curl -X POST http://localhost:8001/api/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "name": "data-processing",
    "type": "batch",
    "priority": 5,
    "payload": {
      "input_file": "data.csv",
      "operation": "aggregate"
    },
    "timeout": 300,
    "max_retries": 3
  }'
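
The same submission can also be made programmatically. This is a plain net/http sketch against the JSON shape shown above; it is not an official client SDK.

// POST a task to the scheduler's REST API and print the response status.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    task := map[string]any{
        "name":     "data-processing",
        "type":     "batch",
        "priority": 5,
        "payload": map[string]string{
            "input_file": "data.csv",
            "operation":  "aggregate",
        },
        "timeout":     300,
        "max_retries": 3,
    }
    body, err := json.Marshal(task)
    if err != nil {
        panic(err)
    }

    resp, err := http.Post("http://localhost:8001/api/v1/tasks", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}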

Create Recurring Task

curl -X POST http://localhost:8001/api/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "name": "daily-report",
    "type": "report",
    "schedule": "0 0 * * *",
    "payload": {"report_type": "daily"}
  }'

Check Task Status

curl http://localhost:8001/api/v1/tasks/{task-id}

List Tasks

curl "http://localhost:8001/api/v1/tasks?status=pending&priority=5"

πŸ”§ Configuration

Configuration via YAML file or environment variables:

# config/scheduler.yaml
node:
  id: "node1"
  data_dir: "./data"

http:
  addr: ":8001"

grpc:
  addr: ":7001"

raft:
  addr: ":9001"
  bootstrap: true
  join_addr: ""

scheduler:
  task_timeout: 300s
  worker_timeout: 30s
  max_retries: 3

storage:
  backend: "badger"
  path: "./data/tasks"

logging:
  level: "info"
  format: "json"
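
For illustration, a file like the one above could be loaded with gopkg.in/yaml.v3. The struct below mirrors a subset of the sample config and is not the repository's actual config package; durations such as task_timeout are kept as strings and parsed explicitly.

// Load a subset of scheduler.yaml and parse the task timeout.
package main

import (
    "fmt"
    "os"
    "time"

    "gopkg.in/yaml.v3"
)

type Config struct {
    Node struct {
        ID      string `yaml:"id"`
        DataDir string `yaml:"data_dir"`
    } `yaml:"node"`
    Scheduler struct {
        TaskTimeout   string `yaml:"task_timeout"`
        WorkerTimeout string `yaml:"worker_timeout"`
        MaxRetries    int    `yaml:"max_retries"`
    } `yaml:"scheduler"`
}

func main() {
    raw, err := os.ReadFile("config/scheduler.yaml")
    if err != nil {
        panic(err)
    }
    var cfg Config
    if err := yaml.Unmarshal(raw, &cfg); err != nil {
        panic(err)
    }
    timeout, err := time.ParseDuration(cfg.Scheduler.TaskTimeout)
    if err != nil {
        panic(err)
    }
    fmt.Printf("node=%s task_timeout=%s\n", cfg.Node.ID, timeout)
}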

πŸ“Š Monitoring

Prometheus Metrics

Available at http://localhost:8001/metrics (see the registration sketch after this list):

  • scheduler_tasks_total{status} - Total tasks by status
  • scheduler_tasks_duration_seconds - Task execution duration
  • scheduler_workers_active - Active worker count
  • scheduler_leader_elections_total - Leader election count
  • scheduler_queue_depth - Current queue depth
  • worker_tasks_processed_total - Tasks processed by worker
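
These metric names follow the usual prometheus/client_golang pattern; the snippet below registers one of them as an example and is not taken from the project's metrics package.

// Register a counter vector and expose it on /metrics.
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var tasksTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "scheduler_tasks_total",
        Help: "Total tasks by status.",
    },
    []string{"status"},
)

func main() {
    prometheus.MustRegister(tasksTotal)
    tasksTotal.WithLabelValues("completed").Inc()

    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8001", nil))
}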

Health Checks

# Scheduler health
curl http://localhost:8001/health

# Worker health
curl http://localhost:8101/health

Web Dashboard

Access at http://localhost:3000:

  • Real-time task statistics
  • Worker pool status
  • Queue visualization
  • Cluster health

πŸ—οΈ Project Structure

.
β”œβ”€β”€ cmd/
β”‚   β”œβ”€β”€ scheduler/          # Scheduler node binary
β”‚   └── worker/             # Worker node binary
β”œβ”€β”€ pkg/
β”‚   β”œβ”€β”€ api/                # REST API handlers
β”‚   β”œβ”€β”€ consensus/          # Raft consensus implementation
β”‚   β”œβ”€β”€ scheduler/          # Core scheduling logic
β”‚   β”œβ”€β”€ worker/             # Task execution engine
β”‚   β”œβ”€β”€ storage/            # Persistence layer
β”‚   β”œβ”€β”€ models/             # Data models
β”‚   β”œβ”€β”€ queue/              # Priority queue implementation
β”‚   β”œβ”€β”€ proto/              # gRPC/protobuf definitions
β”‚   β”œβ”€β”€ metrics/            # Prometheus metrics
β”‚   └── logger/             # Structured logging
β”œβ”€β”€ web/                    # Dashboard UI (React)
β”œβ”€β”€ config/                 # Configuration files
β”œβ”€β”€ deployments/
β”‚   β”œβ”€β”€ docker/             # Dockerfiles
β”‚   β”œβ”€β”€ kubernetes/         # K8s manifests
β”‚   └── docker-compose.yml
β”œβ”€β”€ tests/                  # Integration tests
β”œβ”€β”€ examples/               # Usage examples
└── scripts/                # Utility scripts

πŸ§ͺ Testing

# Unit tests
go test ./...

# Integration tests
go test -tags=integration ./tests/...

# Load testing
go run examples/load_test.go --tasks=10000 --workers=50
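
The -tags=integration flag works through Go build constraints: integration tests carry a build tag so a plain go test ./... skips them. A hypothetical file such as tests/cluster_integration_test.go would be gated like this.

//go:build integration

// Skipped unless the test binary is built with -tags=integration.
package tests

import "testing"

func TestSubmitAndComplete(t *testing.T) {
    // In a real integration test this would start a cluster, submit a task
    // over the REST API, and poll until the task reaches a terminal state.
    t.Skip("requires a running scheduler cluster")
}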

🚒 Deployment

Kubernetes

kubectl apply -f deployments/kubernetes/

This deploys:

  • StatefulSet for scheduler cluster (3 replicas)
  • Deployment for workers (auto-scaling)
  • Services and ingress
  • ConfigMaps and secrets

Production Checklist

  • Configure persistent volumes for scheduler data
  • Set up monitoring alerts in Prometheus
  • Enable TLS for gRPC and HTTP
  • Configure resource limits
  • Set up log aggregation
  • Enable authentication/authorization
  • Configure backup strategy for Raft state
  • Set up distributed tracing (optional)

πŸŽ“ Learning Concepts

This project teaches:

  1. Leader Election: Raft consensus ensures one leader coordinates work
  2. Distributed Consensus: How nodes agree on cluster state
  3. Task Distribution: Load balancing strategies and worker selection
  4. Fault Tolerance: Handling node failures and network partitions
  5. Persistent State: Maintaining consistency across restarts
  6. Monitoring: Observability in distributed systems
  7. Graceful Degradation: Circuit breakers and retry logic (see the breaker sketch below)
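
A toy circuit breaker showing the graceful-degradation idea from item 7; the thresholds, cooldown, and type names are invented for the example.

// After `threshold` consecutive failures, calls fail fast until the cooldown expires.
package main

import (
    "errors"
    "fmt"
    "time"
)

var ErrOpen = errors.New("circuit open")

type CircuitBreaker struct {
    failures  int
    threshold int
    cooldown  time.Duration
    openUntil time.Time
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    if time.Now().Before(cb.openUntil) {
        return ErrOpen // fail fast while the breaker is open
    }
    if err := fn(); err != nil {
        cb.failures++
        if cb.failures >= cb.threshold {
            cb.openUntil = time.Now().Add(cb.cooldown)
            cb.failures = 0
        }
        return err
    }
    cb.failures = 0 // any success closes the breaker again
    return nil
}

func main() {
    cb := &CircuitBreaker{threshold: 3, cooldown: 5 * time.Second}
    for i := 0; i < 5; i++ {
        err := cb.Call(func() error { return errors.New("downstream unavailable") })
        fmt.Println(i, err)
    }
}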

🀝 Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

πŸ“„ License

MIT License - see LICENSE file
