A production-ready distributed task scheduler built in Go, featuring Raft-based leader election, persistent task queues, worker pools, and comprehensive monitoring.
- Leader Election: Raft consensus algorithm ensures only one scheduler distributes tasks
- Distributed Task Queue: Persistent, fault-tolerant task storage with priority support
- Worker Pool Management: Dynamic worker registration, health monitoring, and load balancing
- Task Retry Logic: Exponential backoff with configurable retry policies (see the backoff sketch after these feature lists)
- Cron Scheduling: Support for recurring tasks with cron expressions
- Dead Letter Queue: Failed tasks moved to DLQ for analysis
- High Availability: Multi-node cluster with automatic failover
- Horizontal Scalability: Add schedulers and workers dynamically
- Persistent Storage: BadgerDB for local state, Raft log for consensus
- Monitoring: Prometheus metrics, Grafana dashboards, health endpoints, structured logging
- Graceful Shutdown: Clean resource cleanup and task handoff
- RESTful API: Complete API for task management
- Web Dashboard: Real-time monitoring UI with live updates
- Authentication: JWT and API key support
- Role-Based Access Control: Fine-grained permissions (admin, operator, viewer)
- Webhooks: Event-driven HTTP callbacks for the task lifecycle
- Circuit Breakers: Automatic failure detection and recovery for external dependencies
- CLI Tool: Command-line interface for task management
- Workflow Engine: DAG-based task workflows with dependencies and visual representation
- Multi-Language SDKs: Python, TypeScript/JavaScript clients
- Rate Limiting: Per-user/namespace request throttling
- Audit Logging: Complete audit trail of all operations
- Priority-based scheduling (1-10)
- Task dependencies (DAG support)
- Timeout enforcement
- Task cancellation
- Rate limiting per task type
- Multi-tenancy with namespaces
- Custom task metadata and tags
- Cron-style scheduled tasks
- Task templates and composition
- Event-driven task triggers
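The retry logic named above combines an attempt cap with exponentially growing delays. A minimal sketch of how such a policy can be computed; the type and field names are illustrative, not the project's actual API:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// RetryPolicy is an illustrative shape for a configurable retry policy.
type RetryPolicy struct {
	MaxRetries int           // give up after this many attempts
	BaseDelay  time.Duration // delay before the first retry
	MaxDelay   time.Duration // upper bound on any single delay
}

// Backoff returns the delay before retry attempt n (0-indexed):
// BaseDelay * 2^n, capped at MaxDelay.
func (p RetryPolicy) Backoff(attempt int) time.Duration {
	d := time.Duration(float64(p.BaseDelay) * math.Pow(2, float64(attempt)))
	if d > p.MaxDelay {
		d = p.MaxDelay
	}
	return d
}

func main() {
	p := RetryPolicy{MaxRetries: 3, BaseDelay: time.Second, MaxDelay: 30 * time.Second}
	for i := 0; i < p.MaxRetries; i++ {
		fmt.Println(p.Backoff(i)) // 1s, 2s, 4s
	}
}
```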
```
┌───────────────────────────────────────────────────────┐
│                  Client Applications                  │
└──────────────────┬────────────────────────────────────┘
                   │ HTTP REST API
                   ▼
┌───────────────────────────────────────────────────────┐
│               Scheduler Cluster (Raft)                │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐          │
│   │Scheduler1│   │Scheduler2│   │Scheduler3│          │
│   │ (Leader) │   │(Follower)│   │(Follower)│          │
│   └────┬─────┘   └──────────┘   └──────────┘          │
│        │  Raft Consensus + Task Distribution          │
└────────┼──────────────────────────────────────────────┘
         │ gRPC
         ▼
┌───────────────────────────────────────────────────────┐
│                     Worker Pool                       │
│   ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐      │
│   │Worker 1│  │Worker 2│  │Worker 3│  │Worker N│      │
│   └────────┘  └────────┘  └────────┘  └────────┘      │
└──────────────────────────┬────────────────────────────┘
                           │
                           ▼
┌───────────────────────────────────────────────────────┐
│                 Monitoring & Storage                  │
│     [Prometheus]    [BadgerDB]    [Raft Logs]         │
└───────────────────────────────────────────────────────┘
```
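The scheduler-to-worker hop in this diagram runs over gRPC. A minimal sketch of the connection setup on the worker side; the commented-out service and method names are hypothetical placeholders for the generated client in pkg/proto:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Connect to a scheduler's gRPC port (7001 in the quick start below).
	conn, err := grpc.NewClient("localhost:7001",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("connect to scheduler: %v", err)
	}
	defer conn.Close()

	// The generated client from pkg/proto would be used here, e.g.
	// (hypothetical names):
	//   client := proto.NewSchedulerClient(conn)
	//   client.RegisterWorker(ctx, &proto.RegisterWorkerRequest{WorkerId: "worker1"})
}
```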
- Go 1.21+
- Docker & Docker Compose (optional)
```bash
# Start a 3-node scheduler cluster
go run cmd/scheduler/main.go --node-id=node1 --http-addr=:8001 --raft-addr=:9001 --grpc-addr=:7001
go run cmd/scheduler/main.go --node-id=node2 --http-addr=:8002 --raft-addr=:9002 --grpc-addr=:7002 --join=localhost:9001
go run cmd/scheduler/main.go --node-id=node3 --http-addr=:8003 --raft-addr=:9003 --grpc-addr=:7003 --join=localhost:9001

# Start workers
go run cmd/worker/main.go --worker-id=worker1 --scheduler=localhost:7001
go run cmd/worker/main.go --worker-id=worker2 --scheduler=localhost:7001
```

Or bring up the whole stack with Docker Compose:

```bash
docker-compose up -d
```

This starts:
- 3 scheduler nodes (ports 8001-8003)
- 5 worker nodes
- Prometheus (port 9090)
- Web Dashboard (port 3000)
Submit a task:

```bash
curl -X POST http://localhost:8001/api/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "name": "data-processing",
    "type": "batch",
    "priority": 5,
    "payload": {
      "input_file": "data.csv",
      "operation": "aggregate"
    },
    "timeout": 300,
    "max_retries": 3
  }'
```

Schedule a recurring task with a cron expression:

```bash
curl -X POST http://localhost:8001/api/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "name": "daily-report",
    "type": "report",
    "schedule": "0 0 * * *",
    "payload": {"report_type": "daily"}
  }'
```

Fetch a single task, or list tasks with filters (quote the URL so the shell does not interpret `&`):

```bash
curl http://localhost:8001/api/v1/tasks/{task-id}
curl "http://localhost:8001/api/v1/tasks?status=pending&priority=5"
```
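The same submission works from Go with only the standard library; a minimal sketch against the /api/v1/tasks endpoint shown above (the response shape is assumed, not documented here):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	task := map[string]any{
		"name":        "data-processing",
		"type":        "batch",
		"priority":    5,
		"payload":     map[string]any{"input_file": "data.csv", "operation": "aggregate"},
		"timeout":     300,
		"max_retries": 3,
	}
	body, _ := json.Marshal(task)

	// POST the task to a scheduler node.
	resp, err := http.Post("http://localhost:8001/api/v1/tasks",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatalf("submit task: %v", err)
	}
	defer resp.Body.Close()

	var created map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&created); err != nil {
		log.Fatalf("decode response: %v", err)
	}
	fmt.Println("created:", created) // exact fields depend on the API
}
```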
Configuration is via a YAML file or environment variables:

```yaml
# config/scheduler.yaml
node:
  id: "node1"
  data_dir: "./data"
http:
  addr: ":8001"
grpc:
  addr: ":7001"
raft:
  addr: ":9001"
  bootstrap: true
  join_addr: ""
scheduler:
  task_timeout: 300s
  worker_timeout: 30s
  max_retries: 3
storage:
  backend: "badger"
  path: "./data/tasks"
logging:
  level: "info"
  format: "json"
```
Metrics are available at http://localhost:8001/metrics:

- `scheduler_tasks_total{status}` - Total tasks by status
- `scheduler_tasks_duration_seconds` - Task execution duration
- `scheduler_workers_active` - Active worker count
- `scheduler_leader_elections_total` - Leader election count
- `scheduler_queue_depth` - Current queue depth
- `worker_tasks_processed_total` - Tasks processed by worker
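Counters like these are conventionally defined with github.com/prometheus/client_golang; a minimal, self-contained sketch of how `scheduler_tasks_total` could be registered and served (assuming that library, which the Prometheus integration implies):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// tasksTotal tracks tasks by status, matching scheduler_tasks_total{status}.
var tasksTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "scheduler_tasks_total",
		Help: "Total tasks by status.",
	},
	[]string{"status"},
)

func main() {
	prometheus.MustRegister(tasksTotal)
	tasksTotal.WithLabelValues("completed").Inc() // record a completed task

	// Expose the registry; port is arbitrary for this standalone sketch.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```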
Schedulers and workers expose health endpoints:

```bash
# Scheduler health
curl http://localhost:8001/health

# Worker health
curl http://localhost:8101/health
```
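A /health endpoint of this shape takes only the standard library to serve; a minimal sketch with an illustrative response payload:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		// Illustrative fields; the real endpoint's payload may differ.
		json.NewEncoder(w).Encode(map[string]any{
			"status": "ok",
			"node":   "node1",
		})
	})
	log.Fatal(http.ListenAndServe(":8001", nil))
}
```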
The web dashboard at http://localhost:3000 shows:

- Real-time task statistics
- Worker pool status
- Queue visualization
- Cluster health
```
.
├── cmd/
│   ├── scheduler/       # Scheduler node binary
│   └── worker/          # Worker node binary
├── pkg/
│   ├── api/             # REST API handlers
│   ├── consensus/       # Raft consensus implementation
│   ├── scheduler/       # Core scheduling logic
│   ├── worker/          # Task execution engine
│   ├── storage/         # Persistence layer
│   ├── models/          # Data models
│   ├── queue/           # Priority queue implementation
│   ├── proto/           # gRPC/protobuf definitions
│   ├── metrics/         # Prometheus metrics
│   └── logger/          # Structured logging
├── web/                 # Dashboard UI (React)
├── config/              # Configuration files
├── deployments/
│   ├── docker/          # Dockerfiles
│   ├── kubernetes/      # K8s manifests
│   └── docker-compose.yml
├── tests/               # Integration tests
├── examples/            # Usage examples
└── scripts/             # Utility scripts
```
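pkg/queue is described as a priority queue; the idiomatic Go shape for one is container/heap. A minimal sketch of that pattern with illustrative types, not the package's actual code:

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is a queued task with a priority (higher runs first).
type item struct {
	name     string
	priority int
}

// taskQueue implements heap.Interface as a max-heap on priority.
type taskQueue []item

func (q taskQueue) Len() int           { return len(q) }
func (q taskQueue) Less(i, j int) bool { return q[i].priority > q[j].priority }
func (q taskQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *taskQueue) Push(x any)        { *q = append(*q, x.(item)) }
func (q *taskQueue) Pop() any {
	old := *q
	n := len(old)
	it := old[n-1]
	*q = old[:n-1]
	return it
}

func main() {
	q := &taskQueue{}
	heap.Init(q)
	heap.Push(q, item{"low", 1})
	heap.Push(q, item{"high", 9})
	fmt.Println(heap.Pop(q).(item).name) // "high"
}
```

Since container/heap builds a min-heap, Less compares priorities in reverse so the highest-priority task pops first.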
Run the test suites with:

```bash
# Unit tests
go test ./...

# Integration tests
go test -tags=integration ./tests/...

# Load testing
go run examples/load_test.go --tasks=10000 --workers=50
```

Deploy to Kubernetes with the bundled manifests:

```bash
kubectl apply -f deployments/kubernetes/
```

This deploys:
- StatefulSet for scheduler cluster (3 replicas)
- Deployment for workers (auto-scaling)
- Services and ingress
- ConfigMaps and secrets
For production use, also:

- Configure persistent volumes for scheduler data
- Set up monitoring alerts in Prometheus
- Enable TLS for gRPC and HTTP
- Configure resource limits
- Set up log aggregation
- Enable authentication/authorization
- Configure backup strategy for Raft state
- Set up distributed tracing (optional)
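In Kubernetes, pods receive SIGTERM on rollout or scale-down, which is what the Graceful Shutdown feature hooks into. A minimal sketch of the standard Go pattern for the HTTP side; the real binaries also hand off in-flight tasks alongside this:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8001"}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("serve: %v", err)
		}
	}()

	// Wait for SIGINT/SIGTERM (Kubernetes sends SIGTERM on pod shutdown).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop

	// Give in-flight requests time to finish before exiting.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```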
This project teaches:
- Leader Election: Raft consensus ensures one leader coordinates work (see the sketch after this list)
- Distributed Consensus: How nodes agree on cluster state
- Task Distribution: Load balancing strategies and worker selection
- Fault Tolerance: Handling node failures and network partitions
- Persistent State: Maintaining consistency across restarts
- Monitoring: Observability in distributed systems
- Graceful Degradation: Circuit breakers and retry logic
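For the leader-election point, the gating pattern with github.com/hashicorp/raft (a natural fit for the consensus layer described here, though the project's exact wiring may differ) is roughly:

```go
package scheduler

import (
	"time"

	"github.com/hashicorp/raft"
)

// runDistribution dispatches tasks only while this node is the Raft leader;
// followers stay idle, so exactly one scheduler hands out work.
func runDistribution(r *raft.Raft, dispatch func()) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for range ticker.C {
		if r.State() == raft.Leader {
			dispatch()
		}
	}
}
```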
Contributions welcome! Please read CONTRIBUTING.md first.
MIT License - see LICENSE file