Description
Spegel version
0.6.0
Kubernetes distribution
AKS
Kubernetes version
1.34.1
CNI
Azure CNI
Describe the bug
We're using Spegel for P2P image caching with GPU nodes that auto-scale based on demand. When a GPU node scales down and a new one later scales up, the new node experiences significant delays (1-3+ minutes) pulling images that should be served from the P2P cache in seconds, making cached pulls as slow as pulling directly from the registry.
Environment
- Kubernetes: AKS (Azure Kubernetes Service) v1.34.1
- OS: Ubuntu 24.04 on all nodes
- Containerd: 2.1.6-1
- Spegel Version: 0.6.0 (latest, via Helm chart)
- CNI: Azure CNI
- Registry: Azure Container Registry (ACR)
Root Cause Identified
When a node scales down, its Spegel pod terminates, but other Spegel pods retain DHT entries pointing to the dead node's IP for up to 10 minutes (the hardcoded TTL). When a new node joins and queries the DHT for image layers, it receives stale entries pointing to IPs that no longer exist.
Each TCP connection attempt to a dead IP triggers kernel-level TCP retransmit timeouts (~30 seconds) before failing and moving to the next peer.
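For context, the stall can be reproduced with a plain HTTP dial to the departed peer's address. This is a minimal sketch, not from the original test run: the IP and port are taken from the log entry in the next section, and the 30s budget mirrors the observed dial timeout.

```sh
# From a pod or node that still holds the stale DHT entry: nothing answers
# (or resets) the SYN at the old address, so the dial hangs for the full
# connect timeout instead of failing fast.
time curl --connect-timeout 30 http://10.244.0.10:5000/v2/
```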
Evidence from Logs
New GPU node's Spegel logs after scale-up:
{"msg":"request to mirror failed, retrying with next",
"mirror":"10.244.0.10:5000",
"err":"dial tcp 10.244.0.10:5000: i/o timeout"}The IP 10.244.0.10 was the old GPU node that was scaled down after load went down, and when it went a new one went back up due to load, the DHT still had entries pointing to the old one.
Timeline from Our Scale Test
| Time | Event |
|---|---|
| T+0s | New GPU node's Spegel pod starts |
| T+27s | Bootstrap completes (P2P connectivity established) |
| T+30s | Image pull requests start |
| T+60s | Errors trying to connect to old GPU node IP (i/o timeout) |
| T+90s | Falls back to other nodes or ACR |
| T+111s | Image pull completes (should be ~10s via P2P) |
Impact
- GPU image: 1m51s pull time (should be ~10s from P2P cache)
Current Workaround
Restart the entire Spegel DaemonSet after any scale-down event to clear DHT state:

```sh
kubectl rollout restart daemonset/spegel -n spegel
```

This is not ideal for production auto-scaling: it is one more operational step to manage, the rollout takes time across every node, and the restarts generate monitoring noise.
Feature Request: Configurable DHT Record TTL
According to the FAQ, Spegel advertises images with a hardcoded 10-minute TTL. For auto-scaling scenarios where nodes frequently come and go, this causes:
- Stale entries being present in the DHT almost all of the time
- ~30-second delays per stale peer (TCP timeout)
- Degraded P2P performance that defeats the purpose of caching
Requested Enhancement
Please consider making the DHT record TTL configurable via Helm values:
```yaml
spegel:
  # DHT record TTL - how long peers advertise content availability
  # Lower values = faster recovery from stale entries after node removal
  # Higher values = less DHT churn in stable clusters
  dhtRecordTTL: "2m" # Default: 10m
```

Use Case
- Kubernetes clusters with node auto-scaling (GPU nodes, spot instances)
- Environments where nodes are frequently added/removed
- Desire to trade DHT stability for faster recovery from stale entries
A 2-minute TTL or lower would significantly reduce the window where stale entries cause delays, while still providing reasonable caching for stable clusters.
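For illustration, the proposed value could then be set at install or upgrade time. This is only a sketch of the proposal: the `dhtRecordTTL` key does not exist in the chart today, and the chart location is assumed from the project's install instructions.

```sh
# Hypothetical: spegel.dhtRecordTTL is the value proposed in this issue,
# not an existing chart value.
helm upgrade --install spegel oci://ghcr.io/spegel-org/helm-charts/spegel \
  --namespace spegel --create-namespace \
  --set spegel.dhtRecordTTL=2m
```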
Alternative Approaches to Consider
- Graceful shutdown hook - Have Spegel actively remove its DHT records during pod termination (preStop hook with sufficient terminationGracePeriodSeconds); a rough sketch follows this list
- Faster peer health checks - Proactively detect and remove unreachable peers from DHT
- Connection timeout configuration - Allow configuring HTTP client timeout for mirror requests (separate from kernel TCP timeout)
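To make the graceful-shutdown option concrete, below is a rough sketch of where such a hook would sit in the DaemonSet pod spec. This is not how the chart is structured today: the `deregister` subcommand, binary path, image reference, and labels are all illustrative assumptions, not existing Spegel features.

```yaml
# Illustrative sketch only: where a graceful-shutdown hook could live in the
# Spegel DaemonSet. The "deregister" subcommand and binary path are
# hypothetical - Spegel would need to expose something like them.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spegel
  namespace: spegel
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: spegel
  template:
    metadata:
      labels:
        app.kubernetes.io/name: spegel
    spec:
      # Long enough for deregistration to propagate before SIGKILL.
      terminationGracePeriodSeconds: 60
      containers:
        - name: registry
          image: ghcr.io/spegel-org/spegel:v0.6.0
          lifecycle:
            preStop:
              exec:
                # Hypothetical command that would remove this node's DHT
                # records before the pod terminates.
                command: ["/app/spegel", "deregister"]
```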
Happy to provide additional logs or testing if helpful. Thanks for the great project!