Description
Spegel version
0.6.0
Kubernetes distribution
AKS
Kubernetes version
1.34.1
CNI
Azure CNI
Describe the bug
We're using Spegel for P2P image caching with GPU nodes that auto-scale based on demand. When a GPU node scales down and a new one later scales up, the new node experiences significant delays (1-3+ minutes) pulling images that should be served from the P2P cache in seconds, making cached pulls as slow as pulling directly from the registry.
Environment
- Kubernetes: AKS (Azure Kubernetes Service) v1.34.1
- OS: Ubuntu 24.04 on all nodes
- Containerd: 2.1.6-1
- Spegel Version: 0.6.0 (latest, via Helm chart)
- CNI: Azure CNI
- Registry: Azure Container Registry (ACR)
Root Cause Identified
When a node scales down, its Spegel pod terminates, but other Spegel pods retain DHT entries pointing to the dead node's IP for up to 10 minutes (the hardcoded TTL). When a new node joins and queries the DHT for image layers, it receives stale entries pointing to IPs that no longer exist.
Each TCP connection attempt to a dead IP triggers kernel-level TCP retransmit timeouts (~30 seconds) before failing and moving to the next peer.
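For context, the stall can be reproduced with a plain HTTP dial to the departed peer's address. This is a minimal sketch, not from the original test run: the IP and port are taken from the log entry in the next section, and the 30s budget mirrors the observed dial timeout.

```sh
# From a pod or node that still holds the stale DHT entry: nothing answers
# (or resets) the SYN at the old address, so the dial hangs for the full
# connect timeout instead of failing fast.
time curl --connect-timeout 30 http://10.244.0.10:5000/v2/
```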
Evidence from Logs
New GPU node's Spegel logs after scale-up:
{"msg":"request to mirror failed, retrying with next",
"mirror":"10.244.0.10:5000",
"err":"dial tcp 10.244.0.10:5000: i/o timeout"}The IP 10.244.0.10 was the old GPU node that was scaled down after load went down, and when it went a new one went back up due to load, the DHT still had entries pointing to the old one.
Timeline from Our Scale Test
| Time | Event |
|---|---|
| T+0s | New GPU node's Spegel pod starts |
| T+27s | Bootstrap completes (P2P connectivity established) |
| T+30s | Image pull requests start |
| T+60s | Errors trying to connect to old GPU node IP (i/o timeout) |
| T+90s | Falls back to other nodes or ACR |
| T+111s | Image pull completes (should be ~10s via P2P) |
Impact
- GPU image: 1m51s pull time (should be ~10s from P2P cache)
Current Workaround
Restart the entire Spegel DaemonSet after any scale-down event to clear DHT state:

```sh
kubectl rollout restart daemonset/spegel -n spegel
```

This is not ideal for production auto-scaling: it is one more operational step to manage, the rollout takes time across every node, and the restarts generate monitoring noise.
Feature Request: Configurable DHT Record TTL
According to the FAQ, Spegel advertises images with a hardcoded 10-minute TTL. For auto-scaling scenarios where nodes frequently come and go, this causes:
- Stale entries being present in the DHT almost all of the time
- ~30-second delays per stale peer (TCP timeout)
- Degraded P2P performance that defeats the purpose of caching
Requested Enhancement
Please consider making the DHT record TTL configurable via Helm values:
```yaml
spegel:
  # DHT record TTL - how long peers advertise content availability
  # Lower values = faster recovery from stale entries after node removal
  # Higher values = less DHT churn in stable clusters
  dhtRecordTTL: "2m" # Default: 10m
```

Use Case
- Kubernetes clusters with node auto-scaling (GPU nodes, spot instances)
- Environments where nodes are frequently added/removed
- Desire to trade DHT stability for faster recovery from stale entries
A 2-minute TTL or lower would significantly reduce the window where stale entries cause delays, while still providing reasonable caching for stable clusters.
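For illustration, the proposed value could then be set at install or upgrade time. This is only a sketch of the proposal: the `dhtRecordTTL` key does not exist in the chart today, and the chart location is assumed from the project's install instructions.

```sh
# Hypothetical: spegel.dhtRecordTTL is the value proposed in this issue,
# not an existing chart value.
helm upgrade --install spegel oci://ghcr.io/spegel-org/helm-charts/spegel \
  --namespace spegel --create-namespace \
  --set spegel.dhtRecordTTL=2m
```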
Alternative Approaches to Consider
- Graceful shutdown hook - Have Spegel actively remove its DHT records during pod termination (preStop hook with sufficient terminationGracePeriodSeconds); a rough sketch follows this list
- Faster peer health checks - Proactively detect and remove unreachable peers from DHT
- Connection timeout configuration - Allow configuring HTTP client timeout for mirror requests (separate from kernel TCP timeout)
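To make the graceful-shutdown option concrete, below is a rough sketch of where such a hook would sit in the DaemonSet pod spec. This is not how the chart is structured today: the `deregister` subcommand, binary path, image reference, and labels are all illustrative assumptions, not existing Spegel features.

```yaml
# Illustrative sketch only: where a graceful-shutdown hook could live in the
# Spegel DaemonSet. The "deregister" subcommand and binary path are
# hypothetical - Spegel would need to expose something like them.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spegel
  namespace: spegel
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: spegel
  template:
    metadata:
      labels:
        app.kubernetes.io/name: spegel
    spec:
      # Long enough for deregistration to propagate before SIGKILL.
      terminationGracePeriodSeconds: 60
      containers:
        - name: registry
          image: ghcr.io/spegel-org/spegel:v0.6.0
          lifecycle:
            preStop:
              exec:
                # Hypothetical command that would remove this node's DHT
                # records before the pod terminates.
                command: ["/app/spegel", "deregister"]
```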
Happy to provide additional logs or testing if helpful. Thanks for the great project!