
Stale DHT Entries Cause Significant Delays After Node Scale-Down #1193

@tripflex


Spegel version

0.6.0

Kubernetes distribution

AKS

Kubernetes version

1.34.1

CNI

Azure CNI

Describe the bug

We're using Spegel for P2P image caching with GPU nodes that auto-scale based on demand. When a GPU node scales down and a new one scales up, the new node experiences significant delays (1-3+ minutes) pulling images that should be served from the P2P cache in seconds, ultimately making cached pulls as slow as pulling directly from the registry.

Environment

  • Kubernetes: AKS (Azure Kubernetes Service) v1.34.1
  • OS: Ubuntu 24.04 on all nodes
  • Containerd: 2.1.6-1
  • Spegel Version: 0.6.0 (latest at time of writing, installed via Helm chart)
  • CNI: Azure CNI
  • Registry: Azure Container Registry (ACR)

Root Cause Identified

When a node scales down, its Spegel pod terminates, but the other Spegel pods retain DHT entries pointing to the dead node's IP for up to 10 minutes (the hardcoded TTL). When a new node joins and queries the DHT for image layers, it receives stale entries pointing to IPs that no longer exist.

Each TCP connection attempt to a dead IP hangs while the kernel retransmits SYNs that are never answered; the dial only fails once the client-side timeout expires (~30 seconds, per the i/o timeout in the logs below), and only then does Spegel move on to the next peer.
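For what it's worth, one client-side mitigation would be lowering the SYN retry count for the Spegel pods so that connects to dead IPs fail fast in the kernel, well before any application-level timeout. A minimal sketch of the pod-template change, assuming the kubelet permits this namespaced-but-unsafe sysctl via --allowed-unsafe-sysctls=net.ipv4.tcp_syn_retries:

spec:
  template:
    spec:
      securityContext:
        sysctls:
          # With 2 retries the kernel aborts the connect after ~7s
          # (1s + 2s + 4s exponential backoff) instead of ~127s with
          # the default of 6 retries.
          - name: net.ipv4.tcp_syn_retries
            value: "2"

This only narrows the per-peer stall; it does not remove the stale entries themselves.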

Evidence from Logs

New GPU node's Spegel logs after scale-up:

{"msg":"request to mirror failed, retrying with next",
 "mirror":"10.244.0.10:5000",
 "err":"dial tcp 10.244.0.10:5000: i/o timeout"}

The IP 10.244.0.10 belonged to the old GPU node that had been scaled down when load dropped; when load returned and a new node came up, the DHT still held entries pointing to the old one.

Timeline from Our Scale Test

Time     Event
T+0s     New GPU node's Spegel pod starts
T+27s    Bootstrap completes (P2P connectivity established)
T+30s    Image pull requests start
T+60s    Errors trying to connect to the old GPU node's IP (i/o timeout)
T+90s    Falls back to other nodes or ACR
T+111s   Image pull completes (should be ~10s via P2P)

Impact

  • GPU image: 1m51s pull time (should be ~10s from P2P cache)

Current Workaround

Restart entire Spegel DaemonSet after any scale-down event to clear DHT state:

kubectl rollout restart daemonset/spegel -n spegel

This is not ideal for production auto-scaling: it adds another operational step to manage, the full restart takes time, and it generates monitoring noise.


Feature Request: Configurable DHT Record TTL

According to the FAQ, Spegel advertises images with a hardcoded 10-minute TTL. For auto-scaling scenarios where nodes frequently come and go, this causes:

  1. Near-constant presence of stale entries in the DHT
  2. ~30 second delays per stale peer (TCP timeout)
  3. Degraded P2P performance that defeats the purpose of caching

Requested Enhancement

Please consider making the DHT record TTL configurable via Helm values:

spegel:
  # DHT record TTL - how long peers advertise content availability
  # Lower values = faster recovery from stale entries after node removal
  # Higher values = less DHT churn in stable clusters
  dhtRecordTTL: "2m"  # Default: 10m

Use Case

  • Kubernetes clusters with node auto-scaling (GPU nodes, spot instances)
  • Environments where nodes are frequently added/removed
  • Desire to trade DHT stability for faster recovery from stale entries

A 2-minute TTL or lower would significantly reduce the window where stale entries cause delays, while still providing reasonable caching for stable clusters.

Alternative Approaches to Consider

  1. Graceful shutdown hook - Have Spegel actively remove its DHT records during pod termination (preStop hook with a sufficient terminationGracePeriodSeconds); a rough sketch follows this list
  2. Faster peer health checks - Proactively detect and remove unreachable peers from DHT
  3. Connection timeout configuration - Allow configuring HTTP client timeout for mirror requests (separate from kernel TCP timeout)
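
As a rough illustration of approach 1, the hook could live in the DaemonSet pod spec as below. The unregister subcommand (and the /spegel binary path) is hypothetical, since Spegel has no such command today, and provider records already replicated to other peers cannot simply be deleted by the leaving node, so this would likely need upstream support:

spec:
  template:
    spec:
      # Give the hook time to finish before SIGKILL.
      terminationGracePeriodSeconds: 30
      containers:
        - name: spegel
          lifecycle:
            preStop:
              exec:
                # Hypothetical subcommand: withdraw this node's DHT
                # records before its IP disappears.
                command: ["/spegel", "unregister"]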

Happy to provide additional logs or testing if helpful. Thanks for the great project!
