fix(docker): add resource limits, restart policy, fix OTLP export #5807

Open
eldios wants to merge 15 commits into testnet_conway from fix/docker-compose-resource-limits

eldios (Collaborator) commented Mar 25, 2026

Closes #5806
Related: linera-io/linera-infra#766

Summary

Resource limits & configurability

  • Add configurable CPU/memory limits to all docker compose services via LIMIT_CPUS_* / LIMIT_MEM_* env vars with sane defaults (8-core/32GB machine)
  • Per-shard resource limits (LIMIT_CPUS_SHARD_0..3 / LIMIT_MEM_SHARD_0..3) for fine-grained tuning
  • Configurable Docker images (SCYLLA_IMAGE, CADDY_IMAGE, WATCHTOWER_IMAGE)
  • Configurable ports (WEB_HTTP_PORT, WEB_HTTPS_PORT, PROXY_PORT, SCYLLA_PORT)
  • Configurable ScyllaDB tuning (SCYLLA_DEVELOPER_MODE, SCYLLA_OVERPROVISIONED, healthcheck params)
  • Configurable STORAGE_REPLICATION_FACTOR and WATCHTOWER_INTERVAL
  • Add restart: unless-stopped to all long-running services

Docker & observability

  • Fix Alloy OTLP metrics export by removing explicit Content-Type header and gzip compression that caused HTTP 400
  • Replace the archived containrrr/watchtower image with nickfedor/watchtower:1.15.0
  • Add ScyllaDB --smp flag (commented out, opt-in for fresh deployments)
  • Extract local monitoring stack to separate docker-compose.local-monitoring.yml

deploy-validator.sh refactor (8 fixes)

  • Make key generation idempotent (skip if server.json exists)
  • Use .env.production.template as single source of truth instead of duplicated heredoc
  • Fix operation order: genesis download before key generation (fail early, no side effects)
  • Remove fake ScyllaDB env vars from XFS override (SCYLLA_DIRECT_IO_MODE, SCYLLA_CACHE_SIZE)
  • Remove dead code: stop_services() was never called and the cleanup() trap did nothing useful
  • Remove deprecated --remote-image flag
  • Replace touch side-effect with test -w in validate_xfs_partition
  • Replace fragile | pipe separator in get_git_info with global variables
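
Two of the fixes above (idempotent key generation and the test -w check) can be illustrated with a minimal sketch. The helper names and the way the generation command is passed in are assumptions made for illustration; they are not taken from deploy-validator.sh.

```shell
# Hypothetical helpers; names and arguments are illustrative only.

# Run the supplied key-generation command only when the key file is missing or empty.
ensure_server_key() {
  key_file="$1"; shift
  if [ -s "$key_file" ]; then
    echo "keeping existing $key_file"
    return 0
  fi
  "$@"   # the caller supplies the real generation command
}

# Check that a directory is writable without creating a probe file (no touch side effect).
is_writable_dir() {
  [ -d "$1" ] && [ -w "$1" ]
}
```

Running ensure_server_key twice with the same path invokes the generation command at most once, which is the property the refactor is after.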

Migration tooling

  • Add scripts/upgrade-env.sh for safe .env migration on existing validators (preserves all data)
  • Remove legacy fix-validator-env.sh

Upgrade path for existing validators

# Preview changes (safe, no modifications)
./scripts/upgrade-env.sh --dry-run

# Apply (creates timestamped backup, only modifies .env)
./scripts/upgrade-env.sh

# Review and customize new variables, then restart
cd docker && docker compose up -d

Test plan

Tested live on OVH validator (root@15.204.31.226.sslip.io):

  • Load average dropped from 14.47 to ~6 after applying limits
  • OTLP metrics flowing successfully (1.59M+ points, 0 failures)
  • Shards stable 24h+ with no OOM kills or restarts
  • upgrade-env.sh tested on production .env: 19 values preserved, 39 new vars added, passwords with = handled correctly
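
The note about passwords containing = comes down to splitting each KEY=VALUE line at the first = only. The sketch below shows one way to do that with POSIX parameter expansion; env_get and DB_PASSWORD are hypothetical names, and this is not the code in upgrade-env.sh.

```shell
# Hypothetical helper: print the value of KEY from an env file.
# Each line is split at the FIRST '=' so that values may themselves contain '='.
env_get() {
  key="$1"; file="$2"
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;   # skip blank lines and comments
      *=*) ;;                # line has a separator, keep going
      *) continue ;;         # no '=', not an assignment
    esac
    if [ "${line%%=*}" = "$key" ]; then
      printf '%s\n' "${line#*=}"   # everything after the first '='
      return 0
    fi
  done < "$file"
  return 1
}
```

For a line such as DB_PASSWORD=p@ss=w0rd==, `${line%%=*}` yields the key and `${line#*=}` yields the full value including the trailing = characters.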

eldios self-assigned this Mar 26, 2026
eldios force-pushed the fix/docker-compose-resource-limits branch 2 times, most recently from c5da300 to 0269a72 on March 26, 2026 00:23
eldios added 9 commits April 16, 2026 03:21
…port

Add CPU and memory limits to all docker compose services to prevent
resource starvation. Limits are configurable via LIMIT_CPUS_* and
LIMIT_MEM_* env vars with sane defaults for an 8-core/32GB machine:

- scylla: 2 CPU / 10G (auto-configures with --overprovisioned)
- shard (x4): 1.5 CPU / 1.5G each
- web (caddy): 1 CPU / 1.5G (handles many TLS/gRPC connections)
- proxy: 1 CPU / 512M
- alloy: 0.5 CPU / 512M
- prometheus: 0.5 CPU / 512M
- grafana: 0.5 CPU / 512M
- watchtower: 0.25 CPU / 256M

Fix Alloy OTLP metrics export by removing explicit Content-Type header
and gzip compression from the Prometheus OTLP exporter. The otelcol
exporter sets these automatically; the explicit values caused HTTP 400
from the remote endpoint.

Add restart: unless-stopped to all long-running services so containers
auto-recover from OOM kills and crashes without manual intervention.

Increase shard memory default from 1536M to 2560M. The previous limit
caused OOM kills (exit 137) under normal cross-chain message processing
load, with shards reaching 1GB+ memory usage during operation.
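
The exit status quoted above follows the common convention that a status above 128 encodes 128 plus the number of the terminating signal, so 137 corresponds to signal 9 (SIGKILL). A minimal sketch of that decoding, using a hypothetical helper name:

```shell
# Hypothetical helper: print the signal number encoded in an exit status,
# or 0 when the status does not indicate termination by a signal.
exit_signal() {
  status="$1"
  if [ "$status" -gt 128 ]; then
    echo $(( status - 128 ))
  else
    echo 0
  fi
}
```

SIGKILL has sources other than the kernel OOM killer, so a 137 on its own is only a hint; `docker inspect --format '{{.State.OOMKilled}}' <container>` reports whether Docker recorded an OOM kill for that container.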

…template

Replace containrrr/watchtower:latest with nickfedor/watchtower:1.15.0.
The original containrrr/watchtower was archived on 2025-12-17 and is
incompatible with Docker 29+. The nickfedor fork is a drop-in
replacement that is actively maintained.

Improve .env.production.template with sensible example values for all
configuration options including observability endpoints and resource
limits.

Move Prometheus and Grafana from docker-compose.yml to
docker-compose.local-monitoring.yml. Most validators use Alloy
(docker-compose.alloy.yml) to push metrics to central monitoring
and don't need local Prometheus/Grafana.

Usage for local monitoring:
  docker compose -f docker-compose.yml \
    -f docker-compose.local-monitoring.yml up -d

Add configurable SCYLLA_SMP env var (default 4) to explicitly set
the number of ScyllaDB seastar shards. Without this, ScyllaDB with
--overprovisioned ignores Docker CPU cgroup limits and creates one
shard per visible host CPU, leading to poor cache utilization and
performance degradation.
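
As a rough illustration of the default described above, the sketch below builds the --smp argument from SCYLLA_SMP and falls back to 4 when the variable is unset. The `${VAR:-default}` form is also the syntax Docker Compose accepts for interpolation; the function name and the validation step are assumptions, not part of this PR.

```shell
# Hypothetical helper: emit the ScyllaDB --smp argument, defaulting to 4 shards.
scylla_smp_arg() {
  smp="${SCYLLA_SMP:-4}"
  case "$smp" in
    *[!0-9]*|0)
      echo "SCYLLA_SMP must be a positive integer" >&2
      return 1 ;;
  esac
  printf '%s %s\n' '--smp' "$smp"
}
```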

Move Prometheus and Grafana to docker-compose.local-monitoring.yml.
Most validators use Alloy for central monitoring and don't need local
dashboards.

- Per-shard resource limits (LIMIT_CPUS_SHARD_0..3 / LIMIT_MEM_SHARD_0..3)
- Configurable images (SCYLLA_IMAGE, CADDY_IMAGE, WATCHTOWER_IMAGE)
- Configurable ports (WEB_HTTP_PORT, WEB_HTTPS_PORT, PROXY_PORT, SCYLLA_PORT)
- ScyllaDB tuning (developer-mode, overprovisioned, healthcheck params)
- Storage replication factor and watchtower interval
- Update deploy-validator.sh to include all new vars in generated .env
- Add scripts/upgrade-env.sh for safe .env migration on existing validators
- Remove legacy fix-validator-env.sh (.deployment-info migration)

- Make key generation idempotent (skip if server.json exists)
- Use .env.production.template instead of duplicated heredoc
- Fix operation order: genesis download before key generation
- Remove fake ScyllaDB env vars from XFS override (SCYLLA_DIRECT_IO_MODE, SCYLLA_CACHE_SIZE)
- Remove dead code: stop_services() never called, cleanup() trap useless
- Remove deprecated --remote-image flag
- Replace touch side-effect with test -w in validate_xfs_partition
- Replace fragile pipe separator in get_git_info with global variables

…toring

Backport configurable shard cache sizes from main to testnet_conway:
- Add --execution-state-cache-size (default: 5000, upstream: 10000)
- Add --block-cache-size (default: 2500, upstream: 5000)
Unbounded upstream defaults cause shards to grow to 10+ GB and trigger
cgroup OOM kills (7+ kills in 48h on production validator).

Add container resource monitoring via cAdvisor:
- Add cAdvisor service to docker-compose.alloy.yml
- Add Alloy scraping with labelkeep whitelist (name, job, instance only)
  to prevent OTLP 400 errors from verbose cAdvisor labels
- Add "Container Resources" section to Grafana Node Health dashboard
  (memory usage vs limit, memory %, CPU, gauges with 70%/90% thresholds)

Prevent DNS resolution failures and connection exhaustion under high
load by setting explicit file descriptor limits. Default 524288 (512K),
configurable via ULIMIT_NOFILE env var.

eldios force-pushed the fix/docker-compose-resource-limits branch from 8e03691 to f9d5012 on April 16, 2026 01:31
eldios added 6 commits April 16, 2026 03:45
- Resolve remote branch name when git is in detached HEAD state
  (e.g. checked out at origin/testnet_conway without a local branch)
- Fall back to existing genesis.json when download fails instead of
  exiting, with clear guidance on how to fix the URL
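
The genesis fallback described above could look roughly like the sketch below. It is not the code in deploy-validator.sh: fetch_genesis is a hypothetical name, and the downloader is passed in as a command so the logic can be exercised without network access.

```shell
# Hypothetical helper: refresh DEST from a download command, but keep an
# existing non-empty DEST when the download fails.
fetch_genesis() {
  dest="$1"; shift
  tmp="$dest.tmp"
  if "$@" > "$tmp" && [ -s "$tmp" ]; then
    mv "$tmp" "$dest"
    echo "downloaded"
  elif [ -s "$dest" ]; then
    rm -f "$tmp"
    echo "download failed; using existing $dest" >&2
    echo "fallback"
  else
    rm -f "$tmp"
    echo "download failed and no existing $dest; check the genesis URL" >&2
    return 1
  fi
}
```

A call might look like `fetch_genesis genesis.json curl -fsSL "$GENESIS_URL"`, where GENESIS_URL is an assumed variable name. Writing to a temporary file first means a failed download never truncates a good genesis.json.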

ScyllaDB requires minimum 1 GiB per shard. Without --smp, it creates
one shard per available CPU core (16 on current hosts). Previous 10G
default gave only ~492 MiB/shard, causing ScyllaDB to refuse to start.
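
The constraint above can be turned into a simple pre-flight check. The sketch below uses only the 1 GiB-per-shard floor stated in the commit message and whole-GiB inputs; it deliberately ignores the memory ScyllaDB sets aside for other purposes, so passing it is necessary but may not be sufficient. The function name is hypothetical.

```shell
# Hypothetical check: succeed when MEM_GIB provides at least 1 GiB per shard.
# Usage: scylla_mem_ok <shard_count> <mem_gib>
scylla_mem_ok() {
  shards="$1"; mem_gib="$2"
  [ "$mem_gib" -ge "$shards" ]
}
```

With 16 shards, a 10 GiB limit fails the check, which matches the startup failure described above; capping the shard count at 4 with --smp makes the same 10 GiB limit pass.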

The .env file was created after docker compose up, so compose used
hardcoded defaults instead of template values. Move .env generation
before start_services so all configured values are picked up.

Caddy's Caddyfile uses {$ACME_EMAIL} for Let's Encrypt registration,
but the variable was not passed in the container environment. This
caused Caddy to fall back to admin@example.com which Let's Encrypt
rejects as invalid, preventing TLS certificate issuance.
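
A failure like this is easier to catch before the containers start. The following is a hedged sketch of a pre-flight check, not something this PR adds: the function name is hypothetical and the pattern match is only a coarse sanity test, not real email validation.

```shell
# Hypothetical pre-flight check: refuse to continue when ACME_EMAIL is unset
# or still points at the placeholder domain that Let's Encrypt rejects.
check_acme_email() {
  case "${ACME_EMAIL:-}" in
    '')
      echo "ACME_EMAIL is not set" >&2
      return 1 ;;
    *@example.com)
      echo "ACME_EMAIL uses the example.com placeholder" >&2
      return 1 ;;
    *@*.*)
      return 0 ;;
    *)
      echo "ACME_EMAIL does not look like an email address" >&2
      return 1 ;;
  esac
}
```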

Previous defaults (11.25 total CPU) under-utilized 16-core machines.
New defaults allocate 14.75 CPU:
- ScyllaDB: 2 → 4 CPU (most CPU-intensive component)
- Shards: 1.5 → 2 CPU each (handle traffic spikes)
- Proxy: 1 → 1.5 CPU (ingress bottleneck)

Allocate ~60.5 GiB across containers (out of 64 GiB), leaving ~3.5 GiB
for OS/kernel page cache:
- ScyllaDB: 20G → 30G (primary DB benefits from RAM cache + memtables)
- Shards: 2.5G → 6G each (observed peak ~10G, 6G provides headroom)
- Proxy: 512M → 4G (was saturating at 99% of 512M)
- Web (Caddy): 1.5G → 3G (was at 88% with gRPC connection state)
- Alloy: 512M → 1G (more buffer for OTLP/log forwarding)