fix(docker): add resource limits, restart policy, fix OTLP export by eldios · Pull Request #5807 · linera-io/linera-protocol

eldios · 2026-03-25T23:42:54Z

Closes #5806
Related: linera-io/linera-infra#766

Summary

Resource limits & configurability

Add configurable CPU/memory limits to all docker compose services via LIMIT_CPUS_* / LIMIT_MEM_* env vars with sane defaults (8-core/32GB machine)
Per-shard resource limits (LIMIT_CPUS_SHARD_0..3 / LIMIT_MEM_SHARD_0..3) for fine-grained tuning
Configurable Docker images (SCYLLA_IMAGE, CADDY_IMAGE, WATCHTOWER_IMAGE)
Configurable ports (WEB_HTTP_PORT, WEB_HTTPS_PORT, PROXY_PORT, SCYLLA_PORT)
Configurable ScyllaDB tuning (SCYLLA_DEVELOPER_MODE, SCYLLA_OVERPROVISIONED, healthcheck params)
Configurable STORAGE_REPLICATION_FACTOR and WATCHTOWER_INTERVAL
Add restart: unless-stopped to all long-running services
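
The LIMIT_CPUS_* / LIMIT_MEM_* variables follow a default-with-override pattern. Below is a minimal shell sketch of how such a default resolves, assuming the compose files use `${VAR:-default}` interpolation (Compose shares this syntax with POSIX parameter expansion). The helper name `resolve_limit` is hypothetical, and the 1.5 CPU / 1536M values are the initial shard defaults quoted in the commit messages, not necessarily the final ones.

```shell
# resolve_limit NAME DEFAULT
# Prints the value of the environment variable NAME, or DEFAULT when NAME is
# unset or empty (same semantics as ${NAME:-DEFAULT}).
resolve_limit() {
  local name="$1" default="$2" value
  eval "value=\${$name:-}"
  printf '%s\n' "${value:-$default}"
}

# Per-shard override: shard 0 gets more CPU, shard 1 keeps the default.
LIMIT_CPUS_SHARD_0=2
unset LIMIT_CPUS_SHARD_1
resolve_limit LIMIT_CPUS_SHARD_0 1.5   # prints 2
resolve_limit LIMIT_CPUS_SHARD_1 1.5   # prints 1.5
```

Setting a variable in .env therefore only changes the service it names; every other service keeps its built-in default.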

Docker & observability

Fix Alloy OTLP metrics export by removing explicit Content-Type header and gzip compression that caused HTTP 400
Replace deprecated watchtower with nickfedor/watchtower:1.15.0
Add ScyllaDB --smp flag (commented out, opt-in for fresh deployments)
Extract local monitoring stack to separate docker-compose.local-monitoring.yml

deploy-validator.sh refactor (8 fixes)

Make key generation idempotent (skip if server.json exists)
Use .env.production.template as single source of truth instead of duplicated heredoc
Fix operation order: genesis download before key generation (fail early, no side effects)
Remove fake ScyllaDB env vars from XFS override (SCYLLA_DIRECT_IO_MODE, SCYLLA_CACHE_SIZE)
Remove dead code: stop_services() never called, cleanup() trap useless
Remove deprecated --remote-image flag
Replace touch side-effect with test -w in validate_xfs_partition
Replace fragile | pipe separator in get_git_info with global variables
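
Two of the fixes above (idempotent key generation and the test -w check) can be sketched in a few lines of shell. This is an illustration under stated assumptions, not the code in deploy-validator.sh: the function names `ensure_keys` and `is_writable_dir` are hypothetical, and the key-generation command is passed in as an argument because the real command is not shown in this PR description.

```shell
# ensure_keys DIR COMMAND [ARGS...]
# Runs COMMAND only when DIR/server.json is missing, so re-running a deploy
# never overwrites existing validator keys.
ensure_keys() {
  local dir="$1"
  shift
  if [ -f "$dir/server.json" ]; then
    echo "server.json already exists in $dir, skipping key generation"
    return 0
  fi
  "$@"
}

# is_writable_dir DIR
# Checks write permission without creating a probe file (unlike `touch`).
is_writable_dir() {
  [ -d "$1" ] && test -w "$1"
}
```

Because `test -w` only inspects permissions, a failed validation leaves no stray files behind on the partition being checked.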

Migration tooling

Add scripts/upgrade-env.sh for safe .env migration on existing validators (preserves all data)
Remove legacy fix-validator-env.sh

Upgrade path for existing validators

# Preview changes (safe, no modifications)
./scripts/upgrade-env.sh --dry-run

# Apply (creates timestamped backup, only modifies .env)
./scripts/upgrade-env.sh

# Review and customize new variables, then restart
cd docker && docker compose up -d

Test plan

Tested live on OVH validator (root@15.204.31.226.sslip.io):

Load average dropped from 14.47 to ~6 after applying limits
OTLP metrics flowing successfully (1.59M+ points, 0 failures)
Shards stable 24h+ with no OOM kills or restarts
upgrade-env.sh tested on production .env: 19 values preserved, 39 new vars added, passwords with = handled correctly
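
The "passwords with = handled correctly" result depends on splitting each line at the first `=` only. The sketch below shows one way to merge a template with an existing .env under that rule. It is not the actual upgrade-env.sh: the function name `merge_env` is hypothetical, keys are assumed to contain only letters, digits, and underscores, and keys present only in the existing file are not carried over in this simplified version.

```shell
# merge_env TEMPLATE EXISTING
# Prints TEMPLATE line by line. For each KEY=value line, if EXISTING already
# defines KEY, the existing line is printed instead, so current values win
# and keys new to the template are added with their template defaults.
merge_env() {
  local template="$1" existing="$2" line key current
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) printf '%s\n' "$line"; continue ;;   # keep blanks and comments
    esac
    key="${line%%=*}"   # everything before the FIRST '=' is the key
    if current="$(grep -m1 -- "^${key}=" "$existing")"; then
      printf '%s\n' "$current"   # whole existing line, later '=' signs intact
    else
      printf '%s\n' "$line"
    fi
  done < "$template"
}
```

A call such as `merge_env .env.production.template .env > .env.new` writes the merged result to a separate file, which keeps the original untouched until the output has been reviewed.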

…port

Add CPU and memory limits to all docker compose services to prevent resource starvation. Limits are configurable via LIMIT_CPUS_* and LIMIT_MEM_* env vars with sane defaults for an 8-core/32GB machine:

- scylla: 2 CPU / 10G (auto-configures with --overprovisioned)
- shard (x4): 1.5 CPU / 1.5G each
- web (caddy): 1 CPU / 1.5G (handles many TLS/gRPC connections)
- proxy: 1 CPU / 512M
- alloy: 0.5 CPU / 512M
- prometheus: 0.5 CPU / 512M
- grafana: 0.5 CPU / 512M
- watchtower: 0.25 CPU / 256M

Fix Alloy OTLP metrics export by removing explicit Content-Type header and gzip compression from the Prometheus OTLP exporter. The otelcol exporter sets these automatically; the explicit values caused HTTP 400 from the remote endpoint.

Add restart: unless-stopped to all long-running services so containers auto-recover from OOM kills and crashes without manual intervention.

Increase shard memory default from 1536M to 2560M. The previous limit caused OOM kills (exit 137) under normal cross-chain message processing load, with shards reaching 1GB+ memory usage during operation.
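
The exit status 137 mentioned above encodes the signal that ended the process: shells report a signal death as 128 plus the signal number, and 137 - 128 = 9 is SIGKILL, the signal the kernel OOM killer sends. A small helper to decode such statuses, with the hypothetical name `signal_from_exit`:

```shell
# signal_from_exit CODE
# Shells report "killed by signal N" as exit status 128 + N.
# Prints N, or nothing when CODE is an ordinary exit status.
signal_from_exit() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  fi
}

signal_from_exit 137   # prints 9 (SIGKILL)
```

SIGKILL on its own does not prove an OOM kill, since any `kill -9` produces the same status, so the container's OOMKilled state is worth checking as well.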

…template

Replace containrrr/watchtower:latest with nickfedor/watchtower:1.15.0. The original containrrr/watchtower was archived on 2025-12-17 and is incompatible with Docker 29+. The nickfedor fork is a drop-in replacement that is actively maintained.

Improve .env.production.template with sensible example values for all configuration options including observability endpoints and resource limits.

Move Prometheus and Grafana from docker-compose.yml to docker-compose.local-monitoring.yml. Most validators use Alloy (docker-compose.alloy.yml) to push metrics to central monitoring and don't need local Prometheus/Grafana.

Usage for local monitoring:

docker compose -f docker-compose.yml \
  -f docker-compose.local-monitoring.yml up -d

Add configurable SCYLLA_SMP env var (default 4) to explicitly set the number of ScyllaDB seastar shards. Without this, ScyllaDB with --overprovisioned ignores Docker CPU cgroup limits and creates one shard per visible host CPU, leading to poor cache utilization and performance degradation.

Move Prometheus and Grafana to docker-compose.local-monitoring.yml. Most validators use Alloy for central monitoring and don't need local dashboards.

- Per-shard resource limits (LIMIT_CPUS_SHARD_0..3 / LIMIT_MEM_SHARD_0..3)
- Configurable images (SCYLLA_IMAGE, CADDY_IMAGE, WATCHTOWER_IMAGE)
- Configurable ports (WEB_HTTP_PORT, WEB_HTTPS_PORT, PROXY_PORT, SCYLLA_PORT)
- ScyllaDB tuning (developer-mode, overprovisioned, healthcheck params)
- Storage replication factor and watchtower interval
- Update deploy-validator.sh to include all new vars in generated .env
- Add scripts/upgrade-env.sh for safe .env migration on existing validators
- Remove legacy fix-validator-env.sh (.deployment-info migration)

- Make key generation idempotent (skip if server.json exists)
- Use .env.production.template instead of duplicated heredoc
- Fix operation order: genesis download before key generation
- Remove fake ScyllaDB env vars from XFS override (SCYLLA_DIRECT_IO_MODE, SCYLLA_CACHE_SIZE)
- Remove dead code: stop_services() never called, cleanup() trap useless
- Remove deprecated --remote-image flag
- Replace touch side-effect with test -w in validate_xfs_partition
- Replace fragile pipe separator in get_git_info with global variables

…toring

Backport configurable shard cache sizes from main to testnet_conway:

- Add --execution-state-cache-size (default: 5000, upstream: 10000)
- Add --block-cache-size (default: 2500, upstream: 5000)

Unbounded upstream defaults cause shards to grow to 10+ GB and trigger cgroup OOM kills (7+ kills in 48h on production validator).

Add container resource monitoring via cAdvisor:

- Add cAdvisor service to docker-compose.alloy.yml
- Add Alloy scraping with labelkeep whitelist (name, job, instance only) to prevent OTLP 400 errors from verbose cAdvisor labels
- Add "Container Resources" section to Grafana Node Health dashboard (memory usage vs limit, memory %, CPU, gauges with 70%/90% thresholds)

Prevent DNS resolution failures and connection exhaustion under high load by setting explicit file descriptor limits. Default 524288 (512K), configurable via ULIMIT_NOFILE env var.
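
To confirm that a process actually runs with the intended descriptor limit, its soft limit can be compared against the configured target. The sketch below does that for the current shell. The helper names `nofile_target` and `nofile_ok` are hypothetical. The default of 524288 (512 × 1024) and the ULIMIT_NOFILE variable come from the commit message.

```shell
# Target limit: ULIMIT_NOFILE from the environment, else the PR's default.
nofile_target() {
  printf '%s\n' "${ULIMIT_NOFILE:-524288}"
}

# nofile_ok CURRENT
# Succeeds when CURRENT (a number or "unlimited") meets the target.
nofile_ok() {
  local current="$1" target
  target="$(nofile_target)"
  [ "$current" = "unlimited" ] || [ "$current" -ge "$target" ]
}

# Report the soft limit of the current shell against the target.
if nofile_ok "$(ulimit -n)"; then
  echo "nofile OK: $(ulimit -n) >= $(nofile_target)"
else
  echo "nofile too low: $(ulimit -n) < $(nofile_target)"
fi
```

Running the same check inside a container (for example through `docker compose exec`) shows the limit the service sees, which can differ from the host shell's limit.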

- Resolve remote branch name when git is in detached HEAD state (e.g. checked out at origin/testnet_conway without a local branch)
- Fall back to existing genesis.json when download fails instead of exiting, with clear guidance on how to fix the URL

ScyllaDB requires minimum 1 GiB per shard. Without --smp, it creates one shard per available CPU core (16 on current hosts). Previous 10G default gave only ~492 MiB/shard, causing ScyllaDB to refuse to start.
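
The arithmetic behind this failure can be checked before starting the container. The sketch below divides the memory limit evenly across shards, which is a naive upper bound: for 10G and 16 shards it gives 640 MiB, already under the 1 GiB minimum, while the ~492 MiB reported above is lower still, presumably because not all container memory is available to the shards. The function names are hypothetical and 10G is read as 10240 MiB.

```shell
# per_shard_mib MEM_MIB SHARDS
# Naive upper bound on memory per ScyllaDB shard, in whole MiB.
per_shard_mib() {
  echo $(( $1 / $2 ))
}

# enough_memory MEM_MIB SHARDS
# Succeeds when the naive per-shard share reaches the 1 GiB minimum.
# Passing this check is necessary, not sufficient.
enough_memory() {
  [ "$(per_shard_mib "$1" "$2")" -ge 1024 ]
}

per_shard_mib 10240 16   # prints 640
per_shard_mib 20480 16   # prints 1280
```

Lowering the shard count has the same effect as raising memory: by the same division, 10240 MiB across 4 shards gives 2560 MiB each.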

The .env file was created after docker compose up, so compose used hardcoded defaults instead of template values. Move .env generation before start_services so all configured values are picked up.

Caddy's Caddyfile uses {$ACME_EMAIL} for Let's Encrypt registration, but the variable was not passed in the container environment. This caused Caddy to fall back to admin@example.com which Let's Encrypt rejects as invalid, preventing TLS certificate issuance.

Previous defaults (11.25 total CPU) under-utilized 16-core machines. New defaults allocate 14.75 CPU:

- ScyllaDB: 2 → 4 CPU (most CPU-intensive component)
- Shards: 1.5 → 2 CPU each (handle traffic spikes)
- Proxy: 1 → 1.5 CPU (ingress bottleneck)

Allocate ~60.5 GiB across containers (out of 64 GiB), leaving ~3.5 GiB for OS/kernel page cache:

- ScyllaDB: 20G → 30G (primary DB benefits from RAM cache + memtables)
- Shards: 2.5G → 6G each (observed peak ~10G, 6G provides headroom)
- Proxy: 512M → 4G (was saturating at 99% of 512M)
- Web (Caddy): 1.5G → 3G (was at 88% with gRPC connection state)
- Alloy: 512M → 1G (more buffer for OTLP/log forwarding)

eldios mentioned this pull request Mar 26, 2026

fix(docker): add resource limits, restart policy, fix OTLP, update watchtower #5808

Open

eldios self-assigned this Mar 26, 2026

eldios force-pushed the fix/docker-compose-resource-limits branch 2 times, most recently from c5da300 to 0269a72 on March 26, 2026 00:23

eldios added 9 commits April 16, 2026 03:21

feat(docker): add configurable ulimits nofile to all services

f9d5012


eldios force-pushed the fix/docker-compose-resource-limits branch from 8e03691 to f9d5012 on April 16, 2026 01:31

eldios added 6 commits April 16, 2026 03:45

fix(docker): increase ScyllaDB memory default to 20G

60b9765


fix(scripts): generate .env before starting docker compose

9e8e75d


eldios wants to merge 15 commits into testnet_conway from fix/docker-compose-resource-limits
