fix(docker): add resource limits, restart policy, fix OTLP export #5807
Open
eldios wants to merge 15 commits into testnet_conway from
c5da300 to 0269a72
…port
Add CPU and memory limits to all docker compose services to prevent resource starvation. Limits are configurable via LIMIT_CPUS_* and LIMIT_MEM_* env vars with sane defaults for an 8-core/32GB machine:
- scylla: 2 CPU / 10G (auto-configures with --overprovisioned)
- shard (x4): 1.5 CPU / 1.5G each
- web (caddy): 1 CPU / 1.5G (handles many TLS/gRPC connections)
- proxy: 1 CPU / 512M
- alloy: 0.5 CPU / 512M
- prometheus: 0.5 CPU / 512M
- grafana: 0.5 CPU / 512M
- watchtower: 0.25 CPU / 256M
Fix Alloy OTLP metrics export by removing the explicit Content-Type header and gzip compression from the Prometheus OTLP exporter. The otelcol exporter sets these automatically; the explicit values caused HTTP 400 from the remote endpoint.
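The "configurable with sane defaults" behaviour above can be illustrated with a default-with-override expansion. This is a minimal sketch, not the PR's compose file: the variable names LIMIT_CPUS_SCYLLA and LIMIT_MEM_SCYLLA and the helper with_default are assumptions that merely follow the LIMIT_CPUS_* / LIMIT_MEM_* pattern. Docker Compose interpolation accepts the same `${VAR:-default}` form.

```shell
# Sketch: fall back to a default when a limit is unset or empty.
# with_default is a hypothetical helper; the LIMIT_*_SCYLLA names are assumed.
with_default() {
  # $1 = value from the environment (may be empty), $2 = default
  printf '%s\n' "${1:-$2}"
}

echo "scylla cpus: $(with_default "${LIMIT_CPUS_SCYLLA:-}" 2)"
echo "scylla mem:  $(with_default "${LIMIT_MEM_SCYLLA:-}" 10G)"
```

An operator would then override a single value in .env and leave the rest at their defaults.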
Add restart: unless-stopped to all long-running services so containers auto-recover from OOM kills and crashes without manual intervention. Increase shard memory default from 1536M to 2560M. The previous limit caused OOM kills (exit 137) under normal cross-chain message processing load, with shards reaching 1GB+ memory usage during operation.
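For readers unfamiliar with exit 137: by shell convention, a status above 128 means the process was terminated by signal (status minus 128), and 137 is 128 + 9, i.e. SIGKILL, which is what the kernel OOM killer sends. A small sketch of that decoding follows; describe_exit is a hypothetical helper, not part of this PR.

```shell
# Sketch: decode a container exit status.
# Statuses above 128 conventionally encode "killed by signal (status - 128)".
describe_exit() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with status $code"
  fi
}

describe_exit 137   # prints: killed by signal 9
```

A status of 137 alone does not prove an OOM kill, since any SIGKILL produces it; the container's OOMKilled state or the kernel log is the stronger evidence.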
…template
Replace containrrr/watchtower:latest with nickfedor/watchtower:1.15.0. The original containrrr/watchtower was archived on 2025-12-17 and is incompatible with Docker 29+. The nickfedor fork is a drop-in replacement that is actively maintained.
Improve .env.production.template with sensible example values for all configuration options, including observability endpoints and resource limits.
Move Prometheus and Grafana from docker-compose.yml to
docker-compose.local-monitoring.yml. Most validators use Alloy
(docker-compose.alloy.yml) to push metrics to central monitoring
and don't need local Prometheus/Grafana.
Usage for local monitoring:
docker compose -f docker-compose.yml \
-f docker-compose.local-monitoring.yml up -d
Add configurable SCYLLA_SMP env var (default 4) to explicitly set the number of ScyllaDB seastar shards. Without this, ScyllaDB with --overprovisioned ignores Docker CPU cgroup limits and creates one shard per visible host CPU, leading to poor cache utilization and performance degradation. Move Prometheus and Grafana to docker-compose.local-monitoring.yml. Most validators use Alloy for central monitoring and don't need local dashboards.
- Per-shard resource limits (LIMIT_CPUS_SHARD_0..3 / LIMIT_MEM_SHARD_0..3)
- Configurable images (SCYLLA_IMAGE, CADDY_IMAGE, WATCHTOWER_IMAGE)
- Configurable ports (WEB_HTTP_PORT, WEB_HTTPS_PORT, PROXY_PORT, SCYLLA_PORT)
- ScyllaDB tuning (developer-mode, overprovisioned, healthcheck params)
- Storage replication factor and watchtower interval
- Update deploy-validator.sh to include all new vars in generated .env
- Add scripts/upgrade-env.sh for safe .env migration on existing validators
- Remove legacy fix-validator-env.sh (.deployment-info migration)
- Make key generation idempotent (skip if server.json exists)
- Use .env.production.template instead of duplicated heredoc
- Fix operation order: genesis download before key generation
- Remove fake ScyllaDB env vars from XFS override (SCYLLA_DIRECT_IO_MODE, SCYLLA_CACHE_SIZE)
- Remove dead code: stop_services() never called, cleanup() trap useless
- Remove deprecated --remote-image flag
- Replace touch side-effect with test -w in validate_xfs_partition
- Replace fragile pipe separator in get_git_info with global variables
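Two of the fixes above, idempotent key generation and the side-effect-free writability check, can be sketched as follows. This is an illustration under stated assumptions, not the code in deploy-validator.sh: generate_keys is a hypothetical stand-in for the real key generation step, and ensure_keys and is_writable_dir are invented names.

```shell
# Hypothetical stand-in for the real key generation command.
generate_keys() { echo '{}' > "$1"; }

# Idempotent wrapper: do nothing if the config file is already present.
ensure_keys() {
  config="$1"
  if [ -f "$config" ]; then
    echo "skip: $config already exists"
    return 0
  fi
  generate_keys "$config"
  echo "generated: $config"
}

# Writability check that creates no file, unlike probing with `touch`.
is_writable_dir() {
  [ -d "$1" ] && [ -w "$1" ]
}
```

Running ensure_keys twice on the same path generates the file once and skips the second time, so re-running the deploy script does not overwrite existing keys.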
…toring
Backport configurable shard cache sizes from main to testnet_conway:
- Add --execution-state-cache-size (default: 5000, upstream: 10000)
- Add --block-cache-size (default: 2500, upstream: 5000)
Unbounded upstream defaults cause shards to grow to 10+ GB and trigger cgroup OOM kills (7+ kills in 48h on production validator).
Add container resource monitoring via cAdvisor:
- Add cAdvisor service to docker-compose.alloy.yml
- Add Alloy scraping with labelkeep whitelist (name, job, instance only) to prevent OTLP 400 errors from verbose cAdvisor labels
- Add "Container Resources" section to Grafana Node Health dashboard (memory usage vs limit, memory %, CPU, gauges with 70%/90% thresholds)
Prevent DNS resolution failures and connection exhaustion under high load by setting explicit file descriptor limits. Default 524288 (512K), configurable via ULIMIT_NOFILE env var.
8e03691 to f9d5012
- Resolve remote branch name when git is in detached HEAD state (e.g. checked out at origin/testnet_conway without a local branch)
- Fall back to existing genesis.json when download fails instead of exiting, with clear guidance on how to fix the URL
ScyllaDB requires minimum 1 GiB per shard. Without --smp, it creates one shard per available CPU core (16 on current hosts). Previous 10G default gave only ~492 MiB/shard, causing ScyllaDB to refuse to start.
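The per-shard arithmetic can be checked with a few lines of shell. This is a rough sketch with hypothetical helper names; it uses plain division of the container limit by the shard count, which is an upper bound. ScyllaDB keeps part of the memory for itself, which is presumably why the figure reported above is lower than the raw quotient.

```shell
# Sketch: compare a memory budget against the 1 GiB per shard minimum.
# mem_per_shard_mib TOTAL_MIB SHARDS -> integer MiB per shard (upper bound).
mem_per_shard_mib() {
  echo $(( $1 / $2 ))
}

# Succeeds only if each shard would get at least 1024 MiB.
scylla_mem_ok() {
  [ "$(mem_per_shard_mib "$1" "$2")" -ge 1024 ]
}

mem_per_shard_mib 10240 16   # prints: 640
mem_per_shard_mib 10240 4    # prints: 2560
```

With 16 shards a 10G limit falls below the minimum even before any reservation, while an explicit shard count of 4 leaves comfortable room.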
The .env file was created after docker compose up, so compose used hardcoded defaults instead of template values. Move .env generation before start_services so all configured values are picked up.
Caddy's Caddyfile uses {$ACME_EMAIL} for Let's Encrypt registration,
but the variable was not passed in the container environment. This
caused Caddy to fall back to admin@example.com which Let's Encrypt
rejects as invalid, preventing TLS certificate issuance.
Previous defaults (11.25 total CPU) under-utilized 16-core machines. New defaults allocate 14.75 CPU:
- ScyllaDB: 2 → 4 CPU (most CPU-intensive component)
- Shards: 1.5 → 2 CPU each (handle traffic spikes)
- Proxy: 1 → 1.5 CPU (ingress bottleneck)
Allocate ~60.5 GiB across containers (out of 64 GiB), leaving ~3.5 GiB for OS/kernel page cache:
- ScyllaDB: 20G → 30G (primary DB benefits from RAM cache + memtables)
- Shards: 2.5G → 6G each (observed peak ~10G, 6G provides headroom)
- Proxy: 512M → 4G (was saturating at 99% of 512M)
- Web (Caddy): 1.5G → 3G (was at 88% with gRPC connection state)
- Alloy: 512M → 1G (more buffer for OTLP/log forwarding)
Closes #5806
Related: linera-io/linera-infra#766
Summary
Resource limits & configurability
- LIMIT_CPUS_* / LIMIT_MEM_* env vars with sane defaults (8-core/32GB machine)
- Per-shard limits (LIMIT_CPUS_SHARD_0..3 / LIMIT_MEM_SHARD_0..3) for fine-grained tuning
- Configurable images (SCYLLA_IMAGE, CADDY_IMAGE, WATCHTOWER_IMAGE)
- Configurable ports (WEB_HTTP_PORT, WEB_HTTPS_PORT, PROXY_PORT, SCYLLA_PORT)
- ScyllaDB tuning (SCYLLA_DEVELOPER_MODE, SCYLLA_OVERPROVISIONED, healthcheck params)
- STORAGE_REPLICATION_FACTOR and WATCHTOWER_INTERVAL
- Add restart: unless-stopped to all long-running services
Docker & observability
- Replace containrrr/watchtower with nickfedor/watchtower:1.15.0
- ScyllaDB --smp flag (commented out, opt-in for fresh deployments)
- Move Prometheus and Grafana to docker-compose.local-monitoring.yml
deploy-validator.sh refactor (8 fixes)
- Make key generation idempotent (skip if server.json exists)
- Use .env.production.template as single source of truth instead of duplicated heredoc
- Fix operation order: genesis download before key generation
- Remove fake ScyllaDB env vars from XFS override (SCYLLA_DIRECT_IO_MODE, SCYLLA_CACHE_SIZE)
- Remove dead code: stop_services() never called, cleanup() trap useless
- Remove deprecated --remote-image flag
- Replace touch side-effect with test -w in validate_xfs_partition
- Replace fragile | pipe separator in get_git_info with global variables
Migration tooling
- Add scripts/upgrade-env.sh for safe .env migration on existing validators (preserves all data)
- Remove legacy fix-validator-env.sh
Upgrade path for existing validators
Test plan
Tested live on OVH validator (root@15.204.31.226.sslip.io):
- upgrade-env.sh tested on production .env: 19 values preserved, 39 new vars added, passwords with = handled correctly
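The "passwords with = handled correctly" case comes down to splitting each line on the first = only. The following is a minimal sketch of that idea and of a template-driven merge; it is not the contents of scripts/upgrade-env.sh, and env_get, merge_env, and the DB_PASSWORD key in the usage note are hypothetical.

```shell
# Sketch: print the value of KEY from an env file, keeping everything
# after the first '=' so values that contain '=' survive intact.
env_get() {
  file="$1"
  want="$2"
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;          # skip blank lines and comments
    esac
    key="${line%%=*}"               # text before the first '='
    value="${line#*=}"              # text after the first '='
    if [ "$key" = "$want" ]; then
      printf '%s\n' "$value"
      return 0
    fi
  done < "$file"
  return 1
}

# Sketch: walk the template; keep an existing value when the key is already
# set in the old file, otherwise emit the template line (a new variable).
merge_env() {
  template="$1"
  existing="$2"
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) printf '%s\n' "$line"; continue ;;
    esac
    key="${line%%=*}"
    if old="$(env_get "$existing" "$key")"; then
      printf '%s=%s\n' "$key" "$old"
    else
      printf '%s\n' "$line"
    fi
  done < "$template"
}
```

For example, with DB_PASSWORD=a=b=c in the existing file and DB_PASSWORD=changeme plus a new NEW_VAR=1 in the template, merge_env emits DB_PASSWORD=a=b=c followed by NEW_VAR=1.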