feat: add W&B worker GPU system metrics #1338
Draft
EazyReal wants to merge 1 commit into
In single-controller training the controller may run on a CPU node while
actor/rollout workers run on GPU nodes. With ``wandb.mode=online`` only the
controller's process metrics are recorded — worker GPU utilisation is missing.
This change introduces an opt-in worker-side W&B system metrics client using
W&B shared mode:
- Controller initialises the primary run with ``x_primary=True``.
- Selected worker processes attach a non-primary client with a role/rank label,
``x_primary=False`` and ``x_update_finish_state=False`` so they contribute
system metrics only and never finalise the run.
- Worker init is locked, exception-tolerant, and silently skips when disabled.
- ``stats_logger.wandb.system_metrics`` exposes ``enabled``, ``roles`` and
``gpu_device_ids``; the WandBConfig validator requires ``mode='shared'``.
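
The controller/worker split above can be sketched as the settings each client might pass to ``wandb.Settings``. This is an illustration rather than the code in this PR: the helper name ``shared_mode_settings`` is hypothetical, and mapping the role/rank label to ``x_label`` and the GPU ids to ``x_stats_gpu_device_ids`` is an assumption about how those inputs reach W&B. The helper only builds plain dictionaries, so it runs without ``wandb`` installed.

```python
from __future__ import annotations

from typing import Any


def shared_mode_settings(
    *, primary: bool, label: str, gpu_device_ids: list[int] | None = None
) -> dict[str, Any]:
    """Build keyword arguments intended for ``wandb.Settings`` (hypothetical helper)."""
    settings: dict[str, Any] = {
        "mode": "shared",
        "x_primary": primary,
        # Assumption: the role/rank label is passed through ``x_label``.
        "x_label": label,
    }
    if not primary:
        # Non-primary clients should not finalise the run when they exit.
        settings["x_update_finish_state"] = False
    if gpu_device_ids is not None:
        # Assumption: GPU selection is passed through ``x_stats_gpu_device_ids``.
        settings["x_stats_gpu_device_ids"] = list(gpu_device_ids)
    return settings


controller_settings = shared_mode_settings(primary=True, label="controller")
worker_settings = shared_mode_settings(
    primary=False, label="actor-rank0", gpu_device_ids=[0]
)
```

A worker would then attach to the run id resolved by the controller, roughly ``wandb.init(id=run_id, settings=wandb.Settings(**worker_settings))``, where ``run_id`` stands for that pre-resolved id.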
Key changes:
- areal/api/cli_args.py: WandBSystemMetricsConfig dataclass + WandBConfig wiring
- areal/utils/wandb_system_metrics.py: worker init/finish + configure hook
- areal/utils/stats_logger.py: controller becomes the primary shared-mode client
- areal/utils/logging.py: register WandBSystemMetrics logger color
- areal/infra/rpc/{rpc_server,ray_rpc_server}.py: register worker hooks
- areal/infra/scheduler/{local,slurm,ray}.py: pre-resolve wandb run id
- areal/trainer/*: init StatsLogger before worker /configure
- docs/{en,zh}: metrics_tracking guide + regenerated CLI reference
- tests/test_wandb_system_metrics.py: unit tests for config and init/finish
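
As a rough sketch of the config validation and the guarded worker init listed above, assuming simplified stand-ins for the real classes: ``SystemMetricsSketch``, ``WandBSketch`` and ``WorkerMetricsInit`` are hypothetical names, the ``"disabled"`` default mode is an assumption, and so is the rule that a worker's role must appear in ``roles`` to be selected. The actual ``WandBConfig`` and worker hook in this PR may differ.

```python
from __future__ import annotations

import threading
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SystemMetricsSketch:
    """Hypothetical stand-in for WandBSystemMetricsConfig."""

    enabled: bool = False
    roles: list[str] = field(default_factory=list)
    gpu_device_ids: list[int] | None = None


@dataclass
class WandBSketch:
    """Hypothetical stand-in for WandBConfig; the default mode is an assumption."""

    mode: str = "disabled"
    system_metrics: SystemMetricsSketch = field(default_factory=SystemMetricsSketch)

    def __post_init__(self) -> None:
        # Fail early: worker metrics only make sense in shared mode.
        if self.system_metrics.enabled and self.mode != "shared":
            raise ValueError("system_metrics.enabled requires wandb.mode='shared'")


class WorkerMetricsInit:
    """Locked, exception-tolerant, at-most-once start of a worker-side client."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._started = False

    def maybe_start(
        self, cfg: WandBSketch, role: str, start_client: Callable[[], object]
    ) -> bool:
        sm = cfg.system_metrics
        # Silently skip when disabled or when this role is not selected (assumed rule).
        if not sm.enabled or role not in sm.roles:
            return False
        with self._lock:
            if self._started:
                return True
            try:
                start_client()
            except Exception:
                # A metrics failure must never take down the training worker.
                return False
            self._started = True
            return True
```

Passing the client start as a callable keeps the guard independent of ``wandb`` itself, so the skip, lock and failure paths can be unit-tested without a network connection.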
Description
In single-controller training, the controller process initializes the W&B run while actor/rollout/critic workers own the CUDA devices. W&B samples system metrics from initialized W&B clients, so controller-only logging can show controller process memory without worker GPU utilization.
This PR adds opt-in worker-side W&B system metrics using W&B shared mode:
- `stats_logger.wandb.system_metrics.enabled` defaults to `false`, so existing runs are unchanged.
- `stats_logger.wandb.mode` must be `shared`; the controller stays the primary run and workers attach non-primary clients.
- Worker clients set `x_update_finish_state=False`, so they contribute system metrics without finalizing the primary run.

Why this belongs in AReaL
This cannot be solved cleanly by a launcher-only change. The launcher can select W&B config, but AReaL owns the worker process lifecycle, role/rank identity, RPC setup, and `StatsLogger` run identity. Keeping this in AReaL lets the feature be role-aware, non-breaking by default, and consistent across local, Slurm, sync RPC, and Ray RPC paths.

Related Issue
None. Opening as a draft PR per `CONTRIBUTING.md` so maintainers can review the design before this new feature is merged. Supersedes closed setup attempts #1335, #1336, and #1337.

Type of Change
Usage
Prior PRs reviewed
Merged precedents this follows:
- `StatsLogger` lifecycle integration, tests, and docs. This PR follows that shape but adds no dependency and keeps the new path disabled by default.
- `system_metrics` role/GPU inputs and fails early when worker metrics are enabled without `wandb.mode: shared`.

Closed/unmerged precedents this intentionally avoids:
Checklist
- `pre-commit run --all-files`
- `./docs/build_all.sh`
- `main`

Breaking Change Details (if applicable):
N/A. The feature is disabled by default, and existing W&B `disabled`, `offline`, and `online` configurations keep their current behavior.

Validation
- `uv run --only-group dev python -m pre_commit run --all-files`
- `uv run --only-group dev python -m pytest tests/test_wandb_system_metrics.py tests/test_trackio_backend.py`
- `./docs/build_all.sh` (succeeds; existing docs warnings remain for `algorithms/sequence_packing.md` toctree inclusion and `tutorial/online_proxy.md` OpenClaw README xref)

Test plan
- `safe-to-test` CI on the upstream 2-GPU runner.