feat: add W&B worker GPU system metrics#1338

Draft
EazyReal wants to merge 1 commit into
areal-project:mainfrom
VmaxAI:vmax/wandb-worker-system-metrics

@EazyReal

Description

In single-controller training, the controller process initializes the W&B run while actor/rollout/critic workers own the CUDA devices. W&B samples system metrics only in processes that have initialized a W&B client, so with controller-only logging the run can show controller process memory but no worker GPU utilization.

This PR adds opt-in worker-side W&B system metrics using W&B shared mode:

  • stats_logger.wandb.system_metrics.enabled defaults to false, so existing runs are unchanged.
  • When enabled, stats_logger.wandb.mode must be shared; the controller stays the primary run and workers attach non-primary clients.
  • Worker clients are labeled by role/rank and use x_update_finish_state=False, so they contribute system metrics without finalizing the primary run.
  • Worker telemetry init/finish is lock-protected and idempotent; a worker W&B init failure is logged as a warning rather than raised.
  • English/Chinese docs and generated CLI references document the opt-in settings.
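
The split between the primary controller client and the non-primary worker clients can be sketched as two sets of settings. This is a minimal illustration, not the code in this PR: the helper names `controller_settings` and `worker_settings`, the `controller` label, and the `role-rank` label format are assumptions, and `x_stats_gpu_device_ids` is assumed to be the W&B setting that backs `gpu_device_ids`. Each dictionary is meant to be unpacked into `wandb.Settings(...)` and passed to `wandb.init` together with the shared run id.

```python
from typing import Any, Optional


def controller_settings() -> dict[str, Any]:
    """Settings for the controller, which owns the primary shared-mode run."""
    return {"mode": "shared", "x_primary": True, "x_label": "controller"}


def worker_settings(
    role: str, rank: int, gpu_device_ids: Optional[list[int]] = None
) -> dict[str, Any]:
    """Settings for a worker that only contributes system metrics.

    The label format "<role>-<rank>" is hypothetical; the PR only states
    that worker clients are labeled by role/rank.
    """
    settings: dict[str, Any] = {
        "mode": "shared",
        "x_primary": False,
        # Keep the worker from finalizing the controller's run on exit.
        "x_update_finish_state": False,
        "x_label": f"{role}-{rank}",
    }
    if gpu_device_ids is not None:
        # Assumed setting name; None leaves GPU selection to W&B.
        settings["x_stats_gpu_device_ids"] = list(gpu_device_ids)
    return settings
```

Leaving `gpu_device_ids` unset mirrors the `null` default described in the Usage section, where W&B samples whatever GPUs the worker can see.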

Why this belongs in AReaL

This cannot be solved cleanly by a launcher-only change. The launcher can select W&B config, but AReaL owns the worker process lifecycle, role/rank identity, RPC setup, and StatsLogger run identity. Keeping this in AReaL lets the feature be role-aware, non-breaking by default, and consistent across local, Slurm, sync RPC, and Ray RPC paths.

Related Issue

None. Opening as a draft PR per CONTRIBUTING.md so maintainers can review the design before this new feature is merged. Supersedes closed setup attempts #1335, #1336, and #1337.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Test coverage improvement

Usage

stats_logger:
  wandb:
    mode: shared
    system_metrics:
      enabled: true
      roles: [actor, rollout]   # null = every configured worker role
      gpu_device_ids: null      # null = W&B samples the worker's visible GPUs
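
The validation rule and role selection behind this config could look roughly like the sketch below. It is not the `WandBSystemMetricsConfig` / `WandBConfig` code from `areal/api/cli_args.py`: the class names `SystemMetricsSketch` and `WandBSketch`, the `applies_to` helper, the error message, and the `disabled` default for `mode` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SystemMetricsSketch:
    enabled: bool = False                       # opt-in; off by default
    roles: Optional[list[str]] = None           # None = every configured worker role
    gpu_device_ids: Optional[list[int]] = None  # None = W&B picks visible GPUs

    def applies_to(self, role: str) -> bool:
        """Return True if a worker with this role should attach a client."""
        if not self.enabled:
            return False
        return self.roles is None or role in self.roles


@dataclass
class WandBSketch:
    mode: str = "disabled"  # assumed default for this sketch
    system_metrics: SystemMetricsSketch = field(default_factory=SystemMetricsSketch)

    def __post_init__(self) -> None:
        # Worker-side metrics only make sense when workers can attach
        # to the controller's run, which requires shared mode.
        if self.system_metrics.enabled and self.mode != "shared":
            raise ValueError(
                "system_metrics.enabled=True requires wandb.mode='shared', "
                f"got mode={self.mode!r}"
            )
```

With the YAML above, `applies_to("actor")` and `applies_to("rollout")` would return `True` and `applies_to("critic")` would return `False`; setting `mode: online` with `enabled: true` would fail at config construction rather than at worker startup.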

Prior PRs reviewed

Merged precedents this follows:

Closed/unmerged precedents this intentionally avoids:

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated and built with ./docs/build_all.sh
  • Branch is up to date with main
  • Self-reviewed against merged and closed upstream precedents
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A. The feature is disabled by default, and existing W&B disabled, offline, and online configurations keep their current behavior.

Validation

  • uv run --only-group dev python -m pre_commit run --all-files
  • uv run --only-group dev python -m pytest tests/test_wandb_system_metrics.py tests/test_trackio_backend.py
  • ./docs/build_all.sh (succeeds; existing docs warnings remain for algorithms/sequence_packing.md toctree inclusion and tutorial/online_proxy.md OpenClaw README xref)

Test plan

  • Maintainer safe-to-test CI on the upstream 2-GPU runner.
  • Optional cluster canary: run a shared-mode W&B job and confirm worker GPU utilization appears on the primary run.

In single-controller training the controller may run on a CPU node while
actor/rollout workers run on GPU nodes. With ``wandb.mode=online`` only the
controller's process metrics are recorded — worker GPU utilisation is missing.

This change introduces an opt-in worker-side W&B system metrics client using
W&B shared mode:

- Controller initialises the primary run with ``x_primary=True``.
- Selected worker processes attach a non-primary client with a role/rank label,
  ``x_primary=False`` and ``x_update_finish_state=False`` so they contribute
  system metrics only and never finalise the run.
- Worker init is locked, exception-tolerant, and silently skips when disabled.
- ``stats_logger.wandb.system_metrics`` exposes ``enabled``, ``roles`` and
  ``gpu_device_ids``; the WandBConfig validator requires ``mode='shared'``.
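
The locked, exception-tolerant worker init described above can be sketched as follows. This is an illustration under assumptions, not the contents of `areal/utils/wandb_system_metrics.py`: the `WorkerTelemetry` class, its `init_fn` callback (standing in for the real `wandb.init` call), and the choice not to retry after a failed attempt are all hypothetical.

```python
import logging
import threading
from typing import Any, Callable, Optional

logger = logging.getLogger("WandBSystemMetricsSketch")


class WorkerTelemetry:
    """Hypothetical holder for one worker's non-primary W&B client."""

    def __init__(self, init_fn: Callable[[], Any], enabled: bool) -> None:
        self._init_fn = init_fn      # stands in for the real wandb.init call
        self._enabled = enabled
        self._lock = threading.Lock()
        self._run: Optional[Any] = None
        self._attempted = False

    def init(self) -> bool:
        """Attach the client once; return True if a client is active."""
        if not self._enabled:
            return False             # silently skip when disabled
        with self._lock:
            if self._attempted:      # idempotent: never call init_fn twice
                return self._run is not None
            self._attempted = True
            try:
                self._run = self._init_fn()
            except Exception as exc:
                # Telemetry must not take the worker down.
                logger.warning("W&B worker init failed: %s", exc)
                self._run = None
            return self._run is not None

    def finish(self) -> None:
        """Finish the client at most once; safe to call repeatedly."""
        with self._lock:
            run, self._run = self._run, None
        if run is not None:
            try:
                run.finish()
            except Exception as exc:
                logger.warning("W&B worker finish failed: %s", exc)
```

Holding the lock across the init call means concurrent configure requests on the same worker cannot create two clients, at the cost of blocking the second caller until the first attempt completes.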

Key changes:
- areal/api/cli_args.py: WandBSystemMetricsConfig dataclass + WandBConfig wiring
- areal/utils/wandb_system_metrics.py: worker init/finish + configure hook
- areal/utils/stats_logger.py: controller becomes the primary shared-mode client
- areal/utils/logging.py: register WandBSystemMetrics logger color
- areal/infra/rpc/{rpc_server,ray_rpc_server}.py: register worker hooks
- areal/infra/scheduler/{local,slurm,ray}.py: pre-resolve wandb run id
- areal/trainer/*: init StatsLogger before worker /configure
- docs/{en,zh}: metrics_tracking guide + regenerated CLI reference
- tests/test_wandb_system_metrics.py: unit tests for config and init/finish