Skip to content

feat: Initial scaffold for the v2 microservice operator CLI#1375

Draft
sitabulaixizawaluduo wants to merge 11 commits into
mainfrom
feat/cli-scaffold
Draft

feat: Initial scaffold for the v2 microservice operator CLI#1375
sitabulaixizawaluduo wants to merge 11 commits into
mainfrom
feat/cli-scaffold

Conversation

@sitabulaixizawaluduo
Copy link
Copy Markdown
Collaborator

@sitabulaixizawaluduo sitabulaixizawaluduo commented May 28, 2026

Description

Initial scaffold for the v2 microservice operator CLI:

  • areal console-script with four namespaces (inf, agent, train, weight-update); no verbs implemented yet

Related Issue

Related #1374

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

Additional Context


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a lightweight, scaffolded areal command-line interface (CLI) for the v2 microservice control plane, supporting subcommands for inf, agent, train, and weight-update. It also adds cross-cutting state management helpers and updates the package configuration to expose the areal console script. A review comment suggests explicitly specifying the UTF-8 encoding when writing state files to prevent platform-dependent encoding issues.

"""
path.parent.mkdir(parents=True, exist_ok=True)
tmp = path.with_suffix(path.suffix + ".tmp")
with open(tmp, "w") as f:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When writing text files, especially those containing JSON data (which is strictly UTF-8 per RFC 8259), it is highly recommended to explicitly specify encoding="utf-8". Otherwise, Python will fall back to the system's default encoding (which may vary across environments and platforms), potentially leading to UnicodeEncodeError or corrupted state files when handling non-ASCII characters.

Suggested change
with open(tmp, "w") as f:
with open(tmp, "w", encoding="utf-8") as f:

Drop the `train` namespace and expose `run / start / stop / ps / status
/ logs` as top-level verbs directly under `areal`. Service-side surfaces
(`inf`, `agent`, `weight-update`) stay namespaced. Asymmetric on purpose:
training is the primary use case and benefits from shorter invocation
(`areal run` vs `areal train run`), while service operator commands are
auxiliary and read better in a namespace.

Each new top-level verb is a parser-only stub (silent `return 0`), matching
the same convention as the existing namespace stubs. State directory for
training jobs is `~/.areal/runs/`.
areal run: foreground training driver invoker. Resolves driver from
--driver flag or yaml `driver:` field; forwards --config + Hydra
overrides verbatim to the driver. Writes RunState to ~/.areal/runs/.

areal inf run: detached v2 inference service launcher. Spawns gateway
+ router as detached subprocesses, polls /health, writes ServiceState
to ~/.areal/inf/. CLI exits after the service is healthy.

Both verbs lazy-import their implementations inside _handle so
`areal --help` / `areal <verb> --help` stay light.
…__init__.py

The lazy `__getattr__` form introduced in d839c21 broke an implicit
cycle-breaker that v1 / production code relied on. Eager loading of
`areal.infra` at `import areal` time pre-populates sys.modules so that
`cli_args.py`'s top-level `from areal.engine.fsdp_utils.attn_impl import
...` chain (infra → engine_api → alloc_mode → cli_args) hits the cache
and never tries to look up symbols on a partially-loaded cli_args.

Without that pre-load, invoking `areal run --driver
examples.math.gsm8k_rl:main` triggers the cycle the moment the driver
imports `from areal.api.cli_args import GRPOConfig`:

    cannot import name 'ParallelStrategy' from partially initialized
    module 'areal.api.alloc_mode' (most likely due to a circular import)

The lightness invariant the lazy form was originally protecting was
removed in a525e34 (test_cli_lightness.py deletion), so the lazy form
no longer earns its keep. Revert to main's eager form.

A real fix (deferring the attn_impl / name_resolve imports in cli_args
itself + lazy infra/__init__) is the proper long-term direction, but
that touches areal/api/cli_args.py and areal/infra/__init__.py and is
out of scope for the CLI scaffold PR.
… --name flag

areal run no longer runs the driver in the foreground. It now spawns the
driver as a detached subprocess via a thin _exec.py wrapper, then exits.
The wrapper:
  - updates RunState.last_heartbeat every 5 s on a background thread
  - converts SIGTERM into SystemExit so cleanup runs
  - writes final status + exit_code on driver exit

Status semantics from RunState + heartbeat + PID liveness:
  status="running" + recent heartbeat + pid alive  → healthy
  status="running" + stale heartbeat + pid alive   → hung
  status="running" + pid dead                       → crashed
  status in {completed, failed, stopped}            → terminal

Other changes:
  - --name removed; the run name is always experiment_name/trial_name
    from the yaml (Hydra overrides experiment_name=X / trial_name=Y
    are honored by re-deriving from overrides).
  - _peek_scheduler_type and the RunState.scheduler_type field deleted;
    CLI does not own scheduler.type metadata (driver does internally).
  - RunState.argv deleted (redundant with config_path + overrides);
    last_heartbeat and exit_code added.
…both modes

K8s Job / Slurm sbatch / any orchestrator that infers task status from
the entry process's lifecycle requires `areal run` to:
  - stay alive while the driver runs
  - exit with the driver's exit code (0 success, non-zero failure)

The previous detached-default behaviour broke this contract: CLI exited
immediately with 0, container engine marked Pod completed, killed the
background driver. Fix: `areal run` now runs the driver in the CLI
process. Detached mode moves to `areal start` (already a top-level verb
stub in the scaffold).

Implementation: extract `run_with_wrapper(name, driver, config, overrides)`
from `_exec.main()`. Both modes call it:
  - `areal run`  → CLI process calls it directly (attached)
  - `areal start` → CLI spawns subprocess running `_exec.main` which
                    in turn calls `run_with_wrapper` (detached)

Heartbeat thread, SIGTERM trap, and final-state write are identical in
both modes — single implementation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant