feat: Initial scaffold for the v2 microservice operator CLI#1375
feat: Initial scaffold for the v2 microservice operator CLI#1375sitabulaixizawaluduo wants to merge 11 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a lightweight, scaffolded areal command-line interface (CLI) for the v2 microservice control plane, supporting subcommands for inf, agent, train, and weight-update. It also adds cross-cutting state management helpers and updates the package configuration to expose the areal console script. A review comment suggests explicitly specifying the UTF-8 encoding when writing state files to prevent platform-dependent encoding issues.
| """ | ||
| path.parent.mkdir(parents=True, exist_ok=True) | ||
| tmp = path.with_suffix(path.suffix + ".tmp") | ||
| with open(tmp, "w") as f: |
There was a problem hiding this comment.
When writing text files, especially those containing JSON data (which is strictly UTF-8 per RFC 8259), it is highly recommended to explicitly specify encoding="utf-8". Otherwise, Python will fall back to the system's default encoding (which may vary across environments and platforms), potentially leading to UnicodeEncodeError or corrupted state files when handling non-ASCII characters.
| with open(tmp, "w") as f: | |
| with open(tmp, "w", encoding="utf-8") as f: |
Drop the `train` namespace and expose `run / start / stop / ps / status / logs` as top-level verbs directly under `areal`. Service-side surfaces (`inf`, `agent`, `weight-update`) stay namespaced. Asymmetric on purpose: training is the primary use case and benefits from shorter invocation (`areal run` vs `areal train run`), while service operator commands are auxiliary and read better in a namespace. Each new top-level verb is a parser-only stub (silent `return 0`), matching the same convention as the existing namespace stubs. State directory for training jobs is `~/.areal/runs/`.
areal run: foreground training driver invoker. Resolves driver from --driver flag or yaml `driver:` field; forwards --config + Hydra overrides verbatim to the driver. Writes RunState to ~/.areal/runs/. areal inf run: detached v2 inference service launcher. Spawns gateway + router as detached subprocesses, polls /health, writes ServiceState to ~/.areal/inf/. CLI exits after the service is healthy. Both verbs lazy-import their implementations inside _handle so `areal --help` / `areal <verb> --help` stay light.
…__init__.py The lazy `__getattr__` form introduced in d839c21 broke an implicit cycle-breaker that v1 / production code relied on. Eager loading of `areal.infra` at `import areal` time pre-populates sys.modules so that `cli_args.py`'s top-level `from areal.engine.fsdp_utils.attn_impl import ...` chain (infra → engine_api → alloc_mode → cli_args) hits the cache and never tries to look up symbols on a partially-loaded cli_args. Without that pre-load, invoking `areal run --driver examples.math.gsm8k_rl:main` triggers the cycle the moment the driver imports `from areal.api.cli_args import GRPOConfig`: cannot import name 'ParallelStrategy' from partially initialized module 'areal.api.alloc_mode' (most likely due to a circular import) The lightness invariant the lazy form was originally protecting was removed in a525e34 (test_cli_lightness.py deletion), so the lazy form no longer earns its keep. Revert to main's eager form. A real fix (deferring the attn_impl / name_resolve imports in cli_args itself + lazy infra/__init__) is the proper long-term direction, but that touches areal/api/cli_args.py and areal/infra/__init__.py and is out of scope for the CLI scaffold PR.
… --name flag
areal run no longer runs the driver in the foreground. It now spawns the
driver as a detached subprocess via a thin _exec.py wrapper, then exits.
The wrapper:
- updates RunState.last_heartbeat every 5 s on a background thread
- converts SIGTERM into SystemExit so cleanup runs
- writes final status + exit_code on driver exit
Status semantics from RunState + heartbeat + PID liveness:
status="running" + recent heartbeat + pid alive → healthy
status="running" + stale heartbeat + pid alive → hung
status="running" + pid dead → crashed
status in {completed, failed, stopped} → terminal
Other changes:
- --name removed; the run name is always experiment_name/trial_name
from the yaml (Hydra overrides experiment_name=X / trial_name=Y
are honored by re-deriving from overrides).
- _peek_scheduler_type and the RunState.scheduler_type field deleted;
CLI does not own scheduler.type metadata (driver does internally).
- RunState.argv deleted (redundant with config_path + overrides);
last_heartbeat and exit_code added.
…both modes
K8s Job / Slurm sbatch / any orchestrator that infers task status from
the entry process's lifecycle requires `areal run` to:
- stay alive while the driver runs
- exit with the driver's exit code (0 success, non-zero failure)
The previous detached-default behaviour broke this contract: CLI exited
immediately with 0, container engine marked Pod completed, killed the
background driver. Fix: `areal run` now runs the driver in the CLI
process. Detached mode moves to `areal start` (already a top-level verb
stub in the scaffold).
Implementation: extract `run_with_wrapper(name, driver, config, overrides)`
from `_exec.main()`. Both modes call it:
- `areal run` → CLI process calls it directly (attached)
- `areal start` → CLI spawns subprocess running `_exec.main` which
in turn calls `run_with_wrapper` (detached)
Heartbeat thread, SIGTERM trap, and final-state write are identical in
both modes — single implementation.
Description
Initial scaffold for the v2 microservice operator CLI:
arealconsole-script with four namespaces (inf,agent,train,weight-update); no verbs implemented yetRelated Issue
Related #1374
Type of Change
Checklist
pre-commit run --all-files)./docs/build_all.sh)main/review-prcommand/create-prBreaking Change Details (if applicable):
Additional Context
Need help? Check the Contributing Guide or ask in
GitHub Discussions!