Skip to content

ctrs-rs new#1031

Closed
samhita-alla wants to merge 30 commits into
mainfrom
ctrl-rs-new
Closed

ctrs-rs new#1031
samhita-alla wants to merge 30 commits into
mainfrom
ctrl-rs-new

Conversation

@samhita-alla
Copy link
Copy Markdown
Contributor

No description provided.

wild-endeavor and others added 30 commits November 27, 2025 17:06
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
…troller to match current remote controller

Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
* Split out controller into core.rs and lib.rs. core.rs should be pure
Rust and not depend on Python at all.
* Update cargo.toml with two features - one that turns on
pyo3/auto-initialize for building Rust crates (rlib), which allows us to
cargo run binary files, and one that turns on pyo3/extension-module,
which lets the manylinux image build wheels.
* Temporarily adding in local flyte v2 idl - will remove next pr.
* Update Phase -> ActionPhase.
* Move the test functions in the Python BaseController class into
separate test binaries now that we have rlib. (can delete these in the
future)
* Add an informer cache to support multiple root runs.
* Hook up finalize parent action, which exposed the fact that
Action::merge_from_submit was not being called, added.
* Move controller errors out to a new file.
* Update readme for setup docs a bit.

---------

Signed-off-by: Yee Hing Tong <[email protected]>
Merged main into ctrl-rs. This PR is needed to get things working again.

---------

Signed-off-by: Yee Hing Tong <[email protected]>
* Update informer errors to use its own errors.
* Add watch_for_errors.
* Store the singular controller worker into a field so it can be raced
in watch_for_errors.
* Also race the informer error channel.
* Some cleanup: remove unused fields from core base controller.
* Adding a timeout for now to thread.join in _r_controller's stop.
Otherwise sync tasks were hanging. needs further investigation.

---------

Signed-off-by: Yee Hing Tong <[email protected]>
- Add `make dev-rs-dist` to run `make build-wheels`, `make dist`, build
image, and install `flyte_controller_base` locally
- Stop `_submit_loop` in rust controller correctly when running sync
task
- Remove color code in remote logging (see following image)

<img width="1186" height="470" alt="image"
src="https://github.com/user-attachments/assets/be0840a8-e77d-46a9-9270-ab79de95bf15"
/>

---------

Signed-off-by: machichima <[email protected]>
Reorganize module imports across several files to use grouped and
brace-formatted use statements for better readability and consistency.

Add a Makefile fmt target that enforces use of the nightly toolchain
and runs `cargo +nightly fmt --all` to standardize formatting in CI
and local development.

These changes improve code consistency, maintainability, and make it
easier to apply automatic formatting across the repository.

Signed-off-by: Sergey Vilgelm <[email protected]>
Add Makefile targets for linting and autofix:
- lint: run cargo clippy
- lint-fix: run cargo clippy --fix

Update CI workflow to run the Rust lint job:
- add rs-lint job that checks out code, ensures toolchain,
  and runs make -C rs_controller lint

Fix all clippy warnings.

Signed-off-by: Sergey Vilgelm <[email protected]>
Add worker pool for rust controller

Signed-off-by: machichima <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Co-authored-by: Yee Hing Tong <[email protected]>
* Add an env var `_F_USE_RUST_CONTROLLER` that selects the rust based
controller library. This will get recursively added also. Will not expose through a CLI switch yet... too experimental.

Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Revert group data change

Signed-off-by: Yee Hing Tong <[email protected]>
* Save action to informer action cache upon submit if not found in
informer's action cache
* `get_action` should call informer get or create, not just get, and
return None rather than an error if informer doesn't have the action
* Update flyteidl2 dep to match main project.

---------

Signed-off-by: Yee Hing Tong <[email protected]>
Handle `SlowDownError` like what we had in python controller

---------

Signed-off-by: machichima <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Pull 295 commits worth of upstream changes onto the Rust controller branch
so the port can be aligned with the current Python RemoteController.

Conflicts resolved:
- README.md: keep main's Learn More + License sections, append a cleaned-up
  "Developing the Rust Core Controller" section (drop typos, drop
  "(not sure if this is needed)" / "unclear" comments).
- src/flyte/_internal/controllers/_local_controller.py: drop the obsolete
  ControllerProtocol class on this branch in favor of main's canonical
  Controller Protocol in controllers/__init__.py.
Bring the Rust controller port back into shape after 295 commits of
upstream drift, and apply the cleanup that was outstanding from the
original WIP commit.

Rust side
- Bump flyteidl2 to 2.0.13 (matches Python pin in pyproject.toml).
  Updated Cargo.lock; the types we use are unchanged across this range.
- Realign launch_task error handling with PR #1020:
  * Code::Aborted is now treated as fatal ("Run aborted: ...").
  * Code::FailedPrecondition becomes retryable (sharding/limit hint).
  * Code::InvalidArgument and Code::NotFound stay fatal.
  * Other codes still retry with backoff via SlowDownError.
- Note in Cargo.toml: the unified Actions service path is gated on the
  flyteidl2 Rust crate exposing it. Until then the Rust controller only
  implements the legacy QueueService + StateService flow; Python keeps
  both, and runs that opt into _U_USE_ACTIONS=1 fall back to Python.
- Rename Dockerfile.maturinx -> Dockerfile.maturin (typo).
- Makefile: replace hardcoded /Users/ytong flyteidl2 mount with an
  optional FLYTEIDL2_LOCAL env var.
- Drop dev-only scratch scripts (test_auth_direct.py, test_auth_simple.py,
  try_rust_controller.py).

Python side
- _r_controller.py: align with main's RemoteController contract.
  * Use TaskCallSequencer from controllers/__init__.py instead of an
    inline defaultdict.
  * Wrap input/output conversions in
    ctx.new_in_driver_literal_conversion(True) so driver-side literal
    conversion is enabled when in a task context.
  * Pass task_context=tctx to translate_task_to_wire so the task spec
    picks up runtime context.
  * Add Stopwatch instrumentation around submit (parity with main).
  * Replace flyte.errors.RunAbortedError (does not exist) with
    flyte.errors.ActionAbortedError.
  * finalize_parent_action now clears the sequencer state.
- controllers/remote/__init__.py: extend create_remote_controller to
  pick the Rust-backed implementation when _F_USE_RUST_CONTROLLER=1
  and flyte_controller_base is importable, with graceful fallback to
  the Python implementation otherwise. The Rust path is force-disabled
  when _U_USE_ACTIONS=1 since the Rust side does not yet support the
  Actions service.
- _run.py: forward _F_USE_RUST_CONTROLLER to spawned task containers
  alongside _U_USE_ACTIONS.

Examples
- Revert Rust-controller wheel-layer mods from blessed examples
  (basics/hello, basics/devbox_one, advanced/cancel_tasks, all
  examples/stress files). Those examples should not bake in
  machine-local paths; the Rust controller is selected via env var.
- examples/advanced/hybrid_mode.py: replace the hardcoded
  /Users/ytong/.flyte/config-k3d.yaml and the embedded run id with
  FLYTE_CONFIG / FLYTE_HYBRID_RUN_NAME / FLYTE_HYBRID_RUN_BASE env vars.
- Delete examples/advanced/remote_controller.py — it was a developer
  scratch file (referenced a nonexistent cloudidl module, used the
  pre-bytes Action.from_task signature, and embedded local paths).
Make the Python SDK's resolved auth config the source of truth for the
Rust controller, instead of having the Rust side independently read
_UNION_EAGER_API_KEY out of the environment. This is a step toward the
full SessionConfig integration that PR #844 introduced for the Python
controller.

Rust side
- Split CoreBaseController::new_with_auth into two:
  * new_with_api_key(api_key, workers) — primary path; takes the API
    key as an argument, decodes it the same way as Python's
    decode_api_key (base64 of endpoint:client_id:client_secret:org).
  * new_with_auth(workers) — thin backwards-compat wrapper that reads
    the legacy _UNION_EAGER_API_KEY env var and delegates to
    new_with_api_key.
- BaseController::__new__ (PyO3) gains an api_key keyword:
  * api_key set        -> new_with_api_key
  * endpoint set only  -> new_without_auth (plain channel)
  * neither            -> new_with_auth (env-var fallback)

Python side
- _r_controller.RemoteController.__new__/__init__ accept api_key and
  forward it to the Rust base.
- create_remote_controller passes api_key (and endpoint) straight into
  the Rust constructor when _F_USE_RUST_CONTROLLER=1, so the Python
  and Rust paths see the same auth input.
- TLS knobs (insecure_skip_verify, ca_cert_file_path) are still
  Python-only; the factory logs a warning and ignores them on the
  Rust path until follow-up Rust TLS work lands.
Example
- examples/advanced/rust_controller.py: a 50-way fanout demo gated on
  _F_USE_RUST_CONTROLLER. Runs on either controller path so users can
  A/B the same workload, with a FAN_OUT env var to crank it up.
  The example header documents the env-var selection rules and the
  one-time wheel install.

Python tests (tests/internal/controllers/test_create_remote_controller.py)
- test_default_path_selects_python_controller: with no opt-in env vars,
  the factory builds the Python RemoteController.
- test_rust_opt_in_selects_rust_controller: with _F_USE_RUST_CONTROLLER=1
  and a stub flyte_controller_base on sys.modules, the Rust-backed
  RemoteController is returned and the endpoint is forwarded.
- test_api_key_is_plumbed_into_rust_constructor: regression for the
  SessionConfig commit — explicit api_key reaches BaseController.__new__
  instead of being silently dropped or read out of the env.
- test_use_actions_forces_python_path_even_with_rust_opt_in: locks in
  the Actions-service-is-Python-only contract.
- test_rust_opt_in_falls_back_when_wheel_missing: when the wheel cannot
  be imported, the factory must warn and fall back, never raise.

Rust tests (rs_controller/src/auth/config.rs #[cfg(test)])
- decodes_well_formed_api_key: the format the Python decode_api_key
  produces (base64 of endpoint:client_id:client_secret:org).
- endpoint_is_prefixed_with_https: tonic wants a real URL, not a
  dns:/// pseudo-scheme.
- rejects_payload_with_wrong_number_of_parts: 3 parts -> InvalidFormat(3).
- rejects_invalid_base64
- rejects_invalid_utf8
- empty_org_is_accepted: the org field is ignored by the controller;
  an empty value is not a decode failure.

cargo test passes (6/6); pytest passes (5/5).
For _F_USE_RUST_CONTROLLER=1 to actually pick the Rust path on a remote
cluster, child task containers need flyte_controller_base installed.
Add a PythonWheels layer to examples/advanced/rust_controller.py that
discovers the wheel directory portably (no hardcoded local paths) and
falls back to the default image when no wheel is found.

Wheel-dir resolution:
  1. FLYTE_RS_CONTROLLER_WHEELS env var (explicit override).
  2. ./rs_controller/dist relative to the example file (the location
     `make build-wheels` / `make build-wheel-local` produce).
  3. None -> use the default debian-base image; the Rust opt-in
     becomes a no-op and the run uses the Python controller.

Quickstart in the docstring now reflects the in-tree workflow:
  cd rs_controller && make build-wheels
  uv pip install --find-links ./rs_controller/dist --no-index \\
      --force-reinstall --no-deps flyte_controller_base
  _F_USE_RUST_CONTROLLER=1 python examples/advanced/rust_controller.py

A startup banner prints which controller path was selected and whether
the wheel layer was attached, so it's obvious at a glance whether the
opt-in actually took effect.
@samhita-alla samhita-alla marked this pull request as draft May 6, 2026 06:12
Base automatically changed from ctrl-rs to main May 8, 2026 19:55
@kumare3 kumare3 closed this May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants