ctrs-rs new#1031
Closed
samhita-alla wants to merge 30 commits into
Closed
Conversation
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
…troller to match current remote controller Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
* Split out controller into core.rs and lib.rs. core.rs should be pure Rust and not depend on Python at all. * Update cargo.toml with two features - one that turns on pyo3/auto-initialize for building Rust crates (rlib), which allows us to cargo run binary files, and one that turns on pyo3/extension-module, which lets the manylinux image build wheels. * Temporarily adding in local flyte v2 idl - will remove next pr. * Update Phase -> ActionPhase. * Move the test functions in the Python BaseController class into separate test binaries now that we have rlib. (can delete these in the future) * Add an informer cache to support multiple root runs. * Hook up finalize parent action, which exposed the fact that Action::merge_from_submit was not being called, added. * Move controller errors out to a new file. * Update readme for setup docs a bit. --------- Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Merged main into ctrl-rs. This PR is needed to get things working again. --------- Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
* Update informer errors to use its own errors. * Add watch_for_errors. * Store the singular controller worker into a field so it can be raced in watch_for_errors. * Also race the informer error channel. * Some cleanup: remove unused fields from core base controller. * Adding a timeout for now to thread.join in _r_controller's stop. Otherwise sync tasks were hanging. needs further investigation. --------- Signed-off-by: Yee Hing Tong <[email protected]>
- Add `make dev-rs-dist` to run `make build-wheels`, `make dist`, build image, and install `flyte_controller_base` locally - Stop `_submit_loop` in rust controller correctly when running sync task - Remove color code in remote logging (see following image) <img width="1186" height="470" alt="image" src="https://github.com/user-attachments/assets/be0840a8-e77d-46a9-9270-ab79de95bf15" /> --------- Signed-off-by: machichima <[email protected]>
Reorganize module imports across several files to use grouped and brace-formatted use statements for better readability and consistency. Add a Makefile fmt target that enforces use of the nightly toolchain and runs `cargo +nightly fmt --all` to standardize formatting in CI and local development. These changes improve code consistency, maintainability, and make it easier to apply automatic formatting across the repository. Signed-off-by: Sergey Vilgelm <[email protected]>
Add Makefile targets for linting and autofix: - lint: run cargo clippy - lint-fix: run cargo clippy --fix Update CI workflow to run the Rust lint job: - add rs-lint job that checks out code, ensures toolchain, and runs make -C rs_controller lint Fix all clippy warnings. Signed-off-by: Sergey Vilgelm <[email protected]>
Add worker pool for rust controller Signed-off-by: machichima <[email protected]> Signed-off-by: Yee Hing Tong <[email protected]> Co-authored-by: Yee Hing Tong <[email protected]>
* Add an env var `_F_USE_RUST_CONTROLLER` that selects the rust based controller library. This will get recursively added also. Will not expose through a CLI switch yet... too experimental. Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Revert group data change Signed-off-by: Yee Hing Tong <[email protected]>
* Save action to informer action cache upon submit if not found in informer's action cache * `get_action` should call informer get or create, not just get, and return None rather than an error if informer doesn't have the action * Update flyteidl2 dep to match main project. --------- Signed-off-by: Yee Hing Tong <[email protected]>
Handle `SlowDownError` like what we had in python controller --------- Signed-off-by: machichima <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Pull 295 commits worth of upstream changes onto the Rust controller branch so the port can be aligned with the current Python RemoteController. Conflicts resolved: - README.md: keep main's Learn More + License sections, append a cleaned-up "Developing the Rust Core Controller" section (drop typos, drop "(not sure if this is needed)" / "unclear" comments). - src/flyte/_internal/controllers/_local_controller.py: drop the obsolete ControllerProtocol class on this branch in favor of main's canonical Controller Protocol in controllers/__init__.py.
Bring the Rust controller port back into shape after 295 commits of upstream drift, and apply the cleanup that was outstanding from the original WIP commit. Rust side - Bump flyteidl2 to 2.0.13 (matches Python pin in pyproject.toml). Updated Cargo.lock; the types we use are unchanged across this range. - Realign launch_task error handling with PR #1020: * Code::Aborted is now treated as fatal ("Run aborted: ..."). * Code::FailedPrecondition becomes retryable (sharding/limit hint). * Code::InvalidArgument and Code::NotFound stay fatal. * Other codes still retry with backoff via SlowDownError. - Note in Cargo.toml: the unified Actions service path is gated on the flyteidl2 Rust crate exposing it. Until then the Rust controller only implements the legacy QueueService + StateService flow; Python keeps both, and runs that opt into _U_USE_ACTIONS=1 fall back to Python. - Rename Dockerfile.maturinx -> Dockerfile.maturin (typo). - Makefile: replace hardcoded /Users/ytong flyteidl2 mount with an optional FLYTEIDL2_LOCAL env var. - Drop dev-only scratch scripts (test_auth_direct.py, test_auth_simple.py, try_rust_controller.py). Python side - _r_controller.py: align with main's RemoteController contract. * Use TaskCallSequencer from controllers/__init__.py instead of an inline defaultdict. * Wrap input/output conversions in ctx.new_in_driver_literal_conversion(True) so driver-side literal conversion is enabled when in a task context. * Pass task_context=tctx to translate_task_to_wire so the task spec picks up runtime context. * Add Stopwatch instrumentation around submit (parity with main). * Replace flyte.errors.RunAbortedError (does not exist) with flyte.errors.ActionAbortedError. * finalize_parent_action now clears the sequencer state. - controllers/remote/__init__.py: extend create_remote_controller to pick the Rust-backed implementation when _F_USE_RUST_CONTROLLER=1 and flyte_controller_base is importable, with graceful fallback to the Python implementation otherwise. The Rust path is force-disabled when _U_USE_ACTIONS=1 since the Rust side does not yet support the Actions service. - _run.py: forward _F_USE_RUST_CONTROLLER to spawned task containers alongside _U_USE_ACTIONS. Examples - Revert Rust-controller wheel-layer mods from blessed examples (basics/hello, basics/devbox_one, advanced/cancel_tasks, all examples/stress files). Those examples should not bake in machine-local paths; the Rust controller is selected via env var. - examples/advanced/hybrid_mode.py: replace the hardcoded /Users/ytong/.flyte/config-k3d.yaml and the embedded run id with FLYTE_CONFIG / FLYTE_HYBRID_RUN_NAME / FLYTE_HYBRID_RUN_BASE env vars. - Delete examples/advanced/remote_controller.py — it was a developer scratch file (referenced a nonexistent cloudidl module, used the pre-bytes Action.from_task signature, and embedded local paths).
Make the Python SDK's resolved auth config the source of truth for the Rust controller, instead of having the Rust side independently read _UNION_EAGER_API_KEY out of the environment. This is a step toward the full SessionConfig integration that PR #844 introduced for the Python controller. Rust side - Split CoreBaseController::new_with_auth into two: * new_with_api_key(api_key, workers) — primary path; takes the API key as an argument, decodes it the same way as Python's decode_api_key (base64 of endpoint:client_id:client_secret:org). * new_with_auth(workers) — thin backwards-compat wrapper that reads the legacy _UNION_EAGER_API_KEY env var and delegates to new_with_api_key. - BaseController::__new__ (PyO3) gains an api_key keyword: * api_key set -> new_with_api_key * endpoint set only -> new_without_auth (plain channel) * neither -> new_with_auth (env-var fallback) Python side - _r_controller.RemoteController.__new__/__init__ accept api_key and forward it to the Rust base. - create_remote_controller passes api_key (and endpoint) straight into the Rust constructor when _F_USE_RUST_CONTROLLER=1, so the Python and Rust paths see the same auth input. - TLS knobs (insecure_skip_verify, ca_cert_file_path) are still Python-only; the factory logs a warning and ignores them on the Rust path until follow-up Rust TLS work lands.
Example - examples/advanced/rust_controller.py: a 50-way fanout demo gated on _F_USE_RUST_CONTROLLER. Runs on either controller path so users can A/B the same workload, with a FAN_OUT env var to crank it up. The example header documents the env-var selection rules and the one-time wheel install. Python tests (tests/internal/controllers/test_create_remote_controller.py) - test_default_path_selects_python_controller: with no opt-in env vars, the factory builds the Python RemoteController. - test_rust_opt_in_selects_rust_controller: with _F_USE_RUST_CONTROLLER=1 and a stub flyte_controller_base on sys.modules, the Rust-backed RemoteController is returned and the endpoint is forwarded. - test_api_key_is_plumbed_into_rust_constructor: regression for the SessionConfig commit — explicit api_key reaches BaseController.__new__ instead of being silently dropped or read out of the env. - test_use_actions_forces_python_path_even_with_rust_opt_in: locks in the Actions-service-is-Python-only contract. - test_rust_opt_in_falls_back_when_wheel_missing: when the wheel cannot be imported, the factory must warn and fall back, never raise. Rust tests (rs_controller/src/auth/config.rs #[cfg(test)]) - decodes_well_formed_api_key: the format the Python decode_api_key produces (base64 of endpoint:client_id:client_secret:org). - endpoint_is_prefixed_with_https: tonic wants a real URL, not a dns:/// pseudo-scheme. - rejects_payload_with_wrong_number_of_parts: 3 parts -> InvalidFormat(3). - rejects_invalid_base64 - rejects_invalid_utf8 - empty_org_is_accepted: the org field is ignored by the controller; an empty value is not a decode failure. cargo test passes (6/6); pytest passes (5/5).
For _F_USE_RUST_CONTROLLER=1 to actually pick the Rust path on a remote
cluster, child task containers need flyte_controller_base installed.
Add a PythonWheels layer to examples/advanced/rust_controller.py that
discovers the wheel directory portably (no hardcoded local paths) and
falls back to the default image when no wheel is found.
Wheel-dir resolution:
1. FLYTE_RS_CONTROLLER_WHEELS env var (explicit override).
2. ./rs_controller/dist relative to the example file (the location
`make build-wheels` / `make build-wheel-local` produce).
3. None -> use the default debian-base image; the Rust opt-in
becomes a no-op and the run uses the Python controller.
Quickstart in the docstring now reflects the in-tree workflow:
cd rs_controller && make build-wheels
uv pip install --find-links ./rs_controller/dist --no-index \\
--force-reinstall --no-deps flyte_controller_base
_F_USE_RUST_CONTROLLER=1 python examples/advanced/rust_controller.py
A startup banner prints which controller path was selected and whether
the wheel layer was attached, so it's obvious at a glance whether the
opt-in actually took effect.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.