[train][Preemption handling 2/n] Fan out preemption signal to workers by liulehui · Pull Request #64099 · ray-project/ray

liulehui · 2026-06-15T16:56:27Z

Description

Builds on the Stage 1 preemption watcher ([train][Preemption handling 1/n] Add preemption watcher for node-drain observability #63807): when the watcher detects a preemption, it now fans the signal out to the worker actors, and each worker stores it in a PreemptionContext on its TrainContext.
Added a mark_preempt RPC on the Train worker.
The watcher fans out the preemption info through every worker handle.
A follow-up PR will add public APIs and controller state.
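
To illustrate the fan-out described above, here is a minimal sketch using plain Python threads rather than Ray actors. It is not the code in this PR: `PreemptionInfo`, `FakeWorker`, `fan_out`, and the `set` method are hypothetical names chosen for illustration, and only `mark_preempt`, `PreemptionContext`, and `preemption_context.get()` are taken from the PR text and diff.

```python
import threading
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PreemptionInfo:
    """Hypothetical payload; field names mirror the log output in this PR."""
    preempted_ranks: Tuple[int, ...]
    deadline_ms: int


class PreemptionContext:
    """Holder shared between the RPC thread and the training thread."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._info: Optional[PreemptionInfo] = None

    def set(self, info: PreemptionInfo) -> None:
        with self._lock:
            self._info = info

    def get(self) -> Optional[PreemptionInfo]:
        with self._lock:
            return self._info


class FakeWorker:
    """Stand-in for a train worker actor (assumption: one context per worker)."""

    def __init__(self, rank: int) -> None:
        self.rank = rank
        self.preemption_context = PreemptionContext()

    def mark_preempt(self, info: PreemptionInfo) -> None:
        # In the PR this is an RPC on the worker actor; here it is a local call.
        self.preemption_context.set(info)

    def this_worker_preempted(self) -> bool:
        info = self.preemption_context.get()
        return info is not None and self.rank in info.preempted_ranks


def fan_out(workers: List[FakeWorker], info: PreemptionInfo) -> None:
    """Deliver the same signal to every worker handle."""
    for worker in workers:
        worker.mark_preempt(info)
```

In this sketch every worker receives the full list of preempted ranks and decides locally whether it is affected, which matches the shape of the log lines below (`this_worker_preempted` differs per rank while `preempted_ranks` is the same).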

Related issues

Additional information

Added unit tests.
Ran this script: https://gist.github.com/liulehui/fc9bd4a6fe58b72a85363c0cda619f7e/edit
Logs confirming that the information was fanned out to the workers:

[14:10:49.909 PDT] [drain] node=62e4d889... drained via GCS, deadline=+15s (14:11:04.905 PDT)
(RayTrainWorker pid=99129) Rank 0 received preemption signal (this_worker_preempted=False, preempted_ranks=[1], deadline_ms=1781557864905).
(RayTrainWorker pid=99128) Rank 1 received preemption signal (this_worker_preempted=True, preempted_ranks=[1], deadline_ms=1781557864905).
(PreemptionWatcher pid=99150) PreemptionWatcher: preemption detected — preempted_node_ids=['62e4d8896da4bad4ce970556147a5a927425ab980253cdcd3af74f6e'], preempted_ranks=[1], deadline_ms=1781557864905
(RayTrainWorker pid=99129) [rank 0] PREEMPT signal: this=False ranks=[1] sec_left=13.7
(RayTrainWorker pid=99128) [rank 1] PREEMPT signal: this=True ranks=[1] sec_left=13.7
(RayTrainWorker pid=99128) [rank 1] PREEMPT signal: this=True ranks=[1] sec_left=13.7
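
The `sec_left` values in the test script output appear to be derived from `deadline_ms`, which looks like a Unix epoch timestamp in milliseconds (1781557864905 corresponds to 14:11:04.905 PDT, matching the drain log line). A small sketch of that conversion, where `seconds_left` is a hypothetical helper and not part of the PR:

```python
import time
from typing import Optional


def seconds_left(deadline_ms: int, now_ms: Optional[int] = None) -> float:
    """Return the time remaining until deadline_ms, in seconds (never negative).

    Assumes deadline_ms is a Unix epoch timestamp in milliseconds.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return max(0.0, (deadline_ms - now_ms) / 1000.0)
```

Passing `now_ms` explicitly keeps the helper deterministic in tests. Clamping at zero is a design choice for this sketch, so callers do not have to handle a negative budget once the deadline has passed.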

Signed-off-by: Lehui Liu <lehui@anyscale.com>

gemini-code-assist

Code Review

This pull request implements a mechanism to forward detected preemption signals from the PreemptionWatcher to individual worker actors, allowing the training loop to react to node drains. It introduces a thread-shared PreemptionContext on the worker's TrainContext and adds a mark_preempt RPC method to the worker actor to receive and store these signals. The feedback highlights a potential race condition where mark_preempt could be called before TrainContext is initialized, and suggests catching RuntimeError to queue the signal. Additionally, suggestions are provided to make internal dataclass fields private by setting init=False and to add defensive checks against unpopulated distributed contexts in the preemption callback.
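
The race condition raised in this review (a signal arriving before the TrainContext exists) could be handled roughly as follows. This is a sketch of the suggested approach only, under the assumption that looking up an uninitialized context raises `RuntimeError`; the `Worker` class, `_get_context`, `init_context`, and the dict used as a context are hypothetical stand-ins, not the PR's implementation.

```python
import threading
from typing import Any, Dict, Optional


class Worker:
    """Toy worker that may receive mark_preempt before its context is ready."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._context: Optional[Dict[str, Any]] = None
        self._pending: Optional[Any] = None

    def _get_context(self) -> Dict[str, Any]:
        # Assumption: accessing the context too early raises RuntimeError.
        if self._context is None:
            raise RuntimeError("TrainContext is not initialized yet")
        return self._context

    def mark_preempt(self, info: Any) -> None:
        with self._lock:
            try:
                context = self._get_context()
            except RuntimeError:
                # Context not ready: queue the signal instead of dropping it.
                self._pending = info
                return
            context["preemption_info"] = info

    def init_context(self) -> None:
        with self._lock:
            # Apply any signal that arrived before initialization.
            self._context = {"preemption_info": self._pending}
            self._pending = None

    def preemption_info(self) -> Optional[Any]:
        with self._lock:
            return None if self._context is None else self._context["preemption_info"]
```

Holding one lock across both `mark_preempt` and `init_context` means a signal is either queued before initialization or written directly after it, so it cannot fall between the two steps.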

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 41588c7.


pseudo-rnd-thoughts

Looks good to me. Do we have any end-to-end tests, such as release tests, to check that spot node preemption works fully?

pseudo-rnd-thoughts · 2026-06-16T08:59:28Z

            error=error,
            training_report=training_report,
            return_value=return_value,
+            preemption_info=train_context.preemption_context.get(),


With the threading lock, do we know the effect as the number of nodes scales? Does it act like a sync barrier between all nodes, or is it async, so that minimal overhead is expected?

liulehui added 3 commits June 15, 2026 09:39

preemption 2/n: fan out preemption info for train workers

7cb7115

Signed-off-by: Lehui Liu <lehui@anyscale.com>

fix unit tests

ab07a76

Signed-off-by: Lehui Liu <lehui@anyscale.com>

fix comments

17cfaf7

Signed-off-by: Lehui Liu <lehui@anyscale.com>

liulehui requested a review from a team as a code owner June 15, 2026 16:56

gemini-code-assist Bot reviewed Jun 15, 2026


Comment thread python/ray/train/v2/_internal/execution/worker_group/worker.py

Comment thread python/ray/train/v2/_internal/execution/preemption.py Outdated

Comment thread python/ray/train/v2/_internal/callbacks/preemption_callback.py

address comment

41588c7

Signed-off-by: Lehui Liu <lehui@anyscale.com>

liulehui force-pushed the preempt-stage2-worker branch from f27329b to 41588c7 Compare June 15, 2026 18:21

cursor Bot reviewed Jun 15, 2026


Comment thread python/ray/train/v2/_internal/callbacks/preemption_callback.py

ray-gardener Bot added the train Ray Train Related Issue label Jun 15, 2026

add TODO for torchft integration

c9ab6d6

Signed-off-by: Lehui Liu <lehui@anyscale.com>

pseudo-rnd-thoughts approved these changes Jun 16, 2026


pseudo-rnd-thoughts added the go add ONLY when ready to merge, run all tests label Jun 16, 2026

liulehui wants to merge 5 commits into ray-project:master from liulehui:preempt-stage2-worker
