
[train][Preemption handling 2/n] Fan out preemption signal to workers #64099

Open
liulehui wants to merge 5 commits into ray-project:master from liulehui:preempt-stage2-worker

@liulehui liulehui commented Jun 15, 2026

Contributor

Description

  1. Builds on the Stage 1 preemption watcher ([train][Preemption handling 1/n] Add preemption watcher for node-drain observability #63807): when the watcher detects a preemption, it now fans the signal out to the worker actors, and each worker stores it in a PreemptionContext on its TrainContext.
  2. Adds a mark_preempt RPC to the Train worker.
  3. The watcher fans out the preemption info through every worker handle.
  4. A follow-up PR will add the public APIs and controller state.
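
The fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Worker` is a plain Python object standing in for the Ray actor, the fields on `PreemptionInfo` are assumed from the values printed in the logs, and `fan_out` is a hypothetical helper name. Only `mark_preempt`, `PreemptionContext`, and its `get()` accessor are names taken from the PR.

```python
import threading
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PreemptionInfo:
    # Assumed fields, mirroring the values printed in the PR's logs.
    preempted_ranks: List[int]
    deadline_ms: int


class PreemptionContext:
    """Holder shared between the RPC thread (writer) and the training thread (reader)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._info: Optional[PreemptionInfo] = None

    def set(self, info: PreemptionInfo) -> None:
        with self._lock:
            self._info = info

    def get(self) -> Optional[PreemptionInfo]:
        with self._lock:
            return self._info


class Worker:
    """Stand-in for a Train worker actor (hypothetical; not a Ray actor)."""

    def __init__(self, rank: int) -> None:
        self.rank = rank
        self.preemption_context = PreemptionContext()

    def mark_preempt(self, info: PreemptionInfo) -> None:
        # In the real system this would be invoked remotely by the watcher.
        self.preemption_context.set(info)

    def this_worker_preempted(self) -> bool:
        info = self.preemption_context.get()
        return info is not None and self.rank in info.preempted_ranks


def fan_out(workers: List[Worker], info: PreemptionInfo) -> None:
    # Hypothetical helper: deliver the same signal through every worker handle.
    for worker in workers:
        worker.mark_preempt(info)


workers = [Worker(rank=0), Worker(rank=1)]
fan_out(workers, PreemptionInfo(preempted_ranks=[1], deadline_ms=1781557864905))
flags = [w.this_worker_preempted() for w in workers]
```

In this sketch each worker owns its own lock, so a read through `get()` only contends with a concurrent write in the same process.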

Related issues

#63968

Additional information

  1. Added unit tests.
  2. Ran this script: https://gist.github.com/liulehui/fc9bd4a6fe58b72a85363c0cda619f7e/edit
    Logs confirming that the information was fanned out to the workers:
[14:10:49.909 PDT] [drain] node=62e4d889... drained via GCS, deadline=+15s (14:11:04.905 PDT)
(RayTrainWorker pid=99129) Rank 0 received preemption signal (this_worker_preempted=False, preempted_ranks=[1], deadline_ms=1781557864905).
(RayTrainWorker pid=99128) Rank 1 received preemption signal (this_worker_preempted=True, preempted_ranks=[1], deadline_ms=1781557864905).
(PreemptionWatcher pid=99150) PreemptionWatcher: preemption detected — preempted_node_ids=['62e4d8896da4bad4ce970556147a5a927425ab980253cdcd3af74f6e'], preempted_ranks=[1], deadline_ms=1781557864905
(RayTrainWorker pid=99129) [rank 0] PREEMPT signal: this=False ranks=[1] sec_left=13.7
(RayTrainWorker pid=99128) [rank 1] PREEMPT signal: this=True ranks=[1] sec_left=13.7
(RayTrainWorker pid=99128) [rank 1] PREEMPT signal: this=True ranks=[1] sec_left=13.7
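
For reference, a value like `sec_left` can be derived from `deadline_ms` with simple arithmetic. The helper below is a hypothetical sketch, not the gist's code; the name `seconds_left` and the clamp to zero are assumptions made for illustration.

```python
def seconds_left(deadline_ms: int, now_ms: int) -> float:
    """Seconds remaining until an epoch-millisecond deadline, clamped at zero."""
    return max(0.0, (deadline_ms - now_ms) / 1000.0)


# The drain log reports deadline=+15s, so at the moment of the drain the
# remaining time is 15 seconds.
deadline_ms = 1781557864905
drain_observed_ms = deadline_ms - 15_000
remaining = seconds_left(deadline_ms, drain_observed_ms)
```

In a live worker, `now_ms` would typically come from `int(time.time() * 1000)`.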

liulehui added 3 commits June 15, 2026 09:39
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
@liulehui liulehui requested a review from a team as a code owner June 15, 2026 16:56

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request implements a mechanism to forward detected preemption signals from the PreemptionWatcher to individual worker actors, allowing the training loop to react to node drains. It introduces a thread-shared PreemptionContext on the worker's TrainContext and adds a mark_preempt RPC method to the worker actor to receive and store these signals. The feedback highlights a potential race condition where mark_preempt could be called before TrainContext is initialized, and suggests catching RuntimeError to queue the signal. Additionally, suggestions are provided to make internal dataclass fields private by setting init=False and to add defensive checks against unpopulated distributed contexts in the preemption callback.
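
One way to handle the race mentioned in this review is to buffer a signal that arrives early and apply it once the context exists. The sketch below is hypothetical and uses stand-in classes rather than the PR's actual worker: `_pending`, `_get_context`, and `init_train_context` are illustrative names, and the `preemption_info` attribute on the stand-in `TrainContext` is an assumption.

```python
import threading
from typing import Any, Optional


class TrainContext:
    """Stand-in for the worker's TrainContext (hypothetical, simplified)."""

    def __init__(self) -> None:
        self.preemption_info: Optional[Any] = None


class Worker:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._context: Optional[TrainContext] = None
        self._pending: Optional[Any] = None  # signal received before init

    def _get_context(self) -> TrainContext:
        if self._context is None:
            raise RuntimeError("TrainContext is not initialized yet")
        return self._context

    def mark_preempt(self, info: Any) -> None:
        with self._lock:
            try:
                self._get_context().preemption_info = info
            except RuntimeError:
                # Context not ready: queue the signal instead of dropping it.
                self._pending = info

    def init_train_context(self) -> None:
        with self._lock:
            self._context = TrainContext()
            if self._pending is not None:
                # Replay the signal that arrived before initialization.
                self._context.preemption_info = self._pending
                self._pending = None


early = Worker()
early.mark_preempt({"preempted_ranks": [1]})  # arrives before init
early.init_train_context()
early_info = early._context.preemption_info
```

Holding the same lock in both methods means a signal is either written directly to the context or queued and replayed, so in this sketch it is never lost between the two steps.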


Comment thread python/ray/train/v2/_internal/execution/worker_group/worker.py
Comment thread python/ray/train/v2/_internal/execution/preemption.py Outdated
Comment thread python/ray/train/v2/_internal/callbacks/preemption_callback.py
Signed-off-by: Lehui Liu <lehui@anyscale.com>
@liulehui liulehui force-pushed the preempt-stage2-worker branch from f27329b to 41588c7 Compare June 15, 2026 18:21

@cursor cursor Bot left a comment



Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 41588c7.

Comment thread python/ray/train/v2/_internal/callbacks/preemption_callback.py
@ray-gardener ray-gardener Bot added the train Ray Train Related Issue label Jun 15, 2026
Signed-off-by: Lehui Liu <lehui@anyscale.com>

@pseudo-rnd-thoughts pseudo-rnd-thoughts left a comment

Member


Looks good to me. Do we have any end-to-end tests, such as release tests, to check that spot node preemption works fully?

Comment thread python/ray/train/v2/_internal/execution/worker_group/worker.py
error=error,
training_report=training_report,
return_value=return_value,
preemption_info=train_context.preemption_context.get(),

Member


With the threading lock, do we know the effect as the number of nodes scales? Does this act like a sync barrier between all nodes, or is it async, so that minimal overhead is expected?

@pseudo-rnd-thoughts pseudo-rnd-thoughts added the go add ONLY when ready to merge, run all tests label Jun 16, 2026

Labels

go: add ONLY when ready to merge, run all tests
train: Ray Train Related Issue


2 participants