Skip to content

Conversation

rmccorm4
Copy link
Contributor

@rmccorm4 rmccorm4 commented Oct 3, 2025

Overview:

Currently the doc test for task tracker is flaky, if you run this in a loop you should see occasional failures:

#!/bin/bash

iteration=1
while true; do
    echo "=== Iteration $iteration ==="
    
    cargo test --locked --doc
    
    if [ $? -ne 0 ]; then
        echo ""
        echo "❌ FAILURE: Doc tests failed on iteration $iteration"
        exit 1
    fi
    
    echo "✓ Iteration $iteration passed"
    echo ""
    
    ((iteration++))
done

This PR aims to resolve the flakiness for more reliable CI.

Passes 100/100 time for me after this change.

Details

Example failure: https://github.com/ai-dynamo/dynamo/actions/runs/18143233794/job/51639015836

failures:

---- lib/runtime/src/utils/tasks/tracker.rs - utils::tasks::tracker::TaskHandle<T>::cancellation_token (line 497) stdout ----
Test executable failed (exit status: 101).

stderr:

thread 'main' panicked at /tmp/rustdoctestxMyMdq/doctest_bundle_2024.rs:597:1:
assertion failed: result.is_err()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Summary by CodeRabbit

  • Bug Fixes

    • Unified task cancellation handling to consistently use each task’s own token, improving predictability and reducing missed cancellations during retries and continuations.
  • Documentation

    • Clarified examples on task cancellation, including guidance on using cancellable spawning for automatic cancellation.

@rmccorm4
Copy link
Contributor Author

rmccorm4 commented Oct 4, 2025

@coderabbitai review

Copy link
Contributor

coderabbitai bot commented Oct 4, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

1 similar comment
Copy link
Contributor

coderabbitai bot commented Oct 4, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

coderabbitai bot commented Oct 4, 2025

Walkthrough

Updates task execution to pass each task’s own cancellation token through all execution and continuation paths, replacing prior usage of child tokens from an inner token. Documentation gains a note clarifying spawn_cancellable behavior. No public APIs changed.

Changes

Cohort / File(s) Summary of Changes
Cancellation token propagation
lib/runtime/src/utils/tasks/tracker.rs
Replaced inner.cancel_token.child_token() with the task’s task_cancellation_token.clone() across regular and cancellable executor paths, including continuations within the retry loop; retained control flow and error handling; added doc note in TaskHandle::cancellation_token example referencing spawn_cancellable.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant Tracker
  participant Executor as TaskExecutor
  participant Task
  participant Cont as Continuation
  participant CT as CancellationToken

  Caller->>Tracker: spawn task
  Tracker->>CT: obtain task-specific token
  loop retry until completion
    Tracker->>Executor: execute(Task, CT)
    alt success
      Executor-->>Tracker: completed
    else continuation
      Executor-->>Tracker: get continuation
      Tracker->>Cont: execute(CT)
      Cont-->>Tracker: result/next
    else error/retry
      Executor-->>Tracker: error
    end
  end
  Tracker-->>Caller: Task result/handle

  note over Tracker,Executor: CT is the task’s own token (no child token)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A hare taps keys at twilight’s brink,
Swaps child for task—one token link.
Retries hop on, continuations queue,
Cancellable burrows, tidy and true.
Docs nibble notes for those who spawn—
A gentle fix, then off I yawn. 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The description includes an overview of the flakiness issue and an example failure, but it omits required sections from the repository template such as “Where should the reviewer start?” and “Related Issues,” and the Details section does not describe the actual code changes made to resolve the flakiness. Please add a “#### Where should the reviewer start?” section pointing to the key files (e.g., lib/runtime/src/utils/tasks/tracker.rs) and a “#### Related Issues” section with any issue links or action keywords, and expand the Details section to summarize the actual code changes introduced to fix the flaky doc test.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title clearly indicates that the pull request addresses a flaky documentation test in the task tracker, which aligns with the author’s primary goal of fixing test flakiness.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1ed700 and b91824a.

📒 Files selected for processing (1)
  • lib/runtime/src/utils/tasks/tracker.rs (2 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@rmccorm4
Copy link
Contributor Author

rmccorm4 commented Oct 4, 2025

@coderabbitai what do you think of this fix? Is it correct, or is there a better or simpler alternative? Is the fix in the spirit of the original doctest? Review thoroughly.

Copy link
Contributor

coderabbitai bot commented Oct 4, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get the PR diff to see what changed
gh pr diff 3414 --repo ai-dynamo/dynamo

Length of output: 2685


🏁 Script executed:

#!/bin/bash
# Read the current state of the file to see the doctest
cat lib/runtime/src/utils/tasks/tracker.rs

Length of output: 247471


🏁 Script executed:

#!/bin/bash
# Search for any related test files or documentation about task tracker
fd -e rs -e md . lib/runtime/src/utils/tasks/ --exec cat {}

Length of output: 271477


@rmccorm4

Short answer

  • Your fix is correct, materially improves correctness (not just the doctest), and is in the spirit of the original intent: “cancel this specific task via the TaskHandle’s token.”
  • The doctest change to use spawn_cancellable is the right call and eliminates the source of flakiness.
  • One small follow-up I recommend: in the retry “reschedule” path, reacquire scheduler resources with the per-task token as well, for consistency.

Details
What was flaky and why

  • The original doctest attempted to cancel a task started with spawn. RegularTaskExecutor ignores the cancel token and the future didn’t check any token, so cancellation via handle.cancellation_token() could not affect the task. That made the assertion “assert!(result.is_err())” racey or just wrong.
  • Switching the doctest to spawn_cancellable and cooperatively checking cancel_token.cancelled() makes cancellation deterministic and aligned with the API.

Correctness of code changes

  • Passing the task’s cancellation token to both executor and continuation:
    • Before: executor/continuation received inner.cancel_token.child_token(), which was not the same as the handle’s token. Per-task cancellation via handle.cancellation_token().cancel() would not be observed by the task.
    • Now: executor/continuation receive task_cancellation_token.clone(), so per-task cancellation is observed by the task/continuations. Tracker-level cancellation still propagates, because the per-task token is a child of the tracker’s token. This is the correct semantics and backwards-compatible.
  • Doctest note clarifying spawn vs spawn_cancellable accurately reflects the design: regular tasks won’t observe cancellation unless they explicitly check a token they own; spawn_cancellable is the mechanism to wire in cooperative cancellation.

One minor inconsistency to consider fixing

  • In execute_with_retry_loop, the initial resource acquisition uses task_cancellation_token.child_token() (good), but when should_reschedule() is true, the reacquisition path uses inner.cancel_token.child_token(). This means a per-task cancel won’t cancel reacquisition in the reschedule leg.
  • Suggested change: use the task token for reacquisition too.

Minimal patch
Rust (tracker.rs, within TaskTrackerInner::execute_with_retry_loop)

// current (reschedule reacquisition)
guard_result = async {
    inner
        .scheduler
        .acquire_execution_slot(inner.cancel_token.child_token())
        .await
}
.instrument(tracing::debug_span!("scheduler_resource_reacquisition"))
.await;

// recommended
guard_result = async {
    inner
        .scheduler
        .acquire_execution_slot(task_cancellation_token.child_token())
        .await
}
.instrument(tracing::debug_span!("scheduler_resource_reacquisition"))
.await;

Why this helps

  • Keeps scheduler-level cancellation consistent across all attempts: a per-task cancel will cancel both the initial acquisition and any later reacquisitions when policies request rescheduling.
  • Preserves parent->child propagation semantics (tracker cancellation still cancels the per-task token).

Is there a simpler alternative to your fix?

  • For the doctest: the only “simpler” alternative would be to avoid asserting and just show the API usage, or to demonstrate abort() instead of cooperative cancellation. Both would reduce educational value. Using spawn_cancellable is the correct and simplest way to demonstrate “cancel this task via its handle token.”
  • For the internal change: using the per-task token is the minimal change that makes handle-driven cancellation actually work; anything else would either be more invasive or semantically wrong.

Overall recommendation

  • Keep the doctest and the executor/continuation token changes as-is.
  • Apply the tiny follow-up change above for the reschedule branch to make cancellation semantics fully consistent.
  • Optional: add a focused test later that cancels a task during a rescheduled acquisition to cement the behavior.

If you want, I can sketch a small test for the reschedule-acquisition case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant