test: Fix flaky task tracker doc test v1 #3414

rmccorm4 · 2025-10-03T20:19:19Z

Overview:

Currently the doc test for task tracker is flaky, if you run this in a loop you should see occasional failures:

#!/bin/bash

iteration=1
while true; do
    echo "=== Iteration $iteration ==="
    
    cargo test --locked --doc
    
    if [ $? -ne 0 ]; then
        echo ""
        echo "❌ FAILURE: Doc tests failed on iteration $iteration"
        exit 1
    fi
    
    echo "✓ Iteration $iteration passed"
    echo ""
    
    ((iteration++))
done

This PR aims to resolve the flakiness for more reliable CI.

Passes 100/100 time for me after this change.

Details

Example failure: https://github.com/ai-dynamo/dynamo/actions/runs/18143233794/job/51639015836

failures:

---- lib/runtime/src/utils/tasks/tracker.rs - utils::tasks::tracker::TaskHandle<T>::cancellation_token (line 497) stdout ----
Test executable failed (exit status: 101).

stderr:

thread 'main' panicked at /tmp/rustdoctestxMyMdq/doctest_bundle_2024.rs:597:1:
assertion failed: result.is_err()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Summary by CodeRabbit

Bug Fixes
- Unified task cancellation handling to consistently use each task’s own token, improving predictability and reducing missed cancellations during retries and continuations.
Documentation
- Clarified examples on task cancellation, including guidance on using cancellable spawning for automatic cancellation.

Signed-off-by: Ryan McCormick <[email protected]>

rmccorm4 · 2025-10-04T18:26:31Z

@coderabbitai review

coderabbitai · 2025-10-04T18:26:56Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2025-10-04T18:27:22Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2025-10-04T18:29:10Z

Walkthrough

Updates task execution to pass each task’s own cancellation token through all execution and continuation paths, replacing prior usage of child tokens from an inner token. Documentation gains a note clarifying spawn_cancellable behavior. No public APIs changed.

Changes

Cohort / File(s)	Summary of Changes
Cancellation token propagation `lib/runtime/src/utils/tasks/tracker.rs`	Replaced `inner.cancel_token.child_token()` with the task’s `task_cancellation_token.clone()` across regular and cancellable executor paths, including continuations within the retry loop; retained control flow and error handling; added doc note in `TaskHandle::cancellation_token` example referencing `spawn_cancellable`.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant Tracker
  participant Executor as TaskExecutor
  participant Task
  participant Cont as Continuation
  participant CT as CancellationToken

  Caller->>Tracker: spawn task
  Tracker->>CT: obtain task-specific token
  loop retry until completion
    Tracker->>Executor: execute(Task, CT)
    alt success
      Executor-->>Tracker: completed
    else continuation
      Executor-->>Tracker: get continuation
      Tracker->>Cont: execute(CT)
      Cont-->>Tracker: result/next
    else error/retry
      Executor-->>Tracker: error
    end
  end
  Tracker-->>Caller: Task result/handle

  note over Tracker,Executor: CT is the task’s own token (no child token)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A hare taps keys at twilight’s brink,
Swaps child for task—one token link.
Retries hop on, continuations queue,
Cancellable burrows, tidy and true.
Docs nibble notes for those who spawn—
A gentle fix, then off I yawn. 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description Check	⚠️ Warning	The description includes an overview of the flakiness issue and an example failure, but it omits required sections from the repository template such as “Where should the reviewer start?” and “Related Issues,” and the Details section does not describe the actual code changes made to resolve the flakiness.	Please add a “#### Where should the reviewer start?” section pointing to the key files (e.g., lib/runtime/src/utils/tasks/tracker.rs) and a “#### Related Issues” section with any issue links or action keywords, and expand the Details section to summarize the actual code changes introduced to fix the flaky doc test.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title clearly indicates that the pull request addresses a flaky documentation test in the task tracker, which aligns with the author’s primary goal of fixing test flakiness.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1ed700 and b91824a.

📒 Files selected for processing (1)

lib/runtime/src/utils/tasks/tracker.rs (2 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

rmccorm4 · 2025-10-04T18:32:51Z

@coderabbitai what do you think of this fix? Is it correct, or is there a better or simpler alternative? Is the fix in the spirit of the original doctest? Review thoroughly.

coderabbitai · 2025-10-04T18:36:36Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get the PR diff to see what changed
gh pr diff 3414 --repo ai-dynamo/dynamo

Length of output: 2685

🏁 Script executed:

#!/bin/bash
# Read the current state of the file to see the doctest
cat lib/runtime/src/utils/tasks/tracker.rs

Length of output: 247471

🏁 Script executed:

#!/bin/bash
# Search for any related test files or documentation about task tracker
fd -e rs -e md . lib/runtime/src/utils/tasks/ --exec cat {}

Length of output: 271477

@rmccorm4

Short answer

Your fix is correct, materially improves correctness (not just the doctest), and is in the spirit of the original intent: “cancel this specific task via the TaskHandle’s token.”
The doctest change to use spawn_cancellable is the right call and eliminates the source of flakiness.
One small follow-up I recommend: in the retry “reschedule” path, reacquire scheduler resources with the per-task token as well, for consistency.

Details
What was flaky and why

The original doctest attempted to cancel a task started with spawn. RegularTaskExecutor ignores the cancel token and the future didn’t check any token, so cancellation via handle.cancellation_token() could not affect the task. That made the assertion “assert!(result.is_err())” racey or just wrong.
Switching the doctest to spawn_cancellable and cooperatively checking cancel_token.cancelled() makes cancellation deterministic and aligned with the API.

Correctness of code changes

Passing the task’s cancellation token to both executor and continuation:
- Before: executor/continuation received inner.cancel_token.child_token(), which was not the same as the handle’s token. Per-task cancellation via handle.cancellation_token().cancel() would not be observed by the task.
- Now: executor/continuation receive task_cancellation_token.clone(), so per-task cancellation is observed by the task/continuations. Tracker-level cancellation still propagates, because the per-task token is a child of the tracker’s token. This is the correct semantics and backwards-compatible.
Doctest note clarifying spawn vs spawn_cancellable accurately reflects the design: regular tasks won’t observe cancellation unless they explicitly check a token they own; spawn_cancellable is the mechanism to wire in cooperative cancellation.

One minor inconsistency to consider fixing

In execute_with_retry_loop, the initial resource acquisition uses task_cancellation_token.child_token() (good), but when should_reschedule() is true, the reacquisition path uses inner.cancel_token.child_token(). This means a per-task cancel won’t cancel reacquisition in the reschedule leg.
Suggested change: use the task token for reacquisition too.

Minimal patch
Rust (tracker.rs, within TaskTrackerInner::execute_with_retry_loop)

// current (reschedule reacquisition)
guard_result = async {
    inner
        .scheduler
        .acquire_execution_slot(inner.cancel_token.child_token())
        .await
}
.instrument(tracing::debug_span!("scheduler_resource_reacquisition"))
.await;

// recommended
guard_result = async {
    inner
        .scheduler
        .acquire_execution_slot(task_cancellation_token.child_token())
        .await
}
.instrument(tracing::debug_span!("scheduler_resource_reacquisition"))
.await;

Why this helps

Keeps scheduler-level cancellation consistent across all attempts: a per-task cancel will cancel both the initial acquisition and any later reacquisitions when policies request rescheduling.
Preserves parent->child propagation semantics (tracker cancellation still cancels the per-task token).

Is there a simpler alternative to your fix?

For the doctest: the only “simpler” alternative would be to avoid asserting and just show the API usage, or to demonstrate abort() instead of cooperative cancellation. Both would reduce educational value. Using spawn_cancellable is the correct and simplest way to demonstrate “cancel this task via its handle token.”
For the internal change: using the per-task token is the minimal change that makes handle-driven cancellation actually work; anything else would either be more invasive or semantically wrong.

Overall recommendation

Keep the doctest and the executor/continuation token changes as-is.
Apply the tiny follow-up change above for the reschedule branch to make cancellation semantics fully consistent.
Optional: add a focused test later that cancels a task during a rescheduled acquisition to cement the behavior.

If you want, I can sketch a small test for the reschedule-acquisition case.

Fix flaky task tracker doc test v1

b91824a

Signed-off-by: Ryan McCormick <[email protected]>

pull-request-size bot added the size/S label Oct 3, 2025

github-actions bot added the test label Oct 3, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test: Fix flaky task tracker doc test v1 #3414

test: Fix flaky task tracker doc test v1 #3414

Uh oh!

rmccorm4 commented Oct 3, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

rmccorm4 commented Oct 4, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented Oct 4, 2025

Uh oh!

coderabbitai bot commented Oct 4, 2025

Uh oh!

coderabbitai bot commented Oct 4, 2025

Uh oh!

rmccorm4 commented Oct 4, 2025

Uh oh!

coderabbitai bot commented Oct 4, 2025

Uh oh!

Uh oh!

test: Fix flaky task tracker doc test v1 #3414

Are you sure you want to change the base?

test: Fix flaky task tracker doc test v1 #3414

Uh oh!

Conversation

rmccorm4 commented Oct 3, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details

Summary by CodeRabbit

Uh oh!

rmccorm4 commented Oct 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Oct 4, 2025

Uh oh!

coderabbitai bot commented Oct 4, 2025

Uh oh!

coderabbitai bot commented Oct 4, 2025

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

rmccorm4 commented Oct 4, 2025

Uh oh!

coderabbitai bot commented Oct 4, 2025

Uh oh!

Uh oh!

rmccorm4 commented Oct 3, 2025 •

edited by coderabbitai bot

Loading

rmccorm4 commented Oct 4, 2025 •

edited

Loading