Document guidelines for physical operator yielding #15030

carols10cents · 2025-03-05T15:28:06Z

This PR starts a policy on the behavior physical operator streams should have, and aims to drive improvements in this area to allow for timely cancellation.

I'm very open to any wording changes and looking forward to discussion on whether I've accurately captured the intent of DataFusion!

Which issue does this PR close?

Connects to #14036 and related to pull requests such as #14028.

Rationale for this change

DataFusion should be explicit about its aspirations for what it provides, and should give trait implementers guidelines about yielding to enable timely cancellation.

What changes are included in this PR?

A policy statement in the documentation with some reference links.

Are these changes tested?

No, this is only documentation.

Are there any user-facing changes?

This is the documentation :)

alamb

Thank you so much @carols10cents -- I think this is super valuable, and in my opinion is not changing the DataFusion semantics, but simply documenting what is implicitly already done in the code.

cc @berkaysynnada / @ozankabak in case you have other thoughts in this area.

berkaysynnada · 2025-03-06T07:51:05Z

datafusion/physical-plan/src/execution_plan.rs

+    /// batches.
+    ///
+    /// The goal is for `datafusion`-provided operator implementation to
+    /// strive for [the guideline of not spending a long time without reaching


I don't know of any implementation that manually yields because it would otherwise spend a long time without reaching a yield point (for CPU-bound work), and I'm also not sure we need that.

carols10cents · 2025-03-06

I'm not sure I'm understanding you correctly, but I believe DataFusion does need this. The cancellation benchmarks I recently added show at least one case where it takes 32ms (on @alamb's machine) to cancel/drop the runtime because some operations aren't yielding often enough (the guidelines I've seen suggest aiming for 1ms of work between yield points).

There are also issues such as #14036 where queries aren't able to be cancelled, and the root cause also appears to be not yielding often enough.

Could you elaborate on why you don't think DataFusion needs manual yields, given the behavior of the benchmarks and the uncancellable query issues?
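One way to make the gap between yield points concrete is to measure how long each individual `poll` call runs, since a future can only be dropped (cancelled) between polls. The following is a minimal sketch using only the Rust standard library. `Busy`, `NoopWake`, and `poll_stats` are hypothetical names made up for illustration; they are not part of DataFusion or its cancellation benchmarks.

```rust
use std::future::Future;
use std::hint::black_box;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

/// Waker that does nothing; enough for a loop that re-polls unconditionally.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical CPU-bound task: `left` units of work, at most `per_poll`
/// of them between two yield points (`per_poll` must be greater than zero).
struct Busy {
    left: u64,
    per_poll: u64,
    acc: u64,
}

impl Future for Busy {
    type Output = u64;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        let this = self.get_mut();
        let n = this.left.min(this.per_poll);
        for i in 0..n {
            // `black_box` keeps the optimizer from deleting the loop.
            this.acc = black_box(this.acc.wrapping_add(i));
        }
        this.left -= n;
        if this.left == 0 {
            return Poll::Ready(this.acc);
        }
        // Yield point: request another poll, then hand control back.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Busy-polls `fut` to completion and returns
/// (number of polls, duration of the longest single poll).
fn poll_stats<F: Future>(fut: F) -> (usize, Duration) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    let (mut polls, mut longest) = (0, Duration::ZERO);
    loop {
        let start = Instant::now();
        let state = fut.as_mut().poll(&mut cx);
        longest = longest.max(start.elapsed());
        polls += 1;
        if state.is_ready() {
            return (polls, longest);
        }
    }
}
```

Calling `poll_stats(Busy { left: n, per_poll: n, acc: 0 })` and then repeating the call with a smaller `per_poll` shows the trade-off: one long poll versus many short ones. The longest poll is roughly how long a cancellation request may have to wait for that task.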

ozankabak · 2025-03-06

I think what he means is that a manual yield shouldn't be necessary for CPU-bound operators in most cases (not all). It can be necessary in certain situations (very large batch sizes, an operator doing superlinear-complexity work w.r.t. batch size, etc.), but that shouldn't be common. I think the documentation should probably state this and give some concrete examples of when manual yielding may be necessary.
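A rough back-of-the-envelope model can illustrate when the two situations named above (very large batches, superlinear work) push a single batch past a yield budget. All numbers here are assumptions chosen for illustration: a cost of 1 ns per step, and the 1ms budget mentioned earlier in this thread. `Growth`, `estimated_ns`, and `may_need_manual_yield` are hypothetical names.

```rust
/// How the amount of work grows with the number of rows in a batch.
#[derive(Clone, Copy)]
enum Growth {
    Linear,
    Quadratic,
}

/// Assumed budget of CPU work between two yield points: 1 ms.
const BUDGET_NS: u128 = 1_000_000;

/// Estimated nanoseconds of CPU work for one batch under the assumed model.
fn estimated_ns(rows: u64, ns_per_step: u64, growth: Growth) -> u128 {
    let n = rows as u128;
    let steps = match growth {
        Growth::Linear => n,
        Growth::Quadratic => n * n,
    };
    steps * ns_per_step as u128
}

/// True when a single batch is estimated to exceed the budget, meaning the
/// operator may want a yield point inside its per-batch work.
fn may_need_manual_yield(rows: u64, ns_per_step: u64, growth: Growth) -> bool {
    estimated_ns(rows, ns_per_step, growth) > BUDGET_NS
}
```

Under these assumptions, a linear pass over 8,192 rows costs about 8 µs and stays far below the budget, a quadratic pass over the same batch costs about 67 ms, and even a linear pass crosses 1ms once a batch reaches roughly a million rows.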

crepererum · 2025-03-07

> I think what he means is that in most cases a manual yield shouldn't be necessary for CPU-bound operators in most cases (not all).

I think we can state that if your work packages are record batches and your compute complexity is linear (like filter and map/project operations), then you probably don't need this. But if you do any form of aggregation or have super-linear behavior (e.g. unnest, data decompression), then you must think about this issue.
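As a sketch of the aggregation case: an operator that folds an always-ready input into a single result does not return to its caller until the input is exhausted, unless it counts its own work. The example below is a hypothetical, standard-library-only illustration (`Batch`, `SumAll`, `NoopWake`, and `run` are invented names, and a plain `Iterator` stands in for the input stream). It is not how DataFusion's aggregation operators are implemented.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a record batch.
type Batch = Vec<i64>;

/// Hypothetical aggregation: sums every value of an always-ready input,
/// processing at most `budget` batches per poll.
struct SumAll<I> {
    input: I,
    total: i64,
    budget: usize,
}

impl<I: Iterator<Item = Batch> + Unpin> Future for SumAll<I> {
    type Output = i64;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i64> {
        let this = self.get_mut();
        for _ in 0..this.budget {
            match this.input.next() {
                Some(batch) => this.total += batch.iter().sum::<i64>(),
                None => return Poll::Ready(this.total),
            }
        }
        // Budget used up: schedule another poll and hand control back.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Waker that does nothing; the driver below re-polls unconditionally.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the aggregation at most `max_polls` times. Returning `None`
/// drops the operator, which models cancelling the query mid-aggregation.
fn run<I>(input: I, budget: usize, max_polls: usize) -> Option<i64>
where
    I: Iterator<Item = Batch> + Unpin,
{
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut op = SumAll { input, total: 0, budget };
    for _ in 0..max_polls {
        if let Poll::Ready(sum) = Pin::new(&mut op).poll(&mut cx) {
            return Some(sum);
        }
    }
    None
}
```

With a finite budget, even an endless input such as `std::iter::repeat(vec![1])` can be abandoned after a few polls. With `budget` set to `usize::MAX`, the same input would keep the first poll running forever, which is the shape of an uncancellable query.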

carols10cents · 2025-03-07

Could someone make a concrete suggestion here on how the text should be changed for discussion?

ozankabak · 2025-03-06T21:32:45Z

Thanks for improving the docs, left my suggestions inline

carols10cents · 2025-03-07T14:58:23Z

I just pushed some more commits addressing some comments; there is one TODO commit in there that I will update once #15054 has been merged in so that I can link to the relevant part of the benchmarks readme that I added in that PR. I will squash all these commits when we're done revising.

ozankabak

Left a suggestion per your request

alamb · 2025-03-12T20:18:30Z

> I just pushed some more commits addressing some comments; there is one TODO commit in there that I will update once #15054 has been merged in so that I can link to the relevant part of the benchmarks readme that I added in that PR. I will squash all these commits when we're done revising.

#15054 has been merged. @carols10cents are you willing to update this PR again? If not I can do so too

carols10cents · 2025-03-13T14:42:45Z

> #15054 has been merged. @carols10cents are you willing to update this PR again? If not I can do so too

Whoops, just updated! I also took @ozankabak's suggestion since there were no objections; thank you!

How is this looking now?

ozankabak

LGTM - thank you!

alamb

Thank you @carols10cents and @ozankabak @berkaysynnada and @crepererum

alamb · 2025-03-14T19:00:01Z

🚀 📖

Document guidelines for physical operator yielding

be38639

alamb reviewed Mar 5, 2025

berkaysynnada reviewed Mar 6, 2025

berkaysynnada reviewed Mar 6, 2025

carols10cents added 3 commits March 6, 2025 14:27

Remove discussion of tokio coop

14560df

Move rationale up

5cf7332

TODO ADD LINK

3d1257e

carols10cents mentioned this pull request Mar 6, 2025

Improve benchmark documentation #15054

Merged

carols10cents added 2 commits March 6, 2025 14:49

Say block the CPU rather than pin the CPU

d4c1e14

Add a caveat to use the right tool for the situation

286f62e

ozankabak reviewed Mar 7, 2025

carols10cents and others added 5 commits March 13, 2025 10:35

Improve documentation of yield guidelines

c56523b

Co-authored-by: Mehmet Ozan Kabak

Merge remote-tracking branch 'upstream/main' into yielding-guideline

c88511f

Fix newlines and whitespace in comment

ab1c086

Add a link to the cancellation benchmark documented in the README

e5d0f8d

Fix newlines in benchmarks README

d42889b

ozankabak approved these changes Mar 13, 2025

alamb approved these changes Mar 13, 2025

alamb merged commit 072098e into apache:main Mar 14, 2025
26 checks passed

carols10cents deleted the yielding-guideline branch March 14, 2025 19:57
