fix(cudf): Fix UCX output buffer lifetime after async cuDF concat#17533
fix(cudf): Fix UCX output buffer lifetime after async cuDF concat#17533kjmph wants to merge 2 commits into
Conversation
✅ Deploy Preview for meta-velox canceled.
|
|
Hi @kjmph! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
| #include <cudf/column/column.hpp> | ||
| #include <cudf/table/table.hpp> | ||
|
|
||
| #if __has_include(<cudf/column/column_stream.hpp>) |
There was a problem hiding this comment.
Is this for backwards compatibility with older cudf versions? We should unconditionally include and use this feature, and update our pinned cudf commit if needed to support it. Run scripts/update-cudf-deps.sh --branch main.
There was a problem hiding this comment.
That's correct, the intention was backwards compatibility. It seems upgrading the dependencies would be a larger change? Or, do we prefer to serialize these changes so we don't need to maintain conditional includes? This Use-after-free PR could be broken out to not include rebind_stream, if it we want to concurrently fix the UAF and update dependencies. (Since I'm assuming updating dependencies is a full test / regression cycle?)
There was a problem hiding this comment.
open PR #17572 is upgrading cudf commit.
This could PR could follow after that PR merges.
There was a problem hiding this comment.
Sorry for the delayed response.
If you need a new feature/fix, please use scripts/update-cudf-deps.sh to update. As long as cuDF support is experimental, we are fine letting the version requirement float forward when updates are needed. Eventually we will need to establish a version support policy.
I proposed in #17572 (really in zoltan#1) to adopt the latest commits from --branch release/26.06 because the 26.08 changes in main aren't compatible until we land some form of CMake 4.0 update, which is in progress in #17637.
There was a problem hiding this comment.
I traced through this. I started this on an older branch, and migrated it to the newest branch. The commit in this PR is still valid, but in the intervening time cuDF was upgraded with this subsequent commit:
54bac93
Which upgraded the cuDF dependency to d09d10d14d3ed932b8de93638809101af5c7fec3. If I check that version I see that it does have rebind_stream:
https://github.com/rapidsai/cudf/blob/d09d10d14d3ed932b8de93638809101af5c7fec3/cpp/include/cudf/column/column_stream.hpp#L21-L40
So, perhaps we can remove this backwards compatibility shim now, and not serialize this PR behind another upgrade cycle?
There was a problem hiding this comment.
Yes, that's correct. We should be clear to remove this compatibility shim now that we are pinning a newer cudf commit.
Build Impact AnalysisFull build recommended. Files outside the dependency graph changed:
These directories are not fully covered by the dependency graph. A full build is the safest option. Fast path • Graph from main@905d3a48e7238ad44559ff2003cd49b0d67f0b2c |
CI Failure Analysis
❌ Linux release with adapters — BUILD Failure View logsBuild error:
Correlation with PR changes: This failure is not caused by the PR's own code changes — the PR only modifies files in However, the failure is caused by the PR branch being out of date with main. Specifically:
The "Linux release with adapters" job passes on main (run 26001024670). Known issues:
Recommended fix: Rebase the PR branch onto current git fetch upstream
git rebase upstream/main
git push --force-with-leaseThis should resolve the build failure without any code changes to this PR. |
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
997def3 to
832f70e
Compare
| if (auto* tablePtr = | ||
| std::get_if<std::unique_ptr<cudf::table>>(&tableStorage_)) { | ||
| if (!*tablePtr) { |
There was a problem hiding this comment.
should we ask libcudf for rebind_stream for packed_table or packed_column too?
Open up an issue in cudf repo.
There was a problem hiding this comment.
If yes; It is possible to implement it here. But I suggest to ask libcudf to provide one.
There was a problem hiding this comment.
Ah, that's a good observation. I recommend that we implement that in libcudf.
There was a problem hiding this comment.
Thank you! It seems that this is a miss in the proposed PR, we don't handle packed_table correctly. Irrespective of libcudf support, would we accept this change?
bool CudfVector::rebindStream(rmm::cuda_stream_view stream) {
if (auto* tablePtr =
std::get_if<std::unique_ptr<cudf::table>>(&tableStorage_)) {
if (!*tablePtr) {
return false;
}
if (stream_.value() == stream.value()) {
return true;
}
#if VELOX_CUDF_HAS_COLUMN_REBIND_STREAM
auto columns = (*tablePtr)->release();
for (auto& column : columns) {
column = cudf::rebind_stream(std::move(*column), stream);
}
*tablePtr = std::make_unique<cudf::table>(std::move(columns));
tabView_ = (*tablePtr)->view();
stream_ = stream;
return true;
#else
return false;
#endif
}
if (auto* packedPtr =
std::get_if<std::unique_ptr<cudf::packed_table>>(&tableStorage_)) {
if (!*packedPtr) {
return false;
}
(*packedPtr)->data.gpu_data->set_stream(stream);
stream_ = stream;
return true;
}
return false;
}To ensure this is correct, I created two new test cases, rebindOwnedTableDeallocationStream and rebindPackedTableDeallocationStream. I can verify that the second test case failed previously, but with the above version of CudfVector::rebindStream, it passes. I can add these two tests to this PR, if that's reasonable?
There was a problem hiding this comment.
(Side note, see the other discussion, maybe we ran remove the #if / #else)
There was a problem hiding this comment.
would we accept this change?
Great idea. Yes, this seems reasonable.
And yes, please add those tests. As noted above, we can assume VELOX_CUDF_HAS_COLUMN_REBIND_STREAM now.
| } | ||
|
|
||
| if (!allRebound) { | ||
| CudaEvent event(cudaEventDisableTiming); |
There was a problem hiding this comment.
Do we just want to do the same thing for Velox that we're doing in libcudf? Seems like a similar use case.
There was a problem hiding this comment.
Would you like me to propose a version that does this? In the same PR branch?
There was a problem hiding this comment.
The libcudf solution seems fine. I don't see any issues using a TLS event for this use case.
There was a problem hiding this comment.
Yes, I think so. Let's adopt that TLS event solution here.
UcxPartitionedOutput::flushPending() synchronizes input streams before launching cudf::concatenate(), but concatenate itself is asynchronous on the output stream. When multiple pending inputs were flushed, the operator could clear pendingInputs_ immediately after launching concat, allowing the input CudfVector buffers to be deallocated before the concat kernels finished reading them. This caused nondeterministic GPU result corruption. A TPC-H Q18 correctness loop against a DuckDB reference reproduced the issue regularly, and compute-sanitizer reported a use-after-free with this stack: cudf::concatenate UcxPartitionedOutput::flushPending Fix the lifetime ordering without a CPU-blocking synchronize: * make the concat/output stream wait for all input streams before reading input table views; * prefer rebinding owned cuDF input buffers to the output stream so their stream-ordered deallocation happens after concat work on that stream; * fall back to a single event recorded on the output stream and waited by all input streams when inputs cannot be rebound. This keeps the existing memory behavior of releasing pending input buffers before partitioning, while making their async lifetime ordering explicit. Validated by: * TPC-H Q18 GPU correctness loop against stored DuckDB reference * compute-sanitizer with stream-ordered race tracking, confirming the original concat use-after-free is no longer reported
Update the cuDF vector deallocation ordering helpers to accept std::span instead of concrete std::vector references. These helpers only need read-only views over CudfVector pointers and CUDA streams.
832f70e to
a844318
Compare
UcxPartitionedOutput::flushPending() synchronizes input streams before launching cudf::concatenate(), but concatenate itself is asynchronous on the output stream. When multiple pending inputs were flushed, the operator could clear pendingInputs_ immediately after launching concat, allowing the input CudfVector buffers to be deallocated before the concat kernels finished reading them.
This caused nondeterministic GPU result corruption. A TPC-H Q18 correctness loop against a DuckDB reference reproduced the issue regularly, and compute-sanitizer reported a use-after-free with this stack:
cudf::concatenate
UcxPartitionedOutput::flushPending
Fix the lifetime ordering without a CPU-blocking synchronize:
This keeps the existing memory behavior of releasing pending input buffers before partitioning, while making their async lifetime ordering explicit.
Validated by: