Skip to content

fix(cudf): Fix UCX output buffer lifetime after async cuDF concat#17533

Open
kjmph wants to merge 2 commits into
facebookincubator:mainfrom
kjmph:fix/experimental-cudf/use-after-free
Open

fix(cudf): Fix UCX output buffer lifetime after async cuDF concat#17533
kjmph wants to merge 2 commits into
facebookincubator:mainfrom
kjmph:fix/experimental-cudf/use-after-free

Conversation

@kjmph
Copy link
Copy Markdown

@kjmph kjmph commented May 15, 2026

UcxPartitionedOutput::flushPending() synchronizes input streams before launching cudf::concatenate(), but concatenate itself is asynchronous on the output stream. When multiple pending inputs were flushed, the operator could clear pendingInputs_ immediately after launching concat, allowing the input CudfVector buffers to be deallocated before the concat kernels finished reading them.

This caused nondeterministic GPU result corruption. A TPC-H Q18 correctness loop against a DuckDB reference reproduced the issue regularly, and compute-sanitizer reported a use-after-free with this stack:

cudf::concatenate
UcxPartitionedOutput::flushPending

Fix the lifetime ordering without a CPU-blocking synchronize:

  • make the concat/output stream wait for all input streams before reading input table views;
  • prefer rebinding owned cuDF input buffers to the output stream so their stream-ordered deallocation happens after concat work on that stream;
  • fall back to a single event recorded on the output stream and waited by all input streams when inputs cannot be rebound.

This keeps the existing memory behavior of releasing pending input buffers before partitioning, while making their async lifetime ordering explicit.

Validated by:

  • TPC-H Q18 GPU correctness loop against stored DuckDB reference
  • compute-sanitizer with stream-ordered race tracking, confirming the original concat use-after-free is no longer reported

@netlify
Copy link
Copy Markdown

netlify Bot commented May 15, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit a844318
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/6a218dac8861f00008adcd86

@meta-cla
Copy link
Copy Markdown

meta-cla Bot commented May 15, 2026

Hi @kjmph!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

#include <cudf/column/column.hpp>
#include <cudf/table/table.hpp>

#if __has_include(<cudf/column/column_stream.hpp>)
Copy link
Copy Markdown
Collaborator

@bdice bdice May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for backwards compatibility with older cudf versions? We should unconditionally include and use this feature, and update our pinned cudf commit if needed to support it. Run scripts/update-cudf-deps.sh --branch main.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, the intention was backwards compatibility. It seems upgrading the dependencies would be a larger change? Or, do we prefer to serialize these changes so we don't need to maintain conditional includes? This Use-after-free PR could be broken out to not include rebind_stream, if it we want to concurrently fix the UAF and update dependencies. (Since I'm assuming updating dependencies is a full test / regression cycle?)

Copy link
Copy Markdown
Collaborator

@karthikeyann karthikeyann Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open PR #17572 is upgrading cudf commit.
This could PR could follow after that PR merges.

Copy link
Copy Markdown
Collaborator

@bdice bdice Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delayed response.

If you need a new feature/fix, please use scripts/update-cudf-deps.sh to update. As long as cuDF support is experimental, we are fine letting the version requirement float forward when updates are needed. Eventually we will need to establish a version support policy.

I proposed in #17572 (really in zoltan#1) to adopt the latest commits from --branch release/26.06 because the 26.08 changes in main aren't compatible until we land some form of CMake 4.0 update, which is in progress in #17637.

Copy link
Copy Markdown
Author

@kjmph kjmph Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I traced through this. I started this on an older branch, and migrated it to the newest branch. The commit in this PR is still valid, but in the intervening time cuDF was upgraded with this subsequent commit:
54bac93

Which upgraded the cuDF dependency to d09d10d14d3ed932b8de93638809101af5c7fec3. If I check that version I see that it does have rebind_stream:
https://github.com/rapidsai/cudf/blob/d09d10d14d3ed932b8de93638809101af5c7fec3/cpp/include/cudf/column/column_stream.hpp#L21-L40

So, perhaps we can remove this backwards compatibility shim now, and not serialize this PR behind another upgrade cycle?

Copy link
Copy Markdown
Collaborator

@bdice bdice Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. We should be clear to remove this compatibility shim now that we are pinning a newer cudf commit.

Comment thread velox/experimental/cudf/exec/Utilities.cpp Outdated
Comment thread velox/experimental/cudf/vector/CudfVector.cpp
@mbasmanova mbasmanova changed the title Fix UCX output buffer lifetime after async cuDF concat fix(cudf): Fix UCX output buffer lifetime after async cuDF concat May 18, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 18, 2026

Build Impact Analysis

Full build recommended. Files outside the dependency graph changed:

  • velox/experimental/cudf/exec/Utilities.cpp
  • velox/experimental/cudf/exec/Utilities.h
  • velox/experimental/cudf/vector/CudfVector.cpp
  • velox/experimental/cudf/vector/CudfVector.h
  • velox/experimental/ucx-exchange/UcxPartitionedOutput.cpp

These directories are not fully covered by the dependency graph. A full build is the safest option.

cmake --build _build/release

Fast path • Graph from main@905d3a48e7238ad44559ff2003cd49b0d67f0b2c

@github-actions
Copy link
Copy Markdown

CI Failure Analysis

Auto-generated by the CI Failure Analysis workflow. This comment is updated in place each time CI fails on a new commit, so it always reflects the latest run — re-pushing or re-running CI will refresh the analysis below. Last updated 2026-05-18 02:19:38 UTC from workflow run 25934647000.

❌ Linux release with adapters — BUILD Failure View logs

Build error:

velox/experimental/cudf/connectors/hive/CudfSplitReader.cpp:77:7: error:
  no matching function for call to
  'facebook::velox::dwio::common::ReaderOptions::ReaderOptions(
      facebook::velox::memory::MemoryPool*&,
      std::__shared_ptr<facebook::velox::io::IoStatistics, ...>::element_type*,
      std::__shared_ptr<facebook::velox::io::IoStatistics, ...>::element_type*)'

   77 |       baseReaderOpts_(pool_, ioStatistics_.get(), ioStatistics_.get()),
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

note: candidate: 'ReaderOptions(velox::memory::MemoryPool*)' — expects 1 argument, 3 provided
  • File: velox/experimental/cudf/connectors/hive/CudfSplitReader.cpp:77
  • Root cause: The ReaderOptions constructor is called with 3 arguments, but the class only has a 1-argument constructor taking MemoryPool*.

Correlation with PR changes:

This failure is not caused by the PR's own code changes — the PR only modifies files in velox/experimental/cudf/exec/, velox/experimental/cudf/vector/, and velox/experimental/ucx-exchange/. The failing file CudfSplitReader.cpp is not touched by this PR's diff.

However, the failure is caused by the PR branch being out of date with main. Specifically:

The "Linux release with adapters" job passes on main (run 26001024670).

Known issues:

Recommended fix:

Rebase the PR branch onto current main to pick up the fix from PR #17532:

git fetch upstream
git rebase upstream/main
git push --force-with-lease

This should resolve the build failure without any code changes to this PR.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 27, 2026
@meta-cla
Copy link
Copy Markdown

meta-cla Bot commented May 27, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Comment on lines +191 to +193
if (auto* tablePtr =
std::get_if<std::unique_ptr<cudf::table>>(&tableStorage_)) {
if (!*tablePtr) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we ask libcudf for rebind_stream for packed_table or packed_column too?
Open up an issue in cudf repo.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If yes; It is possible to implement it here. But I suggest to ask libcudf to provide one.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a good observation. I recommend that we implement that in libcudf.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! It seems that this is a miss in the proposed PR, we don't handle packed_table correctly. Irrespective of libcudf support, would we accept this change?

bool CudfVector::rebindStream(rmm::cuda_stream_view stream) {
  if (auto* tablePtr =
          std::get_if<std::unique_ptr<cudf::table>>(&tableStorage_)) {
    if (!*tablePtr) {
      return false;
    }

    if (stream_.value() == stream.value()) {
      return true;
    }

#if VELOX_CUDF_HAS_COLUMN_REBIND_STREAM
    auto columns = (*tablePtr)->release();
    for (auto& column : columns) {
      column = cudf::rebind_stream(std::move(*column), stream);
    }

    *tablePtr = std::make_unique<cudf::table>(std::move(columns));
    tabView_ = (*tablePtr)->view();
    stream_ = stream;
    return true;
#else
    return false;
#endif
  }

  if (auto* packedPtr =
          std::get_if<std::unique_ptr<cudf::packed_table>>(&tableStorage_)) {
    if (!*packedPtr) {
      return false;
    }

    (*packedPtr)->data.gpu_data->set_stream(stream);
    stream_ = stream;
    return true;
  }

  return false;
}

To ensure this is correct, I created two new test cases, rebindOwnedTableDeallocationStream and rebindPackedTableDeallocationStream. I can verify that the second test case failed previously, but with the above version of CudfVector::rebindStream, it passes. I can add these two tests to this PR, if that's reasonable?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Side note, see the other discussion, maybe we ran remove the #if / #else)

Copy link
Copy Markdown
Collaborator

@bdice bdice Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we accept this change?

Great idea. Yes, this seems reasonable.

And yes, please add those tests. As noted above, we can assume VELOX_CUDF_HAS_COLUMN_REBIND_STREAM now.

}

if (!allRebound) {
CudaEvent event(cudaEventDisableTiming);
Copy link
Copy Markdown
Collaborator

@karthikeyann karthikeyann Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @bdice @vyasr please suggest an idea here.
Event creation is costly. The join_stream in cudf uses a thread_local static to avoid creating repeatedly.

https://github.com/rapidsai/cudf/blob/3b337d749d13bc8ab3a481cf09d1b5a66dd136c9/cpp/src/utilities/stream_pool.cpp#L96

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we just want to do the same thing for Velox that we're doing in libcudf? Seems like a similar use case.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like me to propose a version that does this? In the same PR branch?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The libcudf solution seems fine. I don't see any issues using a TLS event for this use case.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. Let's adopt that TLS event solution here.

kjmph added 2 commits June 4, 2026 10:37
UcxPartitionedOutput::flushPending() synchronizes input streams before
launching cudf::concatenate(), but concatenate itself is asynchronous
on the output stream. When multiple pending inputs were flushed, the
operator could clear pendingInputs_ immediately after launching
concat, allowing the input CudfVector buffers to be deallocated before
the concat kernels finished reading them.

This caused nondeterministic GPU result corruption. A TPC-H Q18
correctness loop against a DuckDB reference reproduced the issue
regularly, and compute-sanitizer reported a use-after-free with this
stack:

  cudf::concatenate
  UcxPartitionedOutput::flushPending

Fix the lifetime ordering without a CPU-blocking synchronize:

* make the concat/output stream wait for all input streams before
  reading input table views;
* prefer rebinding owned cuDF input buffers to the output stream so
  their stream-ordered deallocation happens after concat work on that
  stream;
* fall back to a single event recorded on the output stream and waited
  by all input streams when inputs cannot be rebound.

This keeps the existing memory behavior of releasing pending input
buffers before partitioning, while making their async lifetime
ordering explicit.

Validated by:
* TPC-H Q18 GPU correctness loop against stored DuckDB reference
* compute-sanitizer with stream-ordered race tracking, confirming the
  original concat use-after-free is no longer reported
Update the cuDF vector deallocation ordering helpers to accept
std::span instead of concrete std::vector references. These helpers
only need read-only views over CudfVector pointers and CUDA streams.
@kjmph kjmph force-pushed the fix/experimental-cudf/use-after-free branch from 832f70e to a844318 Compare June 4, 2026 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants