-
Notifications
You must be signed in to change notification settings - Fork 1.5k
fix(cudf): Fix UCX output buffer lifetime after async cuDF concat #17533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,6 +25,13 @@ | |
| #include <cudf/column/column.hpp> | ||
| #include <cudf/table/table.hpp> | ||
|
|
||
| #if __has_include(<cudf/column/column_stream.hpp>) | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this for backwards compatibility with older cudf versions? We should unconditionally include and use this feature, and update our pinned cudf commit if needed to support it. Run
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's correct, the intention was backwards compatibility. It seems upgrading the dependencies would be a larger change? Or, do we prefer to serialize these changes so we don't need to maintain conditional includes? This Use-after-free PR could be broken out to not include
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. open PR #17572 is upgrading cudf commit.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry for the delayed response. If you need a new feature/fix, please use I proposed in #17572 (really in zoltan#1) to adopt the latest commits from
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I traced through this. I started this on an older branch, and migrated it to the newest branch. The commit in this PR is still valid, but in the intervening time cuDF was upgraded with this subsequent commit: Which upgraded the cuDF dependency to So, perhaps we can remove this backwards compatibility shim now, and not serialize this PR behind another upgrade cycle?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that's correct. We should be clear to remove this compatibility shim now that we are pinning a newer cudf commit. |
||
| #include <cudf/column/column_stream.hpp> | ||
| #define VELOX_CUDF_HAS_COLUMN_REBIND_STREAM 1 | ||
| #else | ||
| #define VELOX_CUDF_HAS_COLUMN_REBIND_STREAM 0 | ||
| #endif | ||
|
|
||
| namespace facebook::velox::cudf_velox { | ||
| namespace { | ||
|
|
||
|
|
@@ -175,6 +182,32 @@ std::unique_ptr<cudf::table> CudfVector::release() { | |
| return materializedTable; | ||
| } | ||
|
|
||
| bool CudfVector::rebindStream(rmm::cuda_stream_view stream) { | ||
| if (stream_.value() == stream.value()) { | ||
| return true; | ||
| } | ||
|
|
||
| #if VELOX_CUDF_HAS_COLUMN_REBIND_STREAM | ||
| if (auto* tablePtr = | ||
| std::get_if<std::unique_ptr<cudf::table>>(&tableStorage_)) { | ||
| if (!*tablePtr) { | ||
|
Comment on lines
+191
to
+193
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should we ask libcudf for rebind_stream for packed_table or packed_column too?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If yes; It is possible to implement it here. But I suggest to ask libcudf to provide one.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, that's a good observation. I recommend that we implement that in libcudf.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you! It seems that this is a miss in the proposed PR, we don't handle bool CudfVector::rebindStream(rmm::cuda_stream_view stream) {
if (auto* tablePtr =
std::get_if<std::unique_ptr<cudf::table>>(&tableStorage_)) {
if (!*tablePtr) {
return false;
}
if (stream_.value() == stream.value()) {
return true;
}
#if VELOX_CUDF_HAS_COLUMN_REBIND_STREAM
auto columns = (*tablePtr)->release();
for (auto& column : columns) {
column = cudf::rebind_stream(std::move(*column), stream);
}
*tablePtr = std::make_unique<cudf::table>(std::move(columns));
tabView_ = (*tablePtr)->view();
stream_ = stream;
return true;
#else
return false;
#endif
}
if (auto* packedPtr =
std::get_if<std::unique_ptr<cudf::packed_table>>(&tableStorage_)) {
if (!*packedPtr) {
return false;
}
(*packedPtr)->data.gpu_data->set_stream(stream);
stream_ = stream;
return true;
}
return false;
}To ensure this is correct, I created two new test cases,
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (Side note, see the other discussion, maybe we ran remove the
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Great idea. Yes, this seems reasonable. And yes, please add those tests. As noted above, we can assume |
||
| return false; | ||
| } | ||
|
|
||
| auto columns = (*tablePtr)->release(); | ||
| for (auto& column : columns) { | ||
| column = cudf::rebind_stream(std::move(*column), stream); | ||
| } | ||
|
|
||
| *tablePtr = std::make_unique<cudf::table>(std::move(columns)); | ||
| tabView_ = (*tablePtr)->view(); | ||
|
kjmph marked this conversation as resolved.
|
||
| stream_ = stream; | ||
| return true; | ||
| } | ||
| #endif | ||
| return false; | ||
| } | ||
|
|
||
| uint64_t CudfVector::estimateFlatSize() const { | ||
| return flatSize_; | ||
| } | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @bdice @vyasr please suggest an idea here.
Event creation is costly. The join_stream in cudf uses a
thread_local staticto avoid creating repeatedly.https://github.com/rapidsai/cudf/blob/3b337d749d13bc8ab3a481cf09d1b5a66dd136c9/cpp/src/utilities/stream_pool.cpp#L96
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we just want to do the same thing for Velox that we're doing in libcudf? Seems like a similar use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like me to propose a version that does this? In the same PR branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The libcudf solution seems fine. I don't see any issues using a TLS event for this use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so. Let's adopt that TLS event solution here.