Reuse existing Memory Resource in rmpf_worker_setup when possible
#528
base: main
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
A bit more on the "[…]" point above: we still wouldn't attempt to support arbitrary device resources, which perhaps limits the scope to a point that makes it feasible? cc @madsbk if you have any thoughts on that.
madsbk left a comment:
I think your current approach, retrieving the `RmmResourceAdaptor` from the stack, is good!
But consider implementing `get_rmm_memory_resource_stack()` to retrieve the current `RmmResourceAdaptor` and, if none exists, create (and push) a new adaptor on top of the stack.
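
A minimal sketch of what such a helper could look like; the `RmmResourceAdaptor` import path, its constructor signature, and the `upstream_mr` attribute used to walk the adaptor chain are assumptions rather than the actual rapidsmpf/rmm internals:

```python
# Rough sketch only: import path, constructor, and `upstream_mr` are assumptions.
import rmm.mr

from rapidsmpf.rmm_resource_adaptor import RmmResourceAdaptor  # hypothetical path


def get_rmm_memory_resource_stack() -> RmmResourceAdaptor:
    """Return the RmmResourceAdaptor from the current resource stack.

    If none is found, create a new adaptor wrapping the current device
    resource and push it on top of the stack.
    """
    mr = rmm.mr.get_current_device_resource()

    # Walk the chain of adaptors looking for an existing RmmResourceAdaptor.
    current = mr
    while current is not None:
        if isinstance(current, RmmResourceAdaptor):
            return current
        # Non-adaptor resources have no upstream, which ends the walk.
        current = getattr(current, "upstream_mr", None)

    # No RmmResourceAdaptor anywhere in the stack: layer a new one on top
    # and make it the current device resource.
    adaptor = RmmResourceAdaptor(mr)  # constructor signature assumed
    rmm.mr.set_current_device_resource(adaptor)
    return adaptor
```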
madsbk left a comment:
Overall looks good
Hmm, I guess a test is hanging: https://github.com/rapidsai/rapidsmpf/actions/runs/17982528193/job/51154821364?pr=528#step:10:2118... Trying to reproduce locally. Edit: yep, able to reproduce here (and not on […]).
Interestingly, I can only reproduce the hang in […]. That's the case that layers […]. I'm looking into why this case specifically triggers the subsequent hang.
rapidsai/cudf@e4391fc works around this issue. Previously, I was using the […]. I'll think a bit more about whether this is worth fixing, but for now I think the workaround will be fine.
I can avoid the hang if I change […]. That's essentially bypassing everything from […]. I might step back a bit and try to solve this a different way.
This updates `rmpf_worker_setup` to reuse the current device resource (i.e. not layer a new resource on top) when it is:

1. an `RmmResourceAdaptor`
2. an `rmm.mr.StatisticsResourceAdaptor` whose upstream MR is an `RmmResourceAdaptor`

I think that supporting 1 should be safe. I'm hoping that supporting number 2 is also safe, but there are a couple spots (our `Statistics` and `LimitAvailableMemory`) that really do want an `RmmResourceAdaptor`. For now I've extracted the `RmmResourceAdaptor` from the `StatisticsResourceAdaptor`. I'm guessing that any allocations made directly against that `RmmResourceAdaptor` will throw off the RMM statistics.

Closes #523
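
A rough sketch of the reuse logic described above; the import path, the constructor signature, and the `upstream_mr` attribute are assumptions, not the exact rapidsmpf code:

```python
# Sketch of the two reuse cases plus the existing fallback; names are assumed.
import rmm.mr

from rapidsmpf.rmm_resource_adaptor import RmmResourceAdaptor  # hypothetical path


def _reuse_or_create_rmm_resource_adaptor() -> RmmResourceAdaptor:
    mr = rmm.mr.get_current_device_resource()

    # Case 1: the current device resource is already an RmmResourceAdaptor.
    if isinstance(mr, RmmResourceAdaptor):
        return mr

    # Case 2: a StatisticsResourceAdaptor layered over an RmmResourceAdaptor.
    # Extract the adaptor; note that allocations made directly against it
    # bypass (and therefore throw off) the statistics adaptor's counters.
    if isinstance(mr, rmm.mr.StatisticsResourceAdaptor):
        upstream = getattr(mr, "upstream_mr", None)
        if isinstance(upstream, RmmResourceAdaptor):
            return upstream

    # Otherwise: fall back to layering a new adaptor on top, as before.
    adaptor = RmmResourceAdaptor(mr)  # constructor signature assumed
    rmm.mr.set_current_device_resource(adaptor)
    return adaptor
```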