
Conversation

@TomAugspurger (Contributor)

This updates rmpf_worker_setup to reuse the current device resource (i.e. not layer a new resource on top) when:

  1. It's already an RmmResourceAdaptor
  2. It's an rmm.mr.StatisticsResourceAdaptor whose upstream MR is an RmmResourceAdaptor

I think supporting case 1 should be safe. I'm hoping that case 2 is also safe, but there are a couple of spots (our Statistics and LimitAvailableMemory) that really do want an RmmResourceAdaptor. For now I've extracted the RmmResourceAdaptor from the StatisticsResourceAdaptor. I'm guessing that any allocations made directly against that RmmResourceAdaptor will throw off the RMM statistics.
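As a rough sketch, the reuse check could look like the following (_reusable_resource_adaptor is a hypothetical name, and I'm assuming the stats adaptor exposes get_upstream(); the actual rmpf_worker_setup logic may differ):

    from __future__ import annotations

    import rmm.mr
    from rapidsmpf.rmm_resource_adaptor import RmmResourceAdaptor

    def _reusable_resource_adaptor(mr) -> RmmResourceAdaptor | None:
        """Return the RmmResourceAdaptor to reuse, or None to layer a new one."""
        # Case 1: the current device resource is already an RmmResourceAdaptor.
        if isinstance(mr, RmmResourceAdaptor):
            return mr
        # Case 2: a StatisticsResourceAdaptor wrapping an RmmResourceAdaptor.
        if isinstance(mr, rmm.mr.StatisticsResourceAdaptor):
            upstream = mr.get_upstream()
            if isinstance(upstream, RmmResourceAdaptor):
                return upstream
        return None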

Closes #523 (Support user-provided Memory Resources)

@copy-pr-bot (bot) commented Sep 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@TomAugspurger (Contributor, Author)

A bit more on the "rmm.mr.StatisticsResourceAdaptor whose upstream MR is an RmmResourceAdaptor" case: rather than extracting the RmmResourceAdaptor from the StatisticsResourceAdaptor, perhaps the spots where we require an RmmResourceAdaptor could instead take RmmResourceAdaptor | StatisticsResourceAdaptor[RmmResourceAdaptor] (i.e. a stats adaptor wrapping the resource adaptor).

We still wouldn't attempt to support arbitrary device resources, which perhaps limits the scope enough to make this feasible? cc @madsbk if you have any thoughts on that.
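For illustration, that widened contract might look like this (SupportedMR and require_resource_adaptor are hypothetical names, not the actual rapidsmpf API):

    from typing import Union

    import rmm.mr
    from rapidsmpf.rmm_resource_adaptor import RmmResourceAdaptor

    # Either the bare adaptor, or a stats adaptor assumed to wrap one;
    # arbitrary device resources remain unsupported.
    SupportedMR = Union[RmmResourceAdaptor, rmm.mr.StatisticsResourceAdaptor]

    def require_resource_adaptor(mr: SupportedMR) -> RmmResourceAdaptor:
        """Validate mr and return the underlying RmmResourceAdaptor."""
        if isinstance(mr, RmmResourceAdaptor):
            return mr
        upstream = mr.get_upstream()
        if not isinstance(upstream, RmmResourceAdaptor):
            raise TypeError(
                "StatisticsResourceAdaptor must wrap an RmmResourceAdaptor"
            )
        return upstream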

@madsbk (Member) left a comment

I think your current approach, retrieving the RmmResourceAdaptor from the stack, is good!

But consider implementing get_rmm_memory_resource_stack() to retrieve the current RmmResourceAdaptor and, if none exists, create (and push) a new adaptor on top of the stack.
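That suggestion could look roughly like this (a sketch; it assumes get_upstream() is available for walking the chain of adaptors):

    import rmm.mr
    from rapidsmpf.rmm_resource_adaptor import RmmResourceAdaptor

    def get_rmm_memory_resource_stack() -> RmmResourceAdaptor:
        """Return the RmmResourceAdaptor in the current resource stack,
        creating and pushing one if none exists."""
        mr = rmm.mr.get_current_device_resource()
        current = mr
        # Walk the upstream chain looking for an existing RmmResourceAdaptor.
        while current is not None:
            if isinstance(current, RmmResourceAdaptor):
                return current
            get_upstream = getattr(current, "get_upstream", None)
            current = get_upstream() if get_upstream is not None else None
        # None found: wrap the current resource and make it the new top.
        adaptor = RmmResourceAdaptor(mr)
        rmm.mr.set_current_device_resource(adaptor)
        return adaptor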

@TomAugspurger added the non-breaking (Introduces a non-breaking change) label Sep 24, 2025
@TomAugspurger marked this pull request as ready for review September 24, 2025 14:16
@TomAugspurger requested a review from a team as a code owner September 24, 2025 14:17
@TomAugspurger added the improvement (Improves an existing functionality) label Sep 24, 2025
@madsbk (Member) left a comment

Overall looks good

@TomAugspurger (Contributor, Author) commented Sep 24, 2025

Hmm, I guess a test is hanging: https://github.com/rapidsai/rapidsmpf/actions/runs/17982528193/job/51154821364?pr=528#step:10:2118... Trying to reproduce locally.

Edit: yep, able to reproduce here (and not on main) :/

@TomAugspurger (Contributor, Author) commented Sep 24, 2025

Interestingly, I can only reproduce the hang in python/rapidsmpf/rapidsmpf/tests/streaming/test_define_py_node.py::test_send_table_chunks after running the stats-pool-cuda case from the new test_rmpf_worker_setup_memory_resource test. If I exclude that case, there's no hang.

That's the case that layers

        # In the test, device_mr is the base device resource (a
        # CudaMemoryResource, per the "stats-pool-cuda" case name).
        mr = rmm.mr.StatisticsResourceAdaptor(
            rapidsmpf.rmm_resource_adaptor.RmmResourceAdaptor(
                rmm.mr.PoolMemoryResource(device_mr)
            )
        )

I'm looking into why this case specifically triggers the subsequent hang.

@TomAugspurger (Contributor, Author)

rapidsai/cudf@e4391fc works around this issue. Previously, I was using the rmm.statistics.statistics context manager, but that errored on __exit__ after rapidsmpf changed the device memory resource. That commit updates it to use StatisticsResourceAdaptor in cudf-polars. rapidsmpf will see that and layer its own RmmResourceAdaptor on top, which should be fine.
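Roughly the shape of that change (a sketch, not the actual cudf commit):

    import rmm.mr
    import rmm.statistics

    # Before: the context manager pops its adaptor on __exit__ and errors
    # if rapidsmpf replaced the device resource while the context was active.
    # with rmm.statistics.statistics():
    #     ...

    # After (sketch): install a StatisticsResourceAdaptor directly, with no
    # teardown step that assumes the resource stack is unchanged.
    mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.get_current_device_resource())
    rmm.mr.set_current_device_resource(mr)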

I'll think a bit more about whether this is worth fixing, but for now I think the workaround will be fine.

@TomAugspurger (Contributor, Author)

I can avoid the hang if I change rmpf_worker_setup to use resource_adaptor instead of the top-level mr in a few more places (as a reminder: on main we have just one memory resource, the RmmResourceAdaptor we create; now we potentially have two: the mr from rmm.mr.get_current_device_resource(), and resource_adaptor, the RmmResourceAdaptor somewhere in that stack):

  1. The MR provided to the Worker Context's BufferResource here.
  2. The MR provided to the spill_staging_buffer DeviceBuffer here.
  3. The MR provided to the spill_manager.add_spill_function function here.

That's essentially bypassing everything from rmm.mr.get_current_device_resource() that's downstream of the RmmResourceAdaptor, which I don't think we want.
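To illustrate the concern (a sketch; the mr keyword on DeviceBuffer is the one item 2 above refers to):

    import rmm
    import rmm.mr
    from rapidsmpf.rmm_resource_adaptor import RmmResourceAdaptor

    pool = rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource())
    resource_adaptor = RmmResourceAdaptor(pool)
    mr = rmm.mr.StatisticsResourceAdaptor(resource_adaptor)
    rmm.mr.set_current_device_resource(mr)

    # Allocating directly against resource_adaptor skips the statistics
    # layer, so mr's counters never see this buffer.
    buf = rmm.DeviceBuffer(size=1 << 20, mr=resource_adaptor)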

I might step back a bit and try to solve this a different way.

@pentschev changed the base branch from branch-25.10 to branch-25.12 September 25, 2025 14:39