
Conversation

@TomAugspurger (Contributor)

This updates rmpf_worker_setup to reuse the current device resource (i.e. not layer a new resource on top) when:

  1. It's already an RmmResourceAdaptor
  2. It's an rmm.mr.StatisticsResourceAdaptor whose upstream MR is an RmmResourceAdaptor

I think supporting case 1 should be safe. I'm hoping that case 2 is also safe, but there are a couple of spots (our Statistics and LimitAvailableMemory) that really do want an RmmResourceAdaptor. For now I've extracted the RmmResourceAdaptor from the StatisticsResourceAdaptor. I'm guessing that any allocations made directly against that RmmResourceAdaptor will throw off the RMM statistics.
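As a rough sketch, the reuse check could look like the following (_reusable_resource_adaptor is a hypothetical name, and I'm assuming the stats adaptor exposes get_upstream(); the actual rmpf_worker_setup logic may differ):

    from __future__ import annotations

    import rmm.mr
    from rapidsmpf.rmm_resource_adaptor import RmmResourceAdaptor

    def _reusable_resource_adaptor(mr) -> RmmResourceAdaptor | None:
        """Return the RmmResourceAdaptor to reuse, or None to layer a new one."""
        # Case 1: the current device resource is already an RmmResourceAdaptor.
        if isinstance(mr, RmmResourceAdaptor):
            return mr
        # Case 2: a StatisticsResourceAdaptor wrapping an RmmResourceAdaptor.
        if isinstance(mr, rmm.mr.StatisticsResourceAdaptor):
            upstream = mr.get_upstream()
            if isinstance(upstream, RmmResourceAdaptor):
                return upstream
        return None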

Closes #523 (Support user-provided Memory Resources)

@copy-pr-bot (bot) commented Sep 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@TomAugspurger (Contributor, Author)

A bit more on the "rmm.mr.StatisticsResourceAdaptor whose upstream MR is an RmmResourceAdaptor" case: rather than extracting the RmmResourceAdaptor from the StatisticsResourceAdaptor, perhaps the spots where we require an RmmResourceAdaptor could instead take RmmResourceAdaptor | StatisticsResourceAdaptor[RmmResourceAdaptor] (i.e. a stats adaptor wrapping the resource adaptor).

We still wouldn't attempt to support arbitrary device resources, which perhaps limits the scope enough to make this feasible? cc @madsbk if you have any thoughts on that.
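For illustration, that widened contract might look like this (SupportedMR and require_resource_adaptor are hypothetical names, not the actual rapidsmpf API):

    from typing import Union

    import rmm.mr
    from rapidsmpf.rmm_resource_adaptor import RmmResourceAdaptor

    # Either the bare adaptor, or a stats adaptor assumed to wrap one;
    # arbitrary device resources remain unsupported.
    SupportedMR = Union[RmmResourceAdaptor, rmm.mr.StatisticsResourceAdaptor]

    def require_resource_adaptor(mr: SupportedMR) -> RmmResourceAdaptor:
        """Validate mr and return the underlying RmmResourceAdaptor."""
        if isinstance(mr, RmmResourceAdaptor):
            return mr
        upstream = mr.get_upstream()
        if not isinstance(upstream, RmmResourceAdaptor):
            raise TypeError(
                "StatisticsResourceAdaptor must wrap an RmmResourceAdaptor"
            )
        return upstream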

@madsbk (Member) left a comment

I think your current approach, retrieving the RmmResourceAdaptor from the stack, is good!

But consider implementing get_rmm_memory_resource_stack() to retrieve the current RmmResourceAdaptor and, if none exists, create (and push) a new adaptor on top of the stack.
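That suggestion could look roughly like this (a sketch; it assumes get_upstream() is available for walking the chain of adaptors):

    import rmm.mr
    from rapidsmpf.rmm_resource_adaptor import RmmResourceAdaptor

    def get_rmm_memory_resource_stack() -> RmmResourceAdaptor:
        """Return the RmmResourceAdaptor in the current resource stack,
        creating and pushing one if none exists."""
        mr = rmm.mr.get_current_device_resource()
        current = mr
        # Walk the upstream chain looking for an existing RmmResourceAdaptor.
        while current is not None:
            if isinstance(current, RmmResourceAdaptor):
                return current
            get_upstream = getattr(current, "get_upstream", None)
            current = get_upstream() if get_upstream is not None else None
        # None found: wrap the current resource and make it the new top.
        adaptor = RmmResourceAdaptor(mr)
        rmm.mr.set_current_device_resource(adaptor)
        return adaptor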

@TomAugspurger added the non-breaking (Introduces a non-breaking change) label Sep 24, 2025
@TomAugspurger marked this pull request as ready for review September 24, 2025 14:16
@TomAugspurger requested a review from a team as a code owner September 24, 2025 14:17
@TomAugspurger added the improvement (Improves an existing functionality) label Sep 24, 2025
@madsbk (Member) left a comment

Overall looks good

@TomAugspurger (Contributor, Author) commented Sep 24, 2025

Hmm, I guess a test is hanging: https://github.com/rapidsai/rapidsmpf/actions/runs/17982528193/job/51154821364?pr=528#step:10:2118... Trying to reproduce locally.

Edit: yep, able to reproduce here (and not on main) :/

@TomAugspurger (Contributor, Author) commented Sep 24, 2025

Interestingly, I can only reproduce the hang in python/rapidsmpf/rapidsmpf/tests/streaming/test_define_py_node.py::test_send_table_chunks after running the stats-pool-cuda case from the new test_rmpf_worker_setup_memory_resource test. If I exclude that case, there's no hang.

That's the case that layers

        # In the test, device_mr is the base device resource (a
        # CudaMemoryResource, per the "stats-pool-cuda" case name).
        mr = rmm.mr.StatisticsResourceAdaptor(
            rapidsmpf.rmm_resource_adaptor.RmmResourceAdaptor(
                rmm.mr.PoolMemoryResource(device_mr)
            )
        )

I'm looking into why this case specifically triggers the subsequent hang.

@TomAugspurger (Contributor, Author)

rapidsai/cudf@e4391fc works around this issue. Previously, I was using the rmm.statistics.statistics context manager, but that errored on __exit__ after rapidsmpf changed the device memory resource. That commit updates it to use StatisticsResourceAdaptor in cudf-polars. rapidsmpf will see that and layer its own RmmResourceAdaptor on top, which should be fine.
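Roughly the shape of that change (a sketch, not the actual cudf commit):

    import rmm.mr
    import rmm.statistics

    # Before: the context manager pops its adaptor on __exit__ and errors
    # if rapidsmpf replaced the device resource while the context was active.
    # with rmm.statistics.statistics():
    #     ...

    # After (sketch): install a StatisticsResourceAdaptor directly, with no
    # teardown step that assumes the resource stack is unchanged.
    mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.get_current_device_resource())
    rmm.mr.set_current_device_resource(mr)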

I'll think a bit more about whether this is worth fixing, but for now I think the workaround will be fine.

@TomAugspurger (Contributor, Author)

I can avoid the hang if I change rmpf_worker_setup to use resource_adaptor instead of the top-level mr in a few more places (as a reminder: on main we have just one memory resource, the RmmResourceAdaptor we create; now we potentially have two: the mr from rmm.mr.get_current_device_resource(), and resource_adaptor, the RmmResourceAdaptor somewhere in that stack):

  1. The MR provided to the Worker Context's BufferResource here.
  2. The MR provided to the spill_staging_buffer DeviceBuffer here.
  3. The MR provided to the spill_manager.add_spill_function function here.

That's essentially bypassing everything from rmm.mr.get_current_device_resource() that's downstream of the RmmResourceAdaptor, which I don't think we want.
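To illustrate the concern (a sketch; the mr keyword on DeviceBuffer is the one item 2 above refers to):

    import rmm
    import rmm.mr
    from rapidsmpf.rmm_resource_adaptor import RmmResourceAdaptor

    pool = rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource())
    resource_adaptor = RmmResourceAdaptor(pool)
    mr = rmm.mr.StatisticsResourceAdaptor(resource_adaptor)
    rmm.mr.set_current_device_resource(mr)

    # Allocating directly against resource_adaptor skips the statistics
    # layer, so mr's counters never see this buffer.
    buf = rmm.DeviceBuffer(size=1 << 20, mr=resource_adaptor)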

I might step back a bit and try to solve this a different way.

@pentschev changed the base branch from branch-25.10 to branch-25.12 September 25, 2025 14:39