
[cuda.cooperative] Add missing overloads to block.reduce and block.sum #2691

Merged

Conversation

@brycelelbach (Collaborator) commented Nov 4, 2024

Description

cub::BlockReduce has the following cooperative method overloads, but not all of them were exposed in cuda.cooperative:

  • T Reduce(T, Op);
  • T Reduce(T, Op, int num_valid); (added in this PR)
  • T Reduce(T(&)[ITEMS_PER_THREAD], Op); (added in this PR)
  • T Sum(T);
  • T Sum(T, int num_valid);
  • T Sum(T(&)[ITEMS_PER_THREAD]); (added in this PR)

This PR fixes that.

cub::BlockReduce doesn't take a class-level items-per-thread template parameter, but the array overloads take one at the method level. So, I had to add an items-per-thread parameter to block.reduce and block.sum, and I gave it a default of 1. I recall having a bug at one point where I forgot to pass items-per-thread, and I think it led to elements in the input arrays being silently ignored.
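
For illustration, here is a minimal sketch of how the array overload and the new items-per-thread parameter might be used from a numba.cuda kernel. This is not this PR's code: the exact block.sum(dtype, threads_per_block, items_per_thread) signature, the .files link list, and the pynvjitlink patching are assumptions modeled on cuda.cooperative's experimental API.

    import numba
    import numpy as np
    from numba import cuda
    from cuda.cooperative.experimental import block
    from pynvjitlink import patch

    patch.patch_numba_linker(lto=True)  # let numba link the LTO-IR that cuda.cooperative emits

    threads_per_block = 128
    items_per_thread = 4

    # Specialize a block-wide sum for int32 with 4 items per thread.
    block_sum = block.sum(numba.int32, threads_per_block, items_per_thread)

    @cuda.jit(link=block_sum.files)
    def kernel(data, result):
        tid = cuda.threadIdx.x
        # Each thread feeds its own items_per_thread-element slice to the reduction.
        items = cuda.local.array(items_per_thread, numba.int32)
        for i in range(items_per_thread):
            items[i] = data[tid * items_per_thread + i]
        aggregate = block_sum(items)
        if tid == 0:  # only thread 0 holds the valid block-wide aggregate
            result[0] = aggregate

    data = cuda.to_device(np.ones(threads_per_block * items_per_thread, dtype=np.int32))
    result = cuda.device_array(1, dtype=np.int32)
    kernel[1, threads_per_block](data, result)
    assert result.copy_to_host()[0] == threads_per_block * items_per_thread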

Checklist

  • Add tests.
  • Check what happens if you call the array overloads with an array larger than the specified items-per-thread (see the test sketch after this list); if this silently ignores elements, we should strive for a diagnostic and consider the implications of having items-per-thread defaulted to 1.
  • Update the cuda.cooperative documentation to list the cooperative overloads of all algorithms; currently none are listed (this issue is not specific to reduce).
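
To make the second checklist item concrete, here is a hypothetical test for the oversized-array case. The signatures mirror the earlier sketch and are likewise assumptions, not code from this PR; the assertion encodes the behavior we would want and would fail if the trailing elements were silently ignored.

    import numba
    import numpy as np
    from numba import cuda
    from cuda.cooperative.experimental import block

    def test_array_larger_than_items_per_thread():
        threads_per_block = 32
        items_per_thread = 2
        block_sum = block.sum(numba.int32, threads_per_block, items_per_thread)

        @cuda.jit(link=block_sum.files)
        def kernel(result):
            tid = cuda.threadIdx.x
            # Local array is deliberately larger than items_per_thread.
            items = cuda.local.array(4, numba.int32)
            for i in range(4):
                items[i] = 1
            aggregate = block_sum(items)
            if tid == 0:
                result[0] = aggregate

        result = cuda.device_array(1, dtype=np.int32)
        kernel[1, threads_per_block](result)
        # Silently dropping the extra elements would yield 64 (32 * 2);
        # summing the whole array yields 128 (32 * 4). A diagnostic at
        # specialization or call time would be better than either surprise.
        assert result.copy_to_host()[0] == threads_per_block * 4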

@brycelelbach requested a review from a team as a code owner on November 4, 2024 14:27

copy-pr-bot bot commented Nov 4, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@brycelelbach force-pushed the pr/cuda.cooperative/reduce_overloads branch from d765f05 to c72163d on November 4, 2024 14:43
@brycelelbach changed the title from "[cuda.cooperative] Add missing overloads to block.Reduce and block.Sum" to "[cuda.cooperative] Add missing overloads to block.reduce and block.sum" on Nov 4, 2024
@leofang (Member) commented Nov 4, 2024

cc @rwgk @emcastillo for vis

@brycelelbach force-pushed the pr/cuda.cooperative/reduce_overloads branch from c72163d to 43aa575 on February 1, 2025 00:28
@brycelelbach requested a review from a team as a code owner on February 1, 2025 00:28
@brycelelbach force-pushed the pr/cuda.cooperative/reduce_overloads branch from 68194ee to a1f190b on February 2, 2025 18:50
@brycelelbach (Collaborator, Author) commented

This should be good to go now.

@leofang (Member) commented Feb 3, 2025

/ok to test

github-actions bot (Contributor) commented Feb 3, 2025

🟩 CI finished in 26m 12s: Pass: 100%/1 | Total: 26m 12s | Avg: 26m 12s | Max: 26m 12s
  • 🟩 python: Pass: 100%/1 | Total: 26m 12s | Avg: 26m 12s | Max: 26m 12s

The single passing job covered: cpu amd64, ctk 12.8, cudacxx nvcc 12.8 (family nvcc), cxx GCC 13 (family GCC), gpu rtx2080, job type Test.

👃 Inspect Changes

Modifications in project? +/- python only; CCCL Infrastructure, libcu++, CUB, Thrust, CUDA Experimental, CCCL C Parallel Library, and Catch2Helper are unmodified, both directly and through dependencies.

🏃 Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina (Contributor) left a comment


Approving, pending a couple of minor pieces of feedback.

@brycelelbach force-pushed the pr/cuda.cooperative/reduce_overloads branch from b57a4f4 to 6ac50fe on February 5, 2025 18:26
@shwina (Contributor) commented Feb 5, 2025

/ok to test

github-actions bot (Contributor) commented Feb 5, 2025

🟩 CI finished in 34m 47s: Pass: 100%/1 | Total: 34m 47s | Avg: 34m 47s | Max: 34m 47s
  • 🟩 python: Pass: 100%/1 | Total: 34m 47s | Avg: 34m 47s | Max: 34m 47s

As before, the single passing job covered: cpu amd64, ctk 12.8, cudacxx nvcc 12.8 (family nvcc), cxx GCC 13 (family GCC), gpu rtx2080, job type Test.

👃 Inspect Changes

Modifications in project? +/- python only; CCCL Infrastructure, libcu++, CUB, Thrust, CUDA Experimental, CCCL C Parallel Library, and Catch2Helper are unmodified, both directly and through dependencies.

🏃 Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@rwgk (Contributor) left a comment


This looks good to me.

I don't have a lot of relevant specific background, but I looked through pretty carefully. The only thing that stood out to me is that the test code seems to be very repetitive. It would be nice to reduce that, but maybe later.

I also asked ChatGPT for advice regarding the repetitive code; the response is below, just in case it's useful later:


You said:
Do you have suggestions for reducing the copy-pasted code in python/cuda_cooperative/tests/test_block_reduce.py?
ChatGPT said:
Yes! The test file python/cuda_cooperative/tests/test_block_reduce.py appears to have repetitive code patterns that could be refactored to improve maintainability. Here are some suggestions:

  1. Use a Parameterized Test Framework (pytest.mark.parametrize)
    Instead of duplicating similar test cases, you can use pytest.mark.parametrize to run a single function with multiple inputs.

Example Before (Repetitive Tests)

    def test_block_reduce_int():
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, 10)
        assert result == expected_result

    def test_block_reduce_float():
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, 10.5)
        assert result == expected_result

Refactored (Using pytest.mark.parametrize)

    import pytest

    @pytest.mark.parametrize("input_value, expected", [
        (10, expected_result_int),
        (10.5, expected_result_float),
    ])
    def test_block_reduce(input_value, expected):
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_value)
        assert result == expected

This avoids duplicating the same test logic while keeping it flexible.

  2. Abstract the Common Test Logic Into a Helper Function
    If there is repeated logic in multiple test functions, consider refactoring it into a helper function.

Example Before (Repeated Logic)

    def test_block_reduce_int():
        input_data = [1, 2, 3, 4]
        expected = sum(input_data)
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_data)
        assert result == expected

    def test_block_reduce_float():
        input_data = [1.1, 2.2, 3.3, 4.4]
        expected = sum(input_data)
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_data)
        assert result == expected

Refactored (Using a Helper Function)

    def run_block_reduce_test(input_data):
        expected = sum(input_data)
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_data)
        assert result == expected

    def test_block_reduce_int():
        run_block_reduce_test([1, 2, 3, 4])

    def test_block_reduce_float():
        run_block_reduce_test([1.1, 2.2, 3.3, 4.4])

Now, all tests share the same logic while remaining readable.

  3. Use Fixtures for Setup Code
    If the tests involve setting up CUDA blocks or other common components, you can use pytest fixtures.

Example Before (Repeated Setup Code)

    def test_block_reduce_with_setup():
        setup_cuda_block()
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3, 4])
        assert result == 10

Refactored (Using a Fixture)

    @pytest.fixture
    def cuda_block_setup():
        setup_cuda_block()  # Runs before each test that uses it

    def test_block_reduce_with_fixture(cuda_block_setup):
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3, 4])
        assert result == 10

This removes unnecessary duplication while keeping test setup organized.

  4. Use Loops for Similar Assertions
    If multiple assertions follow the same pattern, iterate over them instead.

Example Before (Multiple Assertions)

    def test_block_reduce_multiple():
        assert cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2]) == 3
        assert cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3]) == 6
        assert cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3, 4]) == 10

Refactored (Loop Over Assertions)

    @pytest.mark.parametrize("input_data, expected", [
        ([1, 2], 3),
        ([1, 2, 3], 6),
        ([1, 2, 3, 4], 10),
    ])
    def test_block_reduce_multiple(input_data, expected):
        assert cuda_cooperative.block.reduce(lambda x, y: x + y, input_data) == expected

This keeps the test cases structured while reducing redundancy.

Conclusion
By applying these techniques—using pytest.mark.parametrize, helper functions, fixtures, and loops—you can significantly reduce copy-pasted code in test_block_reduce.py, making it easier to maintain and extend in the future.

Would you like help applying these changes to the PR? 🚀

@bernhardmgruber merged commit 1c792ab into NVIDIA:main on Feb 6, 2025
16 of 19 checks passed
@brycelelbach deleted the pr/cuda.cooperative/reduce_overloads branch on February 22, 2025 15:43