[cuda.cooperative] Add missing overloads to block.reduce and block.sum #2691
Conversation
cc @rwgk @emcastillo for vis
This should be good to go now.
/ok to test
🟩 CI finished in 26m 12s: Pass: 100%/1 | Total: 26m 12s | Avg: 26m 12s | Max: 26m 12s
🏃 Runner counts (total jobs: 1): linux-amd64-gpu-rtx2080-latest-1
python/cuda_cooperative/cuda/cooperative/experimental/block/_block_reduce.py
Approving, pending a couple of minor pieces of feedback.
…, array input) and block.sum (array input).
…e signatures will match for arrays.
- Update documentation to indicate what signatures the produced collective objects support.
- Add a missing import of `DependentArray`.
…on tests to check that we aren't reducing the entire block.
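The regression concern in that commit, making sure a reduction limited to `num_valid` items does not silently consume the whole block, can be illustrated with a host-side sketch. This is a hypothetical stand-in: `emulate_block_reduce` is not part of the cuda.cooperative API, and the real tests launch device kernels.

```python
# Hypothetical host-side regression sketch: a reduction restricted to
# num_valid items must differ from the full-block reduction whenever the
# excluded tail would change the result.
from functools import reduce

def emulate_block_reduce(op, values, num_valid):
    # Only the first num_valid "thread" values participate.
    return reduce(op, values[:num_valid])

def test_partial_reduce_excludes_tail():
    values = [1, 2, 3, 4]          # one value per "thread"
    add = lambda x, y: x + y
    partial = emulate_block_reduce(add, values, num_valid=2)
    full = emulate_block_reduce(add, values, num_valid=len(values))
    assert partial == 3            # only the first two items
    assert partial != full         # the tail was genuinely excluded

test_partial_reduce_excludes_tail()
```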
/ok to test
🟩 CI finished in 34m 47s: Pass: 100%/1 | Total: 34m 47s | Avg: 34m 47s | Max: 34m 47s
This looks good to me.
I don't have a lot of relevant specific background, but I looked through pretty carefully. The only thing that stood out to me is that the test code seems to be very repetitive. It would be nice to reduce that, but maybe later.
I also asked ChatGPT for advice regarding the repetitive code; the response is below, just in case it's useful later:
You said:
Do you have suggestions for reducing the copy-pasted code in python/cuda_cooperative/tests/test_block_reduce.py?
ChatGPT said:
Yes! The test file python/cuda_cooperative/tests/test_block_reduce.py appears to have repetitive code patterns that could be refactored to improve maintainability. Here are some suggestions:
- Use a Parameterized Test Framework (`pytest.mark.parametrize`)

Instead of duplicating similar test cases, you can use `pytest.mark.parametrize` to run a single function with multiple inputs.

Example Before (Repetitive Tests)

```python
def test_block_reduce_int():
    result = cuda_cooperative.block.reduce(lambda x, y: x + y, 10)
    assert result == expected_result

def test_block_reduce_float():
    result = cuda_cooperative.block.reduce(lambda x, y: x + y, 10.5)
    assert result == expected_result
```

Refactored (Using `pytest.mark.parametrize`)

```python
import pytest

@pytest.mark.parametrize("input_value, expected", [
    (10, expected_result_int),
    (10.5, expected_result_float),
])
def test_block_reduce(input_value, expected):
    result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_value)
    assert result == expected
```

This avoids duplicating the same test logic while keeping it flexible.
- Abstract the Common Test Logic Into a Helper Function

If there is repeated logic in multiple test functions, consider refactoring it into a helper function.

Example Before (Repeated Logic)

```python
def test_block_reduce_int():
    input_data = [1, 2, 3, 4]
    expected = sum(input_data)
    result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_data)
    assert result == expected

def test_block_reduce_float():
    input_data = [1.1, 2.2, 3.3, 4.4]
    expected = sum(input_data)
    result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_data)
    assert result == expected
```

Refactored (Using a Helper Function)

```python
def run_block_reduce_test(input_data):
    expected = sum(input_data)
    result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_data)
    assert result == expected

def test_block_reduce_int():
    run_block_reduce_test([1, 2, 3, 4])

def test_block_reduce_float():
    run_block_reduce_test([1.1, 2.2, 3.3, 4.4])
```

Now, all tests share the same logic while remaining readable.
- Use Fixtures for Setup Code

If the tests involve setting up CUDA blocks or other common components, you can use pytest fixtures.

Example Before (Repeated Setup Code)

```python
def test_block_reduce_with_setup():
    setup_cuda_block()
    result = cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3, 4])
    assert result == 10
```

Refactored (Using a Fixture)

```python
@pytest.fixture
def cuda_block_setup():
    setup_cuda_block()  # Runs before each test that uses it

def test_block_reduce_with_fixture(cuda_block_setup):
    result = cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3, 4])
    assert result == 10
```

This removes unnecessary duplication while keeping test setup organized.
- Use Loops for Similar Assertions

If multiple assertions follow the same pattern, iterate over them instead.

Example Before (Multiple Assertions)

```python
def test_block_reduce_multiple():
    assert cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2]) == 3
    assert cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3]) == 6
    assert cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3, 4]) == 10
```

Refactored (Loop Over Assertions)

```python
@pytest.mark.parametrize("input_data, expected", [
    ([1, 2], 3),
    ([1, 2, 3], 6),
    ([1, 2, 3, 4], 10),
])
def test_block_reduce_multiple(input_data, expected):
    assert cuda_cooperative.block.reduce(lambda x, y: x + y, input_data) == expected
```

This keeps the test cases structured while reducing redundancy.
Conclusion
By applying these techniques—using pytest.mark.parametrize, helper functions, fixtures, and loops—you can significantly reduce copy-pasted code in test_block_reduce.py, making it easier to maintain and extend in the future.
Would you like help applying these changes to the PR? 🚀
Description

`cub::BlockReduce` cooperative methods have the following overloads, but not all were exposed in `cuda.cooperative`:

- `T Reduce(T, Op);`
- `T Reduce(T, Op, int num_valid);` (added in this PR)
- `T Reduce(T(&)[ITEMS_PER_THREAD], Op);` (added in this PR)
- `T Sum(T);`
- `T Sum(T, int num_valid);`
- `T Sum(T(&)[ITEMS_PER_THREAD]);` (added in this PR)

This PR fixes that.
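As a host-side illustration of what the overload shapes above compute, here is a minimal pure-Python sketch. The helper names `block_reduce` and `block_reduce_array` are hypothetical stand-ins, not the cuda.cooperative API; the real primitives are compiled for and run on the device.

```python
# A pure-Python sketch of the reduction semantics the overloads expose.
# This emulates, on the host, what the block-wide primitives compute.
from functools import reduce as host_reduce

def block_reduce(op, thread_values, num_valid=None):
    """Emulate T Reduce(T, Op[, int num_valid]): one value per thread."""
    if num_valid is not None:
        thread_values = thread_values[:num_valid]  # trailing threads ignored
    return host_reduce(op, thread_values)

def block_reduce_array(op, per_thread_items):
    """Emulate T Reduce(T(&)[ITEMS_PER_THREAD], Op): an array per thread."""
    flat = [x for items in per_thread_items for x in items]
    return host_reduce(op, flat)

add = lambda x, y: x + y
print(block_reduce(add, [1, 2, 3, 4]))               # 10
print(block_reduce(add, [1, 2, 3, 4], num_valid=2))  # 3
print(block_reduce_array(add, [[1, 2], [3, 4]]))     # 10
```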
`cub::BlockReduce` doesn't take a class-level items-per-thread template parameter, but the array overloads do. So I had to add an items-per-thread parameter to `block.reduce` and `block.sum`, with a default of 1. I recall having a bug at one point where I forgot to add items-per-thread, and I think it led to elements in the input arrays being silently ignored.

Checklist
- `cuda.cooperative` documentation to list the cooperative overloads of all algorithms; currently none are listed (this is not an issue for just reduce).