
[cuda.cooperative] Add missing overloads to block.reduce and block.sum #2691

Merged

Conversation

@brycelelbach (Collaborator) commented Nov 4, 2024

Description

cub::BlockReduce has the following cooperative method overloads, but not all of them were exposed in cuda.cooperative:

  • T Reduce(T, Op);
  • T Reduce(T, Op, int num_valid); (added in this PR)
  • T Reduce(T(&)[ITEMS_PER_THREAD], Op); (added in this PR)
  • T Sum(T);
  • T Sum(T, int num_valid);
  • T Sum(T(&)[ITEMS_PER_THREAD]); (added in this PR)

This PR fixes that.

cub::BlockReduce doesn't take a class-level items-per-thread template parameter, but the array overloads take one at the method level. So, I had to add an items-per-thread parameter to block.reduce and block.sum, and I gave it a default of 1. I recall having a bug at one point where I forgot to pass items-per-thread, and I think it led to elements in the input arrays being silently ignored.
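
For illustration, here is a minimal sketch of how the array overload and the new items-per-thread parameter might be used from a numba.cuda kernel. This is not this PR's code: the exact block.sum(dtype, threads_per_block, items_per_thread) signature, the .files link list, and the pynvjitlink patching are assumptions modeled on cuda.cooperative's experimental API.

    import numba
    import numpy as np
    from numba import cuda
    from cuda.cooperative.experimental import block
    from pynvjitlink import patch

    patch.patch_numba_linker(lto=True)  # let numba link the LTO-IR that cuda.cooperative emits

    threads_per_block = 128
    items_per_thread = 4

    # Specialize a block-wide sum for int32 with 4 items per thread.
    block_sum = block.sum(numba.int32, threads_per_block, items_per_thread)

    @cuda.jit(link=block_sum.files)
    def kernel(data, result):
        tid = cuda.threadIdx.x
        # Each thread feeds its own items_per_thread-element slice to the reduction.
        items = cuda.local.array(items_per_thread, numba.int32)
        for i in range(items_per_thread):
            items[i] = data[tid * items_per_thread + i]
        aggregate = block_sum(items)
        if tid == 0:  # only thread 0 holds the valid block-wide aggregate
            result[0] = aggregate

    data = cuda.to_device(np.ones(threads_per_block * items_per_thread, dtype=np.int32))
    result = cuda.device_array(1, dtype=np.int32)
    kernel[1, threads_per_block](data, result)
    assert result.copy_to_host()[0] == threads_per_block * items_per_thread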

Checklist

  • Add tests.
  • Check what happens if you call the array overloads with an array larger than the specified items-per-thread (see the test sketch after this list); if this silently ignores elements, we should strive for a diagnostic and consider the implications of having items-per-thread defaulted to 1.
  • Update the cuda.cooperative documentation to list the cooperative overloads of all algorithms; currently none are listed (this issue is not specific to reduce).
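
To make the second checklist item concrete, here is a hypothetical test for the oversized-array case. The signatures mirror the earlier sketch and are likewise assumptions, not code from this PR; the assertion encodes the behavior we would want and would fail if the trailing elements were silently ignored.

    import numba
    import numpy as np
    from numba import cuda
    from cuda.cooperative.experimental import block

    def test_array_larger_than_items_per_thread():
        threads_per_block = 32
        items_per_thread = 2
        block_sum = block.sum(numba.int32, threads_per_block, items_per_thread)

        @cuda.jit(link=block_sum.files)
        def kernel(result):
            tid = cuda.threadIdx.x
            # Local array is deliberately larger than items_per_thread.
            items = cuda.local.array(4, numba.int32)
            for i in range(4):
                items[i] = 1
            aggregate = block_sum(items)
            if tid == 0:
                result[0] = aggregate

        result = cuda.device_array(1, dtype=np.int32)
        kernel[1, threads_per_block](result)
        # Silently dropping the extra elements would yield 64 (32 * 2);
        # summing the whole array yields 128 (32 * 4). A diagnostic at
        # specialization or call time would be better than either surprise.
        assert result.copy_to_host()[0] == threads_per_block * 4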

@brycelelbach requested a review from a team as a code owner on November 4, 2024 14:27

copy-pr-bot bot commented Nov 4, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@brycelelbach force-pushed the pr/cuda.cooperative/reduce_overloads branch from d765f05 to c72163d on November 4, 2024 14:43
@brycelelbach changed the title from "[cuda.cooperative] Add missing overloads to block.Reduce and block.Sum" to "[cuda.cooperative] Add missing overloads to block.reduce and block.sum" on Nov 4, 2024
@leofang (Member) commented Nov 4, 2024

cc @rwgk @emcastillo for vis

@brycelelbach force-pushed the pr/cuda.cooperative/reduce_overloads branch from c72163d to 43aa575 on February 1, 2025 00:28
@brycelelbach requested a review from a team as a code owner on February 1, 2025 00:28
@brycelelbach force-pushed the pr/cuda.cooperative/reduce_overloads branch from 68194ee to a1f190b on February 2, 2025 18:50
@brycelelbach (Collaborator, Author) commented

This should be good to go now.

@leofang (Member) commented Feb 3, 2025

/ok to test

github-actions bot (Contributor) commented Feb 3, 2025

🟩 CI finished in 26m 12s: Pass: 100%/1 | Total: 26m 12s | Avg: 26m 12s | Max: 26m 12s
  • 🟩 python: Pass: 100%/1 | Total: 26m 12s | Avg: 26m 12s | Max: 26m 12s

The single passing job covered: cpu amd64, ctk 12.8, cudacxx nvcc 12.8 (family nvcc), cxx GCC 13 (family GCC), gpu rtx2080, job type Test.

👃 Inspect Changes

Modifications in project? +/- python only; CCCL Infrastructure, libcu++, CUB, Thrust, CUDA Experimental, CCCL C Parallel Library, and Catch2Helper are unmodified, both directly and through dependencies.

🏃 Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina (Contributor) left a comment


Approving, pending a couple of minor pieces of feedback.

@brycelelbach force-pushed the pr/cuda.cooperative/reduce_overloads branch from b57a4f4 to 6ac50fe on February 5, 2025 18:26
@shwina (Contributor) commented Feb 5, 2025

/ok to test

github-actions bot (Contributor) commented Feb 5, 2025

🟩 CI finished in 34m 47s: Pass: 100%/1 | Total: 34m 47s | Avg: 34m 47s | Max: 34m 47s
  • 🟩 python: Pass: 100%/1 | Total: 34m 47s | Avg: 34m 47s | Max: 34m 47s

As before, the single passing job covered: cpu amd64, ctk 12.8, cudacxx nvcc 12.8 (family nvcc), cxx GCC 13 (family GCC), gpu rtx2080, job type Test.

👃 Inspect Changes

Modifications in project? +/- python only; CCCL Infrastructure, libcu++, CUB, Thrust, CUDA Experimental, CCCL C Parallel Library, and Catch2Helper are unmodified, both directly and through dependencies.

🏃 Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@rwgk (Contributor) left a comment


This looks good to me.

I don't have a lot of relevant specific background, but I looked through pretty carefully. The only thing that stood out to me is that the test code seems to be very repetitive. It would be nice to reduce that, but maybe later.

I also asked ChatGPT for advice regarding the repetitive code; the response is below, just in case it's useful later:


You said:
Do you have suggestions for reducing the copy-pasted code in python/cuda_cooperative/tests/test_block_reduce.py?
ChatGPT said:
Yes! The test file python/cuda_cooperative/tests/test_block_reduce.py appears to have repetitive code patterns that could be refactored to improve maintainability. Here are some suggestions:

  1. Use a Parameterized Test Framework (pytest.mark.parametrize)
    Instead of duplicating similar test cases, you can use pytest.mark.parametrize to run a single function with multiple inputs.

Example Before (Repetitive Tests)

    def test_block_reduce_int():
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, 10)
        assert result == expected_result

    def test_block_reduce_float():
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, 10.5)
        assert result == expected_result

Refactored (Using pytest.mark.parametrize)

    import pytest

    @pytest.mark.parametrize("input_value, expected", [
        (10, expected_result_int),
        (10.5, expected_result_float),
    ])
    def test_block_reduce(input_value, expected):
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_value)
        assert result == expected

This avoids duplicating the same test logic while keeping it flexible.

  2. Abstract the Common Test Logic Into a Helper Function
    If there is repeated logic in multiple test functions, consider refactoring it into a helper function.

Example Before (Repeated Logic)

    def test_block_reduce_int():
        input_data = [1, 2, 3, 4]
        expected = sum(input_data)
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_data)
        assert result == expected

    def test_block_reduce_float():
        input_data = [1.1, 2.2, 3.3, 4.4]
        expected = sum(input_data)
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_data)
        assert result == expected

Refactored (Using a Helper Function)

    def run_block_reduce_test(input_data):
        expected = sum(input_data)
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, input_data)
        assert result == expected

    def test_block_reduce_int():
        run_block_reduce_test([1, 2, 3, 4])

    def test_block_reduce_float():
        run_block_reduce_test([1.1, 2.2, 3.3, 4.4])

Now, all tests share the same logic while remaining readable.

  3. Use Fixtures for Setup Code
    If the tests involve setting up CUDA blocks or other common components, you can use pytest fixtures.

Example Before (Repeated Setup Code)

    def test_block_reduce_with_setup():
        setup_cuda_block()
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3, 4])
        assert result == 10

Refactored (Using a Fixture)

    @pytest.fixture
    def cuda_block_setup():
        setup_cuda_block()  # Runs before each test that uses it

    def test_block_reduce_with_fixture(cuda_block_setup):
        result = cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3, 4])
        assert result == 10

This removes unnecessary duplication while keeping test setup organized.

  4. Use Loops for Similar Assertions
    If multiple assertions follow the same pattern, iterate over them instead.

Example Before (Multiple Assertions)

    def test_block_reduce_multiple():
        assert cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2]) == 3
        assert cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3]) == 6
        assert cuda_cooperative.block.reduce(lambda x, y: x + y, [1, 2, 3, 4]) == 10

Refactored (Loop Over Assertions)

    @pytest.mark.parametrize("input_data, expected", [
        ([1, 2], 3),
        ([1, 2, 3], 6),
        ([1, 2, 3, 4], 10),
    ])
    def test_block_reduce_multiple(input_data, expected):
        assert cuda_cooperative.block.reduce(lambda x, y: x + y, input_data) == expected

This keeps the test cases structured while reducing redundancy.

Conclusion
By applying these techniques—using pytest.mark.parametrize, helper functions, fixtures, and loops—you can significantly reduce copy-pasted code in test_block_reduce.py, making it easier to maintain and extend in the future.

Would you like help applying these changes to the PR? 🚀

@bernhardmgruber merged commit 1c792ab into NVIDIA:main on Feb 6, 2025
16 of 19 checks passed
@brycelelbach deleted the pr/cuda.cooperative/reduce_overloads branch on February 22, 2025 15:43