
Conversation

@madhav-madhusoodanan (Contributor) commented Dec 10, 2025

Context

Ported the matrixMul sample from NVIDIA/cuda-samples.

Relevant Issue

@madhav-madhusoodanan force-pushed the add_cuda_examples branch 2 times, most recently from 2f731da to 7084de1 on December 10, 2025 at 21:33
@madhav-madhusoodanan (Contributor, Author)

Seems there are other modules that need formatting.

@nnethercote (Collaborator) left a comment

Thanks for doing this! Generally looking good, though I have a few suggestions.

Also, can you squash the commits together? They conceptually belong together, and there's no point polluting version control history with all these incomplete intermediate revisions.

Cargo.toml (outdated)

"samples/introduction/async_api",
"samples/introduction/async_api/kernels",

@nnethercote (Collaborator)

No need for a blank line here.

3. The CPU can query these events to check whether the GPU has finished its work, allowing for coordination between the two processors without blocking the CPU.
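(Not part of the diff: a minimal host-side sketch of such a non-blocking query. This assumes cust's `Event` API matches RustaCUDA's `Event::new`/`record`/`query` and that `EventStatus` is comparable; `stream` is a hypothetical `cust::stream::Stream` with work already queued.)

```rust
use cust::event::{Event, EventFlags, EventStatus};

// Record an event after the async work queued on `stream`.
let event = Event::new(EventFlags::DEFAULT)?;
event.record(&stream)?;

// Poll the event instead of blocking; `Ready` means all GPU work
// queued before the record call has completed.
while event.query()? == EventStatus::NotReady {
    // ... do useful CPU work here instead of waiting ...
}
```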

## [matrixMul](https://github.com/Rust-GPU/rust-cuda/samples/introduction/matmul)
This example demonstrates an example kernel implementation of matrix multiplicaation.
@nnethercote (Collaborator)
Suggested change:
```diff
-This example demonstrates an example kernel implementation of matrix multiplicaation.
+This example demonstrates an example kernel implementation of matrix multiplication.
```

@madhav-madhusoodanan (Contributor, Author)

Thank you for pointing that out!

```rust
// However, to improve numerical stability, we use Kahan summation here so that
// the error can be isolated and not allowed to accumulate in c_sub.
unsafe {
    let input = As[ty][k].assume_init() * Bs[k][tx].assume_init();
```
@nnethercote (Collaborator)
Can this `unsafe` block be shrunk to just surround the assignment, i.e. `let input = unsafe { ... };`?
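For illustration, the narrowed form would be something like this (a sketch of the suggestion, not code from the PR):

```rust
// Only the MaybeUninit reads need `unsafe`; the rest of the loop body stays safe.
let input = unsafe { As[ty][k].assume_init() * Bs[k][tx].assume_init() };
```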


```rust
// SAFETY: This function is unsafe because it dereferences raw pointers.
#[kernel]
pub unsafe fn matrix_mul_cuda(C: *mut f32, A: *const f32, B: *const f32, wa: usize, wb: usize) {
```
@nnethercote (Collaborator)
Probably better to name these `c`, `a`, `b`, given that upper-case names are normally used for statics. You'll need to rename some variables below as well; perhaps `ai` and `bi` would work.
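A sketch of the suggested renaming (hypothetical; the slice question below is still open at this point):

```rust
// SAFETY: This function is unsafe because it dereferences raw pointers.
#[kernel]
pub unsafe fn matrix_mul_cuda(c: *mut f32, a: *const f32, b: *const f32, wa: usize, wb: usize) {
    // ... index variables in the body renamed accordingly, e.g. `ai` and `bi` ...
}
```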

@nnethercote (Collaborator)
`c` must be a raw pointer because it's modified. But `a` and `b` can be slices because they are const, which would be more idiomatic. Can you change them?

@madhav-madhusoodanan (Contributor, Author)
Won't this cause the error below? I'm running into it in CI:

```text
error: `extern` fn uses type `[f32]`, which is not FFI-safe
 --> samples/introduction/matmul/kernels/src/lib.rs:6:47
  |
6 | pub unsafe fn matrix_mul_cuda(c: *mut f32, a: &[f32], b: &[f32], wa: usize, wb: usize) {
  |                                               ^^^^^^ not FFI-safe
  |
  = help: consider using a raw pointer instead
  = note: slices have no C equivalent
```

```rust
let device = Device::get_device(0).expect("Couldn't find Cuda supported devices!");
println!("Device Name: {}", device.name().unwrap());

let block_size: u32 = 32;
```
@nnethercote (Collaborator)
This could be a `const BLOCK_SIZE: u32 = 32;`.


```rust
let block_size: u32 = 32;
let dims_a: (usize, usize, usize) = (40 * block_size as usize, 40 * block_size as usize, 1);
let dims_b: (usize, usize, usize) = (80 * block_size as usize, 40 * block_size as usize, 1);
```
@nnethercote (Collaborator) commented Dec 10, 2025
I don't like the use of a tuple for the dimensions. You could instead use `cuda_std::glam::UsizeVec2`. That would let you use `.x` and `.y` in `matrix_mul_cuda`, and it would avoid the need to set the unused `z` field to 1.

Also, the numbers (40 and 80) here are different to the original code (10 and 20). Any reason for that?
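(For reference, the suggested change might look like this sketch, assuming the `UsizeVec2` re-export named above and folding in the earlier `const` suggestion:)

```rust
use cuda_std::glam::UsizeVec2;

const BLOCK_SIZE: u32 = 32;

// 2-D dimensions only; no unused z field to set to 1.
let dims_a = UsizeVec2::new(40 * BLOCK_SIZE as usize, 40 * BLOCK_SIZE as usize);
let dims_b = UsizeVec2::new(80 * BLOCK_SIZE as usize, 40 * BLOCK_SIZE as usize);
```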

@madhav-madhusoodanan (Contributor, Author) commented Dec 11, 2025

Thank you for telling me about `cuda_std::glam::UsizeVec2`!

About the numbers (40 and 80): I noticed that 10 and 20 (from the original example) didn't accumulate much summation error, while larger dimensions did.

I want to give developers who view the samples some clarity about the existence of such basic pitfalls and how they can be mitigated (one way being the Kahan summation algorithm).

What do you think?
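(For readers unfamiliar with it, a minimal scalar sketch of Kahan summation, independent of the kernel code in this PR:)

```rust
/// Compensated (Kahan) summation: `c` tracks the low-order bits lost
/// to rounding at each step, so the error stays bounded instead of growing.
fn kahan_sum(values: &[f32]) -> f32 {
    let mut sum = 0.0_f32;
    let mut c = 0.0_f32; // running compensation
    for &v in values {
        let y = v - c; // apply the correction to the incoming value
        let t = sum + y; // big + small: low-order bits of `y` may be lost here
        c = (t - sum) - y; // recover exactly what was lost
        sum = t;
    }
    sum
}
```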

```rust
DeviceBuffer::from_slice(h_c.as_slice()).expect("device array couldn't be initialized!");

stream.synchronize().expect("Stream couldn't synchronize!");
let blocks = BlockSize::xy(block_size as u32, block_size as u32);
```
@nnethercote (Collaborator)
Ugh, `BlockSize` and `GridSize` should really store `usize` instead of `u32`. But nothing to be done about it here.

@nnethercote (Collaborator)
Also, `blocks` is called `threads` in the original, which I find clearer. (`blocks` makes me think it's the number of blocks, which is what `grid` actually is.)
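That is, something like this sketch (assuming `GridSize::xy` mirrors `BlockSize::xy`, with the grid computation taken from the original CUDA sample):

```rust
// Threads per block vs. number of blocks, named as in the original sample.
let threads = BlockSize::xy(BLOCK_SIZE, BLOCK_SIZE);
let grid = GridSize::xy(
    (dims_b.x / BLOCK_SIZE as usize) as u32,
    (dims_a.y / BLOCK_SIZE as usize) as u32,
);
```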

@madhav-madhusoodanan (Contributor, Author)
Okay, that makes sense.

1. The matrices are first created on the host side and then copied to the device.
2. A shared piece of block-specific memory is created (on the device side) so that summation can be done very quickly.
3. The result is copied back to the host, where the accumulated error can be observed (see the copy-back sketch below).
4. Extra: the error that accumulates during the summation process is reduced (in the kernel itself) using the [Kahan summation algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm).
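(Not part of the README: a minimal sketch of step 3's copy-back, assuming cust's `CopyDestination` trait; `h_c` is the host buffer from the excerpt above, and `d_c` is a hypothetical binding for the `DeviceBuffer::from_slice` result:)

```rust
use cust::memory::CopyDestination;

// Copy the device result back into the host buffer, then make sure
// all queued GPU work has finished before inspecting it.
d_c.copy_to(&mut h_c[..])?;
stream.synchronize()?;
```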
@nnethercote (Collaborator)
I wonder if you should eliminate this README and move each per-sample description into a top-level doc comment in the sample's main.rs? Reason being that I generally find it useful to have code-specific documentation in the code rather than in a separate file. That way it's easier to find and more likely to be kept up to date.

@madhav-madhusoodanan (Contributor, Author)
Hmmm, that makes sense too.

This documentation structure was really inspired by the cuda-samples repo, but yes, I think it would be better to have them as comments.

@madhav-madhusoodanan force-pushed the add_cuda_examples branch 3 times, most recently from fd9cef8 to 9964c6b on December 11, 2025 at 20:16
…le of `Nvidia/cuda-samples` repo.

fix: added shared memory space for matrix multiplication calculation

chore: completed matmul/main.rs

fix: type corrections and proper copying of result data from device to host

fix: code cleanup and stream synchronization after copying C from device to host memory

chore: cargo fmt

feat: increased dimension of matrices being computed, and implemented Kahan's error correction to stop floating point accumulation errors

chore: update readme

fix: cargo.toml fixes

feat: code cleanup

Move documentation to sample-specific top-level comments in main.rs

chore: remove documentation