
Conversation

@madhav-madhusoodanan (Contributor) commented Dec 10, 2025

Context

Ported the matrixMul sample from NVIDIA/cuda-samples.

Relevant Issue

@madhav-madhusoodanan force-pushed the add_cuda_examples branch 2 times, most recently from 2f731da to 7084de1 on December 10, 2025 at 21:33
@madhav-madhusoodanan (Contributor, Author)

Seems there are other modules that need formatting.

@nnethercote (Collaborator) left a comment

Thanks for doing this! Generally looking good, though I have a few suggestions.

Also, can you squash the commits together? They conceptually belong together, and there's no point polluting version control history with all these incomplete intermediate revisions.

Cargo.toml (outdated)

"samples/introduction/async_api",
"samples/introduction/async_api/kernels",

@nnethercote (Collaborator)

No need for a blank line here.

3. The CPU can query these events to check whether the GPU has finished its work, allowing for coordination between the two processors without blocking the CPU.
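(Not part of the diff: a minimal host-side sketch of such a non-blocking query. This assumes cust's `Event` API matches RustaCUDA's `Event::new`/`record`/`query` and that `EventStatus` is comparable; `stream` is a hypothetical `cust::stream::Stream` with work already queued.)

```rust
use cust::event::{Event, EventFlags, EventStatus};

// Record an event after the async work queued on `stream`.
let event = Event::new(EventFlags::DEFAULT)?;
event.record(&stream)?;

// Poll the event instead of blocking; `Ready` means all GPU work
// queued before the record call has completed.
while event.query()? == EventStatus::NotReady {
    // ... do useful CPU work here instead of waiting ...
}
```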

## [matrixMul](https://github.com/Rust-GPU/rust-cuda/samples/introduction/matmul)
This example demonstrates an example kernel implementation of matrix multiplicaation.
@nnethercote (Collaborator)
Suggested change:
```diff
-This example demonstrates an example kernel implementation of matrix multiplicaation.
+This example demonstrates an example kernel implementation of matrix multiplication.
```

@madhav-madhusoodanan (Contributor, Author)

Thank you for pointing that out!

```rust
// However, to improve numerical stability, we use Kahan summation here so that
// the error can be isolated and not allowed to accumulate in c_sub.
unsafe {
    let input = As[ty][k].assume_init() * Bs[k][tx].assume_init();
```
@nnethercote (Collaborator)
Can this `unsafe` block be shrunk to just surround the assignment, i.e. `let input = unsafe { ... };`?
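For illustration, the narrowed form would be something like this (a sketch of the suggestion, not code from the PR):

```rust
// Only the MaybeUninit reads need `unsafe`; the rest of the loop body stays safe.
let input = unsafe { As[ty][k].assume_init() * Bs[k][tx].assume_init() };
```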


```rust
// SAFETY: This function is unsafe because it dereferences raw pointers.
#[kernel]
pub unsafe fn matrix_mul_cuda(C: *mut f32, A: *const f32, B: *const f32, wa: usize, wb: usize) {
```
@nnethercote (Collaborator)
Probably better to name these `c`, `a`, `b`, given that upper-case names are normally used for statics. You'll need to rename some variables below as well; perhaps `ai` and `bi` would work.
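A sketch of the suggested renaming (hypothetical; the slice question below is still open at this point):

```rust
// SAFETY: This function is unsafe because it dereferences raw pointers.
#[kernel]
pub unsafe fn matrix_mul_cuda(c: *mut f32, a: *const f32, b: *const f32, wa: usize, wb: usize) {
    // ... index variables in the body renamed accordingly, e.g. `ai` and `bi` ...
}
```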

@nnethercote (Collaborator)
`c` must be a raw pointer because it's modified. But `a` and `b` can be slices because they are const, which would be more idiomatic. Can you change them?

@madhav-madhusoodanan (Contributor, Author)
Won't this cause the error below? I'm running into it in CI:

```text
error: `extern` fn uses type `[f32]`, which is not FFI-safe
 --> samples/introduction/matmul/kernels/src/lib.rs:6:47
  |
6 | pub unsafe fn matrix_mul_cuda(c: *mut f32, a: &[f32], b: &[f32], wa: usize, wb: usize) {
  |                                               ^^^^^^ not FFI-safe
  |
  = help: consider using a raw pointer instead
  = note: slices have no C equivalent
```

```rust
let device = Device::get_device(0).expect("Couldn't find Cuda supported devices!");
println!("Device Name: {}", device.name().unwrap());

let block_size: u32 = 32;
```
@nnethercote (Collaborator)
This could be a `const BLOCK_SIZE: u32 = 32;`.


```rust
let block_size: u32 = 32;
let dims_a: (usize, usize, usize) = (40 * block_size as usize, 40 * block_size as usize, 1);
let dims_b: (usize, usize, usize) = (80 * block_size as usize, 40 * block_size as usize, 1);
```
@nnethercote (Collaborator) commented Dec 10, 2025
I don't like the use of a tuple for the dimensions. You could instead use `cuda_std::glam::UsizeVec2`. That would let you use `.x` and `.y` in `matrix_mul_cuda`, and it would avoid the need to set the unused `z` field to 1.

Also, the numbers (40 and 80) here are different to the original code (10 and 20). Any reason for that?
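(For reference, the suggested change might look like this sketch, assuming the `UsizeVec2` re-export named above and folding in the earlier `const` suggestion:)

```rust
use cuda_std::glam::UsizeVec2;

const BLOCK_SIZE: u32 = 32;

// 2-D dimensions only; no unused z field to set to 1.
let dims_a = UsizeVec2::new(40 * BLOCK_SIZE as usize, 40 * BLOCK_SIZE as usize);
let dims_b = UsizeVec2::new(80 * BLOCK_SIZE as usize, 40 * BLOCK_SIZE as usize);
```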

@madhav-madhusoodanan (Contributor, Author) commented Dec 11, 2025

Thank you for telling me about `cuda_std::glam::UsizeVec2`!

About the numbers (40 and 80): I noticed that 10 and 20 (from the original example) didn't accumulate much summation error, while larger dimensions did.

I want to give developers who view the samples some clarity about the existence of such basic pitfalls and how they can be mitigated (one way being the Kahan summation algorithm).

What do you think?
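(For readers unfamiliar with it, a minimal scalar sketch of Kahan summation, independent of the kernel code in this PR:)

```rust
/// Compensated (Kahan) summation: `c` tracks the low-order bits lost
/// to rounding at each step, so the error stays bounded instead of growing.
fn kahan_sum(values: &[f32]) -> f32 {
    let mut sum = 0.0_f32;
    let mut c = 0.0_f32; // running compensation
    for &v in values {
        let y = v - c; // apply the correction to the incoming value
        let t = sum + y; // big + small: low-order bits of `y` may be lost here
        c = (t - sum) - y; // recover exactly what was lost
        sum = t;
    }
    sum
}
```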

```rust
DeviceBuffer::from_slice(h_c.as_slice()).expect("device array couldn't be initialized!");

stream.synchronize().expect("Stream couldn't synchronize!");
let blocks = BlockSize::xy(block_size as u32, block_size as u32);
```
@nnethercote (Collaborator)
Ugh, `BlockSize` and `GridSize` should really store `usize` instead of `u32`. But nothing to be done about it here.

@nnethercote (Collaborator)
Also, `blocks` is called `threads` in the original, which I find clearer. (`blocks` makes me think it's the number of blocks, which is what `grid` actually is.)
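That is, something like this sketch (assuming `GridSize::xy` mirrors `BlockSize::xy`, with the grid computation taken from the original CUDA sample):

```rust
// Threads per block vs. number of blocks, named as in the original sample.
let threads = BlockSize::xy(BLOCK_SIZE, BLOCK_SIZE);
let grid = GridSize::xy(
    (dims_b.x / BLOCK_SIZE as usize) as u32,
    (dims_a.y / BLOCK_SIZE as usize) as u32,
);
```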

@madhav-madhusoodanan (Contributor, Author)
Okay, that makes sense.

1. The matrices are first created on the host side and then copied to the device.
2. A shared piece of block-specific memory is created (on the device side) so that summation can be done very quickly.
3. The result is copied back to the host, where the accumulated error can be observed (see the copy-back sketch below).
4. Extra: the error that accumulates during the summation process is reduced (in the kernel itself) using the [Kahan summation algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm).
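(Not part of the README: a minimal sketch of step 3's copy-back, assuming cust's `CopyDestination` trait; `h_c` is the host buffer from the excerpt above, and `d_c` is a hypothetical binding for the `DeviceBuffer::from_slice` result:)

```rust
use cust::memory::CopyDestination;

// Copy the device result back into the host buffer, then make sure
// all queued GPU work has finished before inspecting it.
d_c.copy_to(&mut h_c[..])?;
stream.synchronize()?;
```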
@nnethercote (Collaborator)
I wonder if you should eliminate this README and move each per-sample description into a top-level doc comment in the sample's main.rs? Reason being that I generally find it useful to have code-specific documentation in the code rather than in a separate file. That way it's easier to find and more likely to be kept up to date.

@madhav-madhusoodanan (Contributor, Author)
Hmmm, that makes sense too.

This documentation structure was really inspired by the cuda-samples repo, but yes, I think it would be better to have them as comments.

@madhav-madhusoodanan force-pushed the add_cuda_examples branch 3 times, most recently from fd9cef8 to 9964c6b on December 11, 2025 at 20:16
…le of `Nvidia/cuda-samples` repo.

fix: added shared memory space for matrix multiplication calculation

chore: completed matmul/main.rs

fix: type corrections and proper copying of result data from device to host

fix: code cleanup and stream synchronization after copying C from device to host memory

chore: cargo fmt

feat: increased dimension of matrices being computed, and implemented Kahan's error correction to stop floating point accumulation errors

chore: update readme

fix: cargo.toml fixes

feat: code cleanup

Move documentation to sample-specific top-level comments in main.rs

chore: remove documentation