Samples: Ported the MatrixMul example of CUDA samples #341
Conversation
Force-pushed from 2f731da to 7084de1
Seems there are other modules that need formatting.
nnethercote left a comment:
Thanks for doing this! Generally looking good, though I have a few suggestions.
Also, can you squash the commits together? They conceptually belong together, and there's no point polluting version control history with all these incomplete intermediate revisions.
Cargo.toml (outdated):
    "samples/introduction/async_api",
    "samples/introduction/async_api/kernels",
No need for a blank line here.
samples/introduction/README.md (outdated):
    3. The CPU can query these events to check whether the GPU has finished its work, allowing for coordination between the two processors without blocking the CPU.

    ## [matrixMul](https://github.com/Rust-GPU/rust-cuda/samples/introduction/matmul)
    This example demonstrates an example kernel implementation of matrix multiplicaation.
Suggested change:
-   This example demonstrates an example kernel implementation of matrix multiplicaation.
+   This example demonstrates an example kernel implementation of matrix multiplication.
Thank you for pointing that out!
    // However, to improve numerical stability, we use Kahan summation here so that the error can be isolated
    // and not allow it to accumulate in c_sub
    unsafe {
        let input = As[ty][k].assume_init() * Bs[k][tx].assume_init();
Can this unsafe block be shrunk so it wraps just the initializer, i.e. let input = unsafe { ... };?
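Not from the PR itself, but a minimal sketch of the suggestion (the `MaybeUninit` values here are hypothetical stand-ins for the kernel's shared-memory tiles): keep only the operation that actually needs `unsafe` inside the block, so the compiler keeps checking everything around it.

```rust
use std::mem::MaybeUninit;

fn main() {
    // Hypothetical stand-ins for the shared-memory tile elements.
    let mut a = MaybeUninit::<f32>::uninit();
    let mut b = MaybeUninit::<f32>::uninit();
    a.write(2.0);
    b.write(3.0);

    // Narrowed scope: only the `assume_init` calls, which are the unsafe
    // operations, sit inside the block.
    let input = unsafe { a.assume_init() * b.assume_init() };
    println!("{input}");
}
```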
    // SAFETY: This function is unsafe because it dereferences raw pointers.
    #[kernel]
    pub unsafe fn matrix_mul_cuda(C: *mut f32, A: *const f32, B: *const f32, wa: usize, wb: usize) {
Probably better to name these c, a, b, given that upper-case names are normally used for statics. You'll need to rename some variables below as well, perhaps ai and bi would work.
c must be a raw pointer because it's modified. But a and b can be slices because they are const, which would be more idiomatic. Can you change them?
Won't this cause the error below? I'm running into it in CI:
error: `extern` fn uses type `[f32]`, which is not FFI-safe
--> samples/introduction/matmul/kernels/src/lib.rs:6:47
|
6 | pub unsafe fn matrix_mul_cuda(c: *mut f32, a: &[f32], b: &[f32], wa: usize, wb: usize) {
| ^^^^^^ not FFI-safe
|
= help: consider using a raw pointer instead
= note: slices have no C equivalent
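For reference (not from the PR): the usual workaround for this error is exactly what the compiler hints at, namely keeping the extern signature FFI-safe by passing a raw pointer plus a length and rebuilding the slice on the Rust side. A sketch with hypothetical names:

```rust
/// Sums `len` floats starting at `ptr`.
///
/// # Safety
/// `ptr` must point to `len` valid, initialized `f32` values.
unsafe fn sum_floats(ptr: *const f32, len: usize) -> f32 {
    // Only the raw pointer and length cross the FFI boundary; the slice
    // exists purely on the Rust side, where its layout is not an ABI concern.
    let data = unsafe { std::slice::from_raw_parts(ptr, len) };
    data.iter().sum()
}

fn main() {
    let v = [1.0f32, 2.0, 3.0];
    let total = unsafe { sum_floats(v.as_ptr(), v.len()) };
    println!("{total}");
}
```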
    let device = Device::get_device(0).expect("Couldn't find Cuda supported devices!");
    println!("Device Name: {}", device.name().unwrap());

    let block_size: u32 = 32;
This could be a const BLOCK_SIZE: u32 = 32;
    let block_size: u32 = 32;
    let dims_a: (usize, usize, usize) = (40 * block_size as usize, 40 * block_size as usize, 1);
    let dims_b: (usize, usize, usize) = (80 * block_size as usize, 40 * block_size as usize, 1);
I don't like the use of a tuple for the dimensions. You could instead use cuda_std::glam::UsizeVec2. That would let you use .x and .y in matrix_mul_cuda, and it would avoid the need to set the unused z field to 1.
Also, the numbers here (40 and 80) are different from the original code (10 and 20). Any reason for that?
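Not from the PR: a tiny sketch of the idea with a hand-rolled 2-D type. The actual suggestion is the glam vector type re-exported by cuda_std; the `Dims2` struct here is hypothetical and only illustrates dropping the unused z field and getting `.x`/`.y` access.

```rust
// Hypothetical 2-D dimensions type replacing the (usize, usize, usize) tuple.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Dims2 {
    x: usize,
    y: usize,
}

fn main() {
    const BLOCK_SIZE: usize = 32;
    // No unused `z: 1` field to carry around.
    let dims_a = Dims2 { x: 40 * BLOCK_SIZE, y: 40 * BLOCK_SIZE };
    let dims_b = Dims2 { x: 80 * BLOCK_SIZE, y: 40 * BLOCK_SIZE };
    // C = A * B has A's row count and B's column count.
    let c_elems = dims_a.y * dims_b.x;
    println!("{c_elems}");
}
```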
Thank you for telling me about cuda_std::glam::UsizeVec2!
About the numbers (40 and 80): I noticed that 10 and 20 (from the original example) didn't accumulate much summation error, while larger dimensions did.
I want the samples to make developers aware that such basic pitfalls exist and how they can be mitigated (one way being the Kahan summation algorithm).
What do you think?
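For readers following along, this is the technique being discussed: Kahan (compensated) summation tracks the low-order bits lost by each floating-point addition in a separate compensation term. A self-contained sketch, not taken from the kernel itself:

```rust
// Kahan (compensated) summation: the error bound stays small regardless
// of how many terms are added, unlike naive left-to-right summation.
fn kahan_sum(values: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    let mut c = 0.0f32; // running compensation for lost low-order bits
    for &v in values {
        let y = v - c;        // fold the previous rounding error back in
        let t = sum + y;      // this addition may round
        c = (t - sum) - y;    // recover exactly what the rounding discarded
        sum = t;
    }
    sum
}

fn main() {
    // Adding many small values to one large value: in f32, 1.0e8 + 1.0
    // rounds straight back to 1.0e8, so the naive sum never moves.
    let mut values = vec![1.0e8f32];
    values.extend(std::iter::repeat(1.0f32).take(10_000));
    let naive: f32 = values.iter().sum();
    println!("naive = {naive}, kahan = {}", kahan_sum(&values)); // naive stays 1.0e8
}
```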
    DeviceBuffer::from_slice(h_c.as_slice()).expect("device array couldn't be initialized!");

    stream.synchronize().expect("Stream couldn't synchronize!");
    let blocks = BlockSize::xy(block_size as u32, block_size as u32);
Ugh, BlockSize and GridSize should really store usize instead of u32. But nothing to be done about it here.
Also, blocks is called threads in the original, which I find clearer. (blocks makes me think it's the number of blocks, which is what grid actually is.)
Okay, that makes sense.
samples/introduction/README.md (outdated):
    1. The matrices are first created on the host side and then copied to the device.
    2. A shared piece of block-specific memory is created (on the device side), so that summation can be done very quickly
    3. The result is copied back to host, where the accumulated error occur.
    4. Extra: The error that accumulates during the summation process is reduced (in the kernel itself) using [Kahan summation algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm).
I wonder if you should eliminate this README and move each per-sample description into a top-level doc comment in the sample's main.rs? Reason being that I generally find it useful to have code-specific documentation in the code rather than in a separate file. That way it's easier to find and more likely to be kept up to date.
Hmmm, that too makes sense.
The documentation structure was really inspired by the cuda-samples repo, but yes, I think it would be better to have these as comments.
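Not from the PR: a sketch of what the suggestion could look like, with the per-sample description moved into a `//!` module-level doc comment at the top of the sample's main.rs (text abridged from the README under review):

```rust
//! # matrixMul
//!
//! This sample demonstrates a kernel implementation of matrix multiplication.
//!
//! 1. The matrices are created on the host side and copied to the device.
//! 2. A block-specific piece of shared memory is created on the device side
//!    so that summation can be done quickly.
//! 3. The result is copied back to the host.
//! 4. Extra: the error that accumulates during summation is reduced in the
//!    kernel itself using the Kahan summation algorithm.

fn main() {
    // Placeholder body; the point of this sketch is the `//!` comment above.
    println!("matrixMul sample");
}
```

Module-level doc comments like this also show up in rustdoc output, which a separate README does not.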
Force-pushed from fd9cef8 to 9964c6b
…le of `Nvidia/cuda-samples` repo.
fix: added shared memory space for matrix multiplication calculation
chore: completed matmul/main.rs
fix: type corrections and proper copying of result data from device to host
fix: code cleanup and stream synchronization after copying C from device to host memory
chore: cargo fmt
feat: increased dimension of matrices being computed, and implemented Kahan's error correction to stop floating point accumulation errors
chore: update readme
fix: cargo.toml fixes
feat: code cleanup
Move documentation to sample-specific top-level comments in main.rs
chore: remove documentation
Force-pushed from 9964c6b to 95fac67
Context
Ported the matrixMul sample from NVIDIA/cuda-samples.

Relevant Issue