Samples: Ported the MatrixMul example of CUDA samples #341
Status: Open. madhav-madhusoodanan wants to merge 2 commits into Rust-GPU:main from madhav-madhusoodanan:add_cuda_examples. Diff: +322 −8.
One existing file is deleted by this PR; its contents failed to load in the diff view.
The new `Cargo.toml` for the matmul example (11 lines added):

```toml
[package]
name = "matmul"
version = "0.1.0"
edition = "2024"

[dependencies]
cust = { path = "../../../crates/cust" }
cuda_std = { path = "../../../crates/cuda_std" }

[build-dependencies]
cuda_builder = { workspace = true, default-features = false }
```
The example's `build.rs`, which compiles the kernel crate to PTX at build time (17 lines added):

```rust
use std::env;
use std::path;

use cuda_builder::CudaBuilder;

fn main() {
    println!("cargo::rerun-if-changed=build.rs");
    println!("cargo::rerun-if-changed=kernels");

    let out_path = path::PathBuf::from(env::var("OUT_DIR").unwrap());
    let manifest_dir = path::PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap());

    CudaBuilder::new(manifest_dir.join("kernels"))
        .copy_to(out_path.join("kernels.ptx"))
        .build()
        .unwrap();
}
```
The kernel crate's `Cargo.toml` (10 lines added):

```toml
[package]
name = "kernels"
version = "0.1.0"
edition = "2024"

[dependencies]
cuda_std = { path = "../../../../crates/cuda_std" }

[lib]
crate-type = ["cdylib", "rlib"]
```
The kernel crate's library source: a tiled matrix-multiply kernel with Kahan-compensated accumulation (76 lines added):

```rust
use core::mem::MaybeUninit;
use cuda_std::*;

// SAFETY: This function is unsafe because it dereferences raw pointers.
#[kernel]
pub unsafe fn matrix_mul_cuda(c: *mut f32, a: &[f32], b: &[f32], wa: usize, wb: usize) {
    let bx: usize = cuda_std::thread::block_idx().x as usize;
    let by: usize = cuda_std::thread::block_idx().y as usize;

    let tx: usize = cuda_std::thread::thread_idx().x as usize;
    let ty: usize = cuda_std::thread::thread_idx().y as usize;

    const BLOCK_SIZE: usize = 32;
    let a_begin = wa * BLOCK_SIZE * by;
    let a_end = a_begin + wa - 1;
    let a_step = BLOCK_SIZE;

    let b_begin = BLOCK_SIZE * bx;
    let b_step = BLOCK_SIZE * wb;

    let mut c_sub: f32 = 0.0;
    let mut kahan_correction_factor = 0.0f32;
    let mut bi = b_begin;

    for ai in (a_begin..=a_end).step_by(a_step) {
        // The equivalent CUDA C++ code for the below is:
        // ```
        // __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        // ```
        // This memory space is shared between threads of the same block.
        #[address_space(shared)]
        static mut As: [[MaybeUninit<f32>; BLOCK_SIZE]; BLOCK_SIZE] =
            [[const { MaybeUninit::uninit() }; BLOCK_SIZE]; BLOCK_SIZE];

        #[address_space(shared)]
        static mut Bs: [[MaybeUninit<f32>; BLOCK_SIZE]; BLOCK_SIZE] =
            [[const { MaybeUninit::uninit() }; BLOCK_SIZE]; BLOCK_SIZE];

        // Load one tile of A and one tile of B into shared memory;
        // each thread writes a single element of each tile.
        unsafe {
            As[ty][tx].write(a[ai + wa * ty + tx]);
            Bs[ty][tx].write(b[bi + wb * ty + tx]);
        }

        // Synchronize to make sure the tiles are fully loaded.
        cuda_std::thread::sync_threads();
        for k in 0..BLOCK_SIZE {
            // Typically, this would be a simple accumulation:
            // ```
            // c_sub += As[ty][k] * Bs[k][tx];
            // ```
            // However, to improve numerical stability, we use Kahan summation here
            // so that the rounding error is isolated instead of accumulating in c_sub.
            let input = unsafe { As[ty][k].assume_init() * Bs[k][tx].assume_init() };
            let y = input - kahan_correction_factor;
            let sum = c_sub + y;

            // Algebraically this correction factor would be zero; with f32
            // precision it captures the low-order bits lost in `c_sub + y`.
            kahan_correction_factor = (sum - c_sub) - y;
            c_sub = sum;
        }

        // Synchronize to make sure the preceding computation is done before the
        // next iteration overwrites the shared tiles of A and B.
        cuda_std::thread::sync_threads();

        bi += b_step;
    }

    let ci = wb * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    unsafe {
        *c.add(ci + wb * ty + tx) = c_sub;
    }
}
```
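The compensated accumulation in the kernel's inner loop can be sketched host-side in plain Rust. This is an illustrative sketch (the function names below are not part of the PR) showing why the `kahan_correction_factor` update reduces error compared to naive accumulation:

```rust
/// Naive left-to-right f32 summation; rounding error accumulates freely.
fn naive_sum(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Kahan (compensated) summation, mirroring the kernel's update:
/// the correction term captures the low-order bits lost when `sum + y`
/// rounds, and feeds them back into the next addition.
fn kahan_sum(xs: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    let mut correction = 0.0f32;
    for &x in xs {
        let y = x - correction;
        let t = sum + y;
        correction = (t - sum) - y;
        sum = t;
    }
    sum
}
```

Summing a million copies of `0.01f32` (the value every element of B holds in this example) makes the difference visible: the naive sum drifts by a noticeable amount once the running total grows large, while the compensated sum stays within a few ulps of the true value.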
The host-side example source (183 lines added):

```rust
/* This example demonstrates an implementation of matrix multiplication.
 *
 * 1. The matrices are first created on the host side and then copied to the device.
 * 2. A block-local piece of shared memory is created (on the device side) so that the summation can be done very quickly.
 * 3. The result is copied back to the host, where the accumulated error is checked.
 * 4. Extra: the error that accumulates during summation is reduced (in the kernel itself) using the
 *    [Kahan summation algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm).
 */

use cuda_std::glam::USizeVec2;
use cust::device::Device;
use cust::event::{Event, EventFlags};
use cust::function::{BlockSize, GridSize};
use cust::launch;
use cust::memory::{AsyncCopyDestination, DeviceBuffer, LockedBuffer};
use cust::module::Module;
use cust::stream::{Stream, StreamFlags};

static PTX: &str = include_str!(concat!(env!("OUT_DIR"), "/kernels.ptx"));

fn matrix_multiply(
    block_size: usize,
    dims_a: USizeVec2,
    dims_b: USizeVec2,
) -> Result<(), cust::error::CudaError> {
    let dims_c = USizeVec2::new(dims_b.x, dims_a.y);
    let size_a = dims_a.x * dims_a.y;
    let h_a = LockedBuffer::new(&1.0f32, size_a).expect("host array couldn't be initialized!");

    let size_b = dims_b.x * dims_b.y;
    let h_b = LockedBuffer::new(&0.01f32, size_b).expect("host array couldn't be initialized!");

    let stream = Stream::new(StreamFlags::NON_BLOCKING, None).expect("Stream couldn't be init!");

    let size_c = dims_b.x * dims_a.y;
    let mut h_c = LockedBuffer::new(&0.0f32, size_c).expect("host array couldn't be initialized!");

    let start_event = Event::new(EventFlags::DEFAULT)?;
    let stop_event = Event::new(EventFlags::DEFAULT)?;

    let d_a =
        DeviceBuffer::from_slice(h_a.as_slice()).expect("device array couldn't be initialized!");
    let d_b =
        DeviceBuffer::from_slice(h_b.as_slice()).expect("device array couldn't be initialized!");
    let d_c =
        DeviceBuffer::from_slice(h_c.as_slice()).expect("device array couldn't be initialized!");

    stream.synchronize().expect("Stream couldn't synchronize!");
    let threads = BlockSize::xy(block_size as u32, block_size as u32);
    let grid = GridSize::xy(
        (dims_b.x / (threads.x as usize)).try_into().unwrap(),
        (dims_a.y / (threads.y as usize)).try_into().unwrap(),
    );

    println!("Computing result using CUDA Kernel...");

    let module = Module::from_ptx(PTX, &[]).expect("Module couldn't be init!");
    let matrix_mul_cuda = module
        .get_function("matrix_mul_cuda")
        .expect("Kernel function not found!");

    unsafe {
        // The function definition of the kernel is:
        // ```
        // pub unsafe fn matrix_mul_cuda(c: *mut f32, a: &[f32], b: &[f32], wa: usize, wb: usize)
        // ```
        // For parameters of type `*mut T` or `*const T`, we pass only the device pointer.
        // For parameters of type `&[T]`, we must pass the device pointer as well as the length of the slice.
        launch!(matrix_mul_cuda<<<grid, threads, 0, stream>>>(
            d_c.as_device_ptr(),
            d_a.as_device_ptr(),
            d_a.len(),
            d_b.as_device_ptr(),
            d_b.len(),
            dims_a.x,
            dims_b.x
        ))?;
    }

    println!("Done!");
    stream.synchronize().expect("Stream couldn't synchronize!");

    start_event
        .record(&stream)
        .expect("Failed to record start_event in the CUDA stream!");

    const N_ITER: u32 = 300;

    for _ in 0..N_ITER {
        unsafe {
            launch!(matrix_mul_cuda<<<grid, threads, 0, stream>>>(
                d_c.as_device_ptr(),
                d_a.as_device_ptr(),
                d_a.len(),
                d_b.as_device_ptr(),
                d_b.len(),
                dims_a.x,
                dims_b.x,
            ))?;
        }
    }

    stop_event
        .record(&stream)
        .expect("Failed to record stop_event in the CUDA stream!");

    stop_event
        .synchronize()
        .expect("Stream couldn't synchronize!");

    let gpu_time: u128 = stop_event
        .elapsed(&start_event)
        .expect("Failed to calculate duration of GPU operations!")
        .as_micros();

    let avg_time = gpu_time as f32 / N_ITER as f32;
    println!(
        "Average time spent executing by the GPU: {} microseconds",
        avg_time
    );
    let flops_per_matrix_mul = 2.0 * (dims_a.x as f32) * (dims_a.y as f32) * (dims_b.x as f32);
    let giga_flops = (flops_per_matrix_mul / avg_time) / 1000.0;
    println!("Performance = {} GFlop/s", giga_flops);

    unsafe {
        d_c.async_copy_to(&mut h_c, &stream)
            .expect("Could not copy from device to host!");
    }
    stream.synchronize().expect("Stream couldn't synchronize!");

    // Check the computed result: test the relative error by the formula
    // |<x, y>_cpu - <x, y>_gpu| / |<x, y>_cpu|
    let machine_epsilon = 1.1920929E-07f32;
    let mut correct = true;

    for i in 0..(dims_c.x * dims_c.y) {
        let abs_err = (h_c[i] - (dims_a.x as f32 * 0.01f32)).abs();
        let dot_length = (dims_a.x as f32).abs();
        let abs_val = h_c[i].abs();
        let rel_err = abs_err / abs_val.max(dot_length * machine_epsilon);

        if rel_err > 1e-6 {
            println!(
                "Error at index {}: CPU = {}, GPU = {}, rel_err = {}",
                i,
                dims_a.x as f32 * 0.01f32,
                h_c[i],
                rel_err
            );
            correct = false;
        }
    }

    if correct {
        println!("Result = PASS");
        println!(
            "NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled."
        );
    } else {
        println!("Result = FAIL");
        return Err(cust::error::CudaError::UnknownError);
    }

    Ok(())
}

fn main() -> Result<(), cust::error::CudaError> {
    // Set up the CUDA context; it must stay alive for the duration of the program.
    let _ctx = cust::quick_init()?;
    let device = Device::get_device(0).expect("Couldn't find CUDA supported devices!");
    println!("Device Name: {}", device.name().unwrap());

    let block_size: u32 = 32;
    let dims_a = USizeVec2::new(40 * block_size as usize, 40 * block_size as usize);
    let dims_b = USizeVec2::new(80 * block_size as usize, 40 * block_size as usize);

    if dims_a.x != dims_b.y {
        panic!("Matrix multiplication not possible with the given dimensions!");
    }

    matrix_multiply(block_size as usize, dims_a, dims_b)?;
    Ok(())
}
```
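The verification loop at the end relies on the inputs being constant: since A is all `1.0` and B is all `0.01`, every element of C should equal `wa * 0.01`, so the check can compare against that closed-form value with a length-scaled tolerance. A minimal host-side sketch of that check (the function name and parameters here are illustrative, not part of the PR):

```rust
/// Relative-error check mirroring the sample's verification step:
/// each GPU result is compared against the analytically known value
/// `dot_length * 0.01`, with the denominator floored at
/// `dot_length * machine_epsilon` to avoid division by tiny values.
fn results_match(gpu: &[f32], dot_length: usize, tolerance: f32) -> bool {
    let machine_epsilon = f32::EPSILON; // 1.1920929e-7
    let expected = dot_length as f32 * 0.01f32;
    gpu.iter().all(|&v| {
        let abs_err = (v - expected).abs();
        let rel_err = abs_err / v.abs().max(dot_length as f32 * machine_epsilon);
        rel_err <= tolerance
    })
}
```

Scaling the tolerance denominator by the dot-product length matters because the expected rounding error of a length-n f32 dot product grows with n; a fixed absolute tolerance would spuriously fail for large matrices.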
Review comment: "This could be a `const BLOCK_SIZE: u32 = 32;`"