Commit 42afe4b

Add experimental 12 sdk support to guide (#167)
1 parent 400f816 commit 42afe4b

1 file changed: guide/src/guide/getting_started.md (+40 −27 lines)
@@ -1,18 +1,25 @@
# Getting Started

This section covers how to get started writing GPU crates with `cuda_std` and `cuda_builder`.

## Required Libraries

Before you can use the project to write GPU crates, you will need a couple of prerequisites:

- [The CUDA SDK](https://developer.nvidia.com/cuda-downloads), version `11.2-11.8` (and the appropriate driver - [see cuda release notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)).

  - We recently [added experimental support for the `12.x` SDK](https://github.com/Rust-GPU/Rust-CUDA/issues/100), please file any issues you encounter.

  This is only needed for building GPU crates; to execute built PTX you only need CUDA `9+`.

- LLVM 7.x (7.0 to 7.4). The codegen searches multiple places for LLVM:
  - If `LLVM_CONFIG` is present, it will use that path as `llvm-config`.
  - Or, if `llvm-config` is present as a binary, it will use that, assuming that `llvm-config --version` returns `7.x.x`.
  - Finally, if neither is present or usable, it will attempt to download and use a prebuilt LLVM. This currently only works on Windows, however.

- The OptiX SDK if using the optix library (the pathtracer example uses it for denoising).

@@ -69,10 +76,11 @@ use cuda_std::*;
```

This does a couple of things:

- It only applies the attributes if we are compiling the crate for the GPU (target_os = "cuda").
- It declares the crate to be `no_std` on CUDA targets.
- It registers a special attribute required by the codegen for things like figuring out what functions are GPU kernels.
- It explicitly includes the `kernel` macro and the `thread` module.
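For reference, the prelude these points describe would look roughly like the following (a minimal sketch; the exact feature and attribute names may vary between versions of the codegen):

```rs
// Apply GPU-specific attributes only when compiling for the CUDA target.
#![cfg_attr(
    target_os = "cuda",
    no_std,
    feature(register_attr),
    register_attr(nvvm_internal)
)]

use cuda_std::*;
```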
If you would like to use `alloc` or things like printing from GPU kernels (which requires alloc) then you need to declare `alloc` too:
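The declaration itself is the standard one, for example:

```rs
extern crate alloc;
```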
@@ -89,7 +97,7 @@ Finally, if you would like to use types such as slices or arrays inside of GPU k
## Writing our first GPU kernel

Now we can finally start writing an actual GPU kernel.

<details>
<summary>Expand this section if you are not familiar with how GPU-side CUDA works</summary>
@@ -102,24 +110,25 @@ thread, with the number of threads being decided by the caller (the CPU).
We call these parameters the launch dimensions of the kernel. Launch dimensions are split up into two basic concepts:

- Threads: a single thread executes the GPU kernel **once**, and it makes its own index available to the kernel through special registers (functions in our case).
- Blocks: blocks house multiple threads that they execute on their own. Thread indices are only unique within the thread's block, therefore CUDA also exposes the index of the current block.

One important thing to note is that block and thread dimensions may be 1d, 2d, or 3d. That is to say, I can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could also launch `5x5x5` blocks. This is very useful for 2d/3d applications because it makes the 2d/3d index calculations much simpler. CUDA exposes thread and block indices for each dimension through special registers. We expose thread index queries through `cuda_std::thread`, as sketched below.
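For example, the conventional global 1d index computation (essentially what `thread::index_1d` does) can be sketched like this; the per-dimension query names are assumptions based on `cuda_std::thread`'s naming:

```rs
use cuda_std::thread;

// Each block contributes `block_dim_x` threads, so a thread's global
// position is its block's offset plus its index within that block.
fn global_index_1d() -> u32 {
    thread::block_idx_x() * thread::block_dim_x() + thread::thread_idx_x()
}
```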

</details>

Now that we know how GPU functions work, let's write a simple kernel. We will write a kernel which does `[1, 2, 3, 4] + [1, 2, 3, 4] = [2, 4, 6, 8]`. We will use a 1-dimensional index and the `cuda_std::thread::index_1d` utility method to calculate a globally-unique thread index for us (this index is only unique if the kernel was launched with a 1d launch config!).

```rs
@@ -134,16 +143,18 @@ pub unsafe fn add(a: &[f32], b: &[f32], c: *mut f32) {
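    // The kernel body is unchanged by this commit, so the diff elides it;
    // this is a sketch reconstructed from the signature above and the
    // explanation below (`index_1d`, the bounds check, and the pointer write).
    let idx = thread::index_1d() as usize;
    if idx < a.len() {
        // Only create the mutable reference once we know this thread's
        // element is disjoint from every other thread's.
        let elem = &mut *c.add(idx);
        *elem = a[idx] + b[idx];
    }
}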
```
If you have used CUDA C++ before, this should seem fairly familiar, with a few oddities:

- Kernel functions must currently be `unsafe`; this is because the semantics of Rust safety on the GPU are still very much undecided. This restriction will probably be removed in the future.
- We use `*mut f32` and not `&mut [f32]`. This is because using `&mut` in function arguments is unsound: rustc assumes `&mut` does not alias, but because every thread gets a copy of the arguments, the reference would alias, violating this invariant and yielding technically unsound code. Pointers, on the other hand, carry no such invariant. Therefore, we use a pointer and only make a mutable reference once we are sure the elements are disjoint: `let elem = &mut *c.add(idx);`.
- We check that the index is not out of bounds before doing anything, because it is common to launch kernels with thread counts that are not exactly divisible by the length, for optimization.

Internally what this does is it first checks that a couple of things are right in the kernel:

- All parameters are `Copy`.
- The function is `unsafe`.
- The function does not return anything.
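For instance, under these checks a kernel like the following would presumably be rejected, since it returns a value (a hypothetical example, not from the guide):

```rs
// Presumably rejected by the codegen's checks: kernels must not return
// anything, as there is no GPU-side caller to receive the value.
#[kernel]
pub unsafe fn double(x: f32) -> f32 {
    x * 2.0
}
```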
@@ -180,7 +191,7 @@ fn main() {
```

The first argument is the path to the root of the GPU crate you are trying to build, which would probably be `../name` in our case. The second method, `.copy_to(path)`, tells the builder to copy the built PTX file somewhere. By default the builder puts the PTX file inside of `target/cuda-builder/nvptx64-nvidia-cuda/release/crate_name.ptx`, but it is usually helpful to copy it to another path, which is what this method does. Finally, `build()` actually runs rustc to compile the crate. This may take a while since it needs to build things like core from scratch, but after the first compile, incremental will make it much faster.
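Since the body of `fn main()` is elided above, here is a minimal sketch of what a typical builder invocation looks like (the crate path and output path are illustrative):

```rs
use cuda_builder::CudaBuilder;

fn main() {
    CudaBuilder::new("../gpu")            // path to the root of the GPU crate
        .copy_to("../resources/add.ptx")  // copy the built PTX somewhere convenient
        .build()
        .unwrap();
}
```

The copied PTX can then be embedded into the CPU crate, for example with `include_str!`.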
@@ -212,15 +223,17 @@ components = ["rust-src", "rustc-dev", "llvm-tools-preview"]
There is also a [Dockerfile](Dockerfile) prepared as a quickstart with all the necessary libraries for base CUDA development.

You can use it as follows (assuming your clone of Rust-CUDA is at the absolute path `RUST_CUDA`):

- Ensure you have Docker set up to [use gpus](https://docs.docker.com/config/containers/resource_constraints/#gpu).
- Build: `docker build -t rust-cuda $RUST_CUDA`
- Run: `docker run -it --gpus all -v $RUST_CUDA:/root/rust-cuda --entrypoint /bin/bash rust-cuda`
- Running will drop you into the container's shell and you will find the project at `~/rust-cuda`.
- If all is well, you'll be able to `cargo run` in `~/rust-cuda/examples/cuda/cpu/add`.

**Notes:**

1. Refer to [rust-toolchain](#rust-toolchain) to ensure you are using the correct toolchain in your project.
2. Despite using Docker, your machine will still need to be running a compatible driver; in this case, for CUDA 11.4.1 it is >=470.57.02.
3. If you have issues within the container, it can help to start by ensuring your GPU is recognized:
   - Ensure `nvidia-smi` provides meaningful output in the container.
   - NVIDIA provides a number of samples at https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well, you should see many details about your GPU.
