# Getting Started

This section covers how to get started writing GPU crates with `cuda_std` and `cuda_builder`.

## Required Libraries

Before you can use the project to write GPU crates, you will need a couple of prerequisites:

- [The CUDA SDK](https://developer.nvidia.com/cuda-downloads), version `11.2`-`11.8`, and the appropriate driver ([see the CUDA release notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)).
  - We recently [added experimental support for the `12.x` SDK](https://github.com/Rust-GPU/Rust-CUDA/issues/100); please file any issues you see.

  This is only needed for building GPU crates; to execute built PTX, you only need CUDA `9+`.

- LLVM 7.x (7.0 to 7.4). The codegen searches multiple places for LLVM:
  - If `LLVM_CONFIG` is set, it will use that path as `llvm-config`.
  - Otherwise, if `llvm-config` is present as a binary, it will use that, assuming `llvm-config --version` returns `7.x.x`.
  - Finally, if neither is present or usable, it will attempt to download and use prebuilt LLVM. This currently only works on Windows, however.

- The OptiX SDK, if you are using the `optix` library (the pathtracer example uses it for denoising).
```rust
use cuda_std::*;
```

This does a couple of things:

- It only applies the attributes if we are compiling the crate for the GPU (`target_os = "cuda"`).
- It declares the crate to be `no_std` on CUDA targets.
- It registers a special attribute required by the codegen for things like figuring out which functions are GPU kernels.
- It explicitly includes the `kernel` macro and `thread`.

If you would like to use `alloc` or things like printing from GPU kernels (which requires `alloc`), then you need to declare `alloc` too:
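As a hedged sketch, such a crate root might look like the following. The `register_attr`/`nvvm_internal` attribute names are assumptions here (check the current `cuda_std` documentation for the exact form); on the host side this compiles as ordinary Rust, which is what the `cfg_attr` gating is for:

```rust
// Sketch of a GPU crate root that also declares `alloc`.
// The cfg_attr gating mirrors the bullet points above: the attributes only
// apply when compiling for the CUDA target. `register_attr(nvvm_internal)`
// is an assumption -- verify against cuda_std's docs.
#![cfg_attr(
    target_os = "cuda",
    no_std,
    feature(register_attr),
    register_attr(nvvm_internal)
)]

extern crate alloc;

// Once declared, `alloc` types are usable on both host and device.
use alloc::vec::Vec;

fn demo() -> Vec<i32> {
    (1..=4).collect()
}

fn main() {
    println!("{:?}", demo());
}
```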
Finally, if you would like to use types such as slices or arrays inside of GPU kernels …

## Writing our first GPU kernel

Now we can finally start writing an actual GPU kernel.

<details>
<summary>Expand this section if you are not familiar with how GPU-side CUDA works</summary>
… thread, with the number of threads being decided by the caller (the CPU).

We call these parameters the launch dimensions of the kernel. Launch dimensions are split up into two basic concepts:

- Threads: a single thread executes the GPU kernel **once**, and it makes its own index available to the kernel through special registers (functions, in our case).
- Blocks: blocks house multiple threads that they execute on their own. Thread indices are only unique within the thread's block, so CUDA also exposes the index of the current block.

One important thing to note is that block and thread dimensions may be 1d, 2d, or 3d. That is to say, I can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could also launch `5x5x5` blocks. This is very useful for 2d/3d applications because it makes the 2d/3d index calculations much simpler. CUDA exposes thread and block indices for each dimension through special registers. We expose thread index queries through `cuda_std::thread`.
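For the 1d case, the globally-unique index these queries let you compute reduces to simple arithmetic. This is a host-side sketch of that arithmetic, not the actual device implementation:

```rust
// Each block contributes `block_dim` threads, so a thread's global 1d index
// is its block's index times the block size, plus its index within the block.
fn global_index_1d(block_idx: u32, block_dim: u32, thread_idx: u32) -> u32 {
    block_idx * block_dim + thread_idx
}

fn main() {
    // Launching 2 blocks of 4 threads: block 1, thread 2 is global index 6.
    println!("{}", global_index_1d(1, 4, 2));
}
```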
</details>
Now that we know how GPU functions work, let's write a simple kernel. We will write a kernel which does `[1, 2, 3, 4] + [1, 2, 3, 4] = [2, 4, 6, 8]`. We will use a 1-dimensional index and the `cuda_std::thread::index_1d` utility method to calculate a globally-unique thread index for us (this index is only unique if the kernel was launched with a 1d launch config!).
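A hedged sketch of what such a kernel could look like, reconstructed from the description in this section rather than copied from the crate (the `#[kernel]` attribute and `thread::index_1d` are `cuda_std` items; the exact signature may differ in the current release):

```rust
use cuda_std::*;

// Sketch only -- consult cuda_std's docs for the exact, current form.
#[kernel]
pub unsafe fn add(a: &[f32], b: &[f32], c: *mut f32) {
    let idx = thread::index_1d() as usize;
    if idx < a.len() {
        // Only form a mutable reference once we know this element is ours.
        let elem = &mut *c.add(idx);
        *elem = a[idx] + b[idx];
    }
}
```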
If you have used CUDA C++ before, this should seem fairly familiar, with a few oddities:

- Kernel functions currently must be `unsafe`; this is because the semantics of Rust safety on the GPU are still very much undecided. This restriction will probably be removed in the future.
- We use `*mut f32` and not `&mut [f32]`. This is because using `&mut` in function arguments is unsound: rustc assumes `&mut` does not alias, but because every thread gets a copy of the arguments, the references would alias, violating this invariant and yielding technically unsound code. Pointers, on the other hand, carry no such invariant, so we use a pointer and only create a mutable reference once we are sure the elements are disjoint: `let elem = &mut *c.add(idx);`.
- We check that the index is not out of bounds before doing anything, because it is common (as an optimization) to launch kernels with thread counts that are not exactly divisible by the length.
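The pointer-then-disjoint-reference pattern from the second bullet can be exercised on the CPU as well. In this sketch, the hypothetical `add_one` helper stands in for one thread's work, and over-"launching" is harmless thanks to the bounds check:

```rust
// CPU-side sketch of the kernel's pattern: take raw pointers, bounds-check
// the index, then form a mutable reference to one element only.
unsafe fn add_one(a: *const f32, b: *const f32, c: *mut f32, n: usize, idx: usize) {
    if idx < n {
        // Each `idx` touches a disjoint element, so this borrow cannot alias.
        let elem = &mut *c.add(idx);
        *elem = *a.add(idx) + *b.add(idx);
    }
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [1.0f32, 2.0, 3.0, 4.0];
    let mut c = [0.0f32; 4];
    // Simulate launching more "threads" than elements; the bounds check
    // turns the extra ones into no-ops.
    for idx in 0..6 {
        unsafe { add_one(a.as_ptr(), b.as_ptr(), c.as_mut_ptr(), c.len(), idx) };
    }
    println!("{:?}", c); // [2.0, 4.0, 6.0, 8.0]
}
```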
Internally, this first checks that a couple of things are right in the kernel:

- All parameters are `Copy`.
- The function is `unsafe`.
- The function does not return anything.
The first argument is the path to the root of the GPU crate you are trying to build, which would probably be `../name` in our case. The second method, `.copy_to(path)`, tells the builder to copy the built PTX file somewhere. By default the builder puts the PTX file inside of `target/cuda-builder/nvptx64-nvidia-cuda/release/crate_name.ptx`, but it is usually helpful to copy it to another path, which is what this method does. Finally, `build()` actually runs rustc to compile the crate. This may take a while, since it needs to build things like `core` from scratch, but after the first compile, incremental compilation will make it much faster.
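A hedged sketch of the `build.rs` this paragraph describes, using the method names mentioned above (`CudaBuilder` is the `cuda_builder` entry point; `../name` and the output path are placeholders):

```rust
// build.rs sketch for the CPU crate, per the description above.
use cuda_builder::CudaBuilder;

fn main() {
    CudaBuilder::new("../name")           // root of the GPU crate to build
        .copy_to("../resources/name.ptx") // where to copy the built PTX
        .build()
        .unwrap();
}
```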
There is also a [Dockerfile](Dockerfile) prepared as a quickstart, with all the necessary libraries for base CUDA development.

You can use it as follows (assuming your clone of Rust-CUDA is at the absolute path `$RUST_CUDA`):

- Ensure you have Docker set up to [use gpus](https://docs.docker.com/config/containers/resource_constraints/#gpu).
- Build: `docker build -t rust-cuda $RUST_CUDA`
- Run: `docker run -it --gpus all -v $RUST_CUDA:/root/rust-cuda --entrypoint /bin/bash rust-cuda`
- Running will drop you into the container's shell, where you will find the project at `~/rust-cuda`.
- If all is well, you'll be able to `cargo run` in `~/rust-cuda/examples/cuda/cpu/add`.

**Notes:**

1. Refer to [rust-toolchain](#rust-toolchain) to ensure you are using the correct toolchain in your project.
2. Despite using Docker, your machine will still need to be running a compatible driver; in this case, for CUDA 11.4.1 it is `>=470.57.02`.
3. If you have issues within the container, it can help to start by ensuring your GPU is recognized:
   - Ensure `nvidia-smi` provides meaningful output in the container.
   - NVIDIA provides a number of samples at <https://github.com/NVIDIA/cuda-samples>. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well, you should see many details about your GPU.