Commit 5587a79

Merge pull request #2524 from rust-lang/offload-device
add gpu device side instructions
2 parents c221508 + 3a8808a commit 5587a79

3 files changed: +114 -28 lines changed

src/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -103,6 +103,7 @@
 - [The `rustdoc-json` test suite](./rustdoc-internals/rustdoc-json-test-suite.md)
 - [GPU offload internals](./offload/internals.md)
 - [Installation](./offload/installation.md)
+- [Usage](./offload/usage.md)
 - [Autodiff internals](./autodiff/internals.md)
 - [Installation](./autodiff/installation.md)
 - [How to debug](./autodiff/debugging.md)

src/offload/installation.md

Lines changed: 1 addition & 28 deletions
@@ -1,6 +1,6 @@
 # Installation
 
-In the future, `std::offload` should become available in nightly builds for users. For now, everyone still needs to build rustc from source.
+`std::offload` is partly available in nightly builds. For now, however, everyone still needs to build rustc from source to use all of its features.
 
 ## Build instructions
 
@@ -42,30 +42,3 @@ run
 ```
 ./x test --stage 1 tests/codegen-llvm/gpu_offload
 ```
-
-## Usage
-It is important to use a clang compiler build on the same llvm as rustc. Just calling clang without the full path will likely use your system clang, which probably will be incompatible.
-```
-/absolute/path/to/rust/build/x86_64-unknown-linux-gnu/stage1/bin/rustc --edition=2024 --crate-type cdylib src/main.rs --emit=llvm-ir -O -C lto=fat -Cpanic=abort -Zoffload=Enable
-/absolute/path/to/rust/build/x86_64-unknown-linux-gnu/llvm/bin/clang++ -fopenmp --offload-arch=native -g -O3 main.ll -o main -save-temps
-LIBOMPTARGET_INFO=-1 ./main
-```
-The first step will generate a `main.ll` file, which has enough instructions to cause the offload runtime to move data to and from a gpu.
-The second step will use clang as the compilation driver to compile our IR file down to a working binary. Only a very small Rust subset will work out of the box here, unless
-you use features like build-std, which are not covered by this guide. Look at the codegen test to get a feeling for how to write a working example.
-In the last step you can run your binary, if all went well you will see a data transfer being reported:
-```
-omptarget device 0 info: Entering OpenMP data region with being_mapper at unknown:0:0 with 1 arguments:
-omptarget device 0 info: tofrom(unknown)[1024]
-omptarget device 0 info: Creating new map entry with HstPtrBase=0x00007fffffff9540, HstPtrBegin=0x00007fffffff9540, TgtAllocBegin=0x0000155547200000, TgtPtrBegin=0x0000155547200000, Size=1024, DynRefCount=1, HoldRefCount=0, Name=unknown
-omptarget device 0 info: Copying data from host to device, HstPtr=0x00007fffffff9540, TgtPtr=0x0000155547200000, Size=1024, Name=unknown
-omptarget device 0 info: OpenMP Host-Device pointer mappings after block at unknown:0:0:
-omptarget device 0 info: Host Ptr Target Ptr Size (B) DynRefCount HoldRefCount Declaration
-omptarget device 0 info: 0x00007fffffff9540 0x0000155547200000 1024 1 0 unknown at unknown:0:0
-// some other output
-omptarget device 0 info: Exiting OpenMP data region with end_mapper at unknown:0:0 with 1 arguments:
-omptarget device 0 info: tofrom(unknown)[1024]
-omptarget device 0 info: Mapping exists with HstPtrBegin=0x00007fffffff9540, TgtPtrBegin=0x0000155547200000, Size=1024, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
-omptarget device 0 info: Copying data from device to host, TgtPtr=0x0000155547200000, HstPtr=0x00007fffffff9540, Size=1024, Name=unknown
-omptarget device 0 info: Removing map entry with HstPtrBegin=0x00007fffffff9540, TgtPtrBegin=0x0000155547200000, Size=1024, Name=unknown
-```

src/offload/usage.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
+# Usage
+
+This feature is a work in progress and not ready for general use. The instructions here are for contributors, or people interested in following the latest progress.
+We are currently working on launching the following Rust kernel on the GPU. To follow along, copy it to a `src/lib.rs` file.
+
+```rust
+#![feature(abi_gpu_kernel)]
+#![no_std]
+
+#[cfg(target_os = "linux")]
+extern crate libc;
+#[cfg(target_os = "linux")]
+use libc::c_char;
+
+use core::mem;
+
+#[panic_handler]
+fn panic(_: &core::panic::PanicInfo) -> ! {
+    loop {}
+}
+
+#[cfg(target_os = "linux")]
+#[unsafe(no_mangle)]
+#[inline(never)]
+fn main() {
+    let array_c: *mut [f64; 256] =
+        unsafe { libc::calloc(256, (mem::size_of::<f64>()) as libc::size_t) as *mut [f64; 256] };
+    let output = c"The first element is zero %f\n";
+    let output2 = c"The first element is NOT zero %f\n";
+    let output3 = c"The second element is %f\n";
+    unsafe {
+        let val: *const c_char = if (*array_c)[0] < 0.1 {
+            output.as_ptr()
+        } else {
+            output2.as_ptr()
+        };
+        libc::printf(val, (*array_c)[0]);
+    }
+
+    unsafe {
+        kernel_1(array_c);
+    }
+    core::hint::black_box(&array_c);
+    unsafe {
+        let val: *const c_char = if (*array_c)[0] < 0.1 {
+            output.as_ptr()
+        } else {
+            output2.as_ptr()
+        };
+        libc::printf(val, (*array_c)[0]);
+        libc::printf(output3.as_ptr(), (*array_c)[1]);
+    }
+}
+
+#[cfg(target_os = "linux")]
+unsafe extern "C" {
+    pub fn kernel_1(array_b: *mut [f64; 256]);
+}
+```
+
+## Compile instructions
+It is important to use a clang compiler built against the same LLVM as rustc. Just calling clang without the full path will likely use your system clang, which will probably be incompatible. So either substitute the clang/lld invocations below with absolute paths, or set your `PATH` accordingly.
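A minimal sketch of the `PATH` option, assuming your checkout lives at the placeholder `/absolute/path/to/rust`. The variable names `RUST_DIR`, `LLVM_BIN` and `LLD_BIN` are illustrative; the directory layout is copied from the build paths that appear in the commands on this page and may differ on your machine.

```shell
# Assumption: RUST_DIR is a placeholder for your rust checkout.
RUST_DIR="${RUST_DIR:-/absolute/path/to/rust}"
# Directory layout as it appears in the build paths used in this guide.
LLVM_BIN="$RUST_DIR/build/x86_64-unknown-linux-gnu/llvm/bin"
LLD_BIN="$RUST_DIR/build/x86_64-unknown-linux-gnu/lld/bin"
# Prepend, so these tools win over any system-wide clang or lld.
export PATH="$LLVM_BIN:$LLD_BIN:$PATH"
# Show which clang the shell resolves now (prints a note if none is found).
command -v clang || echo "clang not found on PATH"
```

If the printed path does not point into your rust build directory, the system clang is still being picked up.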
+
+First we generate the host (CPU) code. The first build only compiles libc; take note of the hashed path. Then we call rustc directly to build our host code, while providing the libc artifact to rustc.
+```
+cargo +offload build -r -v
+rustc +offload --edition 2024 src/lib.rs -g --crate-type cdylib -C opt-level=3 -C panic=abort -C lto=fat -L dependency=/absolute_path_to/target/release/deps --extern libc=/absolute_path_to/target/release/deps/liblibc-<HASH>.rlib --emit=llvm-bc,llvm-ir -Zoffload=Enable -Zunstable-options
+```
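Instead of reading the hashed libc path out of the verbose cargo output, you can look it up on disk. The following is a sketch only: `find_libc_rlib` and `LIBC_RLIB` are hypothetical names, and it assumes cargo placed the artifact in `target/release/deps` as in the rustc command above.

```shell
# Hypothetical helper: print the path of the libc rlib inside a deps directory.
# $1: the deps directory, e.g. /absolute_path_to/target/release/deps
find_libc_rlib() {
    find "$1" -maxdepth 1 -name 'liblibc-*.rlib' 2>/dev/null | head -n 1
}

# The result is the value to pass to `--extern libc=...`.
LIBC_RLIB="$(find_libc_rlib "$PWD/target/release/deps")"
echo "libc rlib: ${LIBC_RLIB:-<not found, run the cargo build first>}"
```

If several `liblibc-*.rlib` files exist from earlier builds, this picks only the first match, so clean the target directory or check the hash by hand in that case.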
+
+Now we generate the device code. Replace the `target-cpu` value with the right one for your GPU.
+```
+RUSTFLAGS="-Ctarget-cpu=gfx90a --emit=llvm-bc,llvm-ir" cargo +offload build -Zunstable-options -r -v --target amdgcn-amd-amdhsa -Zbuild-std=core
+```
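To avoid editing the architecture name in several places, it can be kept in one variable. This is a sketch under assumptions: `device_rustflags` and `GPU_ARCH` are hypothetical names, and the flags are copied from the command above.

```shell
# Hypothetical helper: build the RUSTFLAGS value for a given GPU architecture.
# $1: the target-cpu name, e.g. gfx90a as used in this guide
device_rustflags() {
    printf '%s' "-Ctarget-cpu=$1 --emit=llvm-bc,llvm-ir"
}

# Assumption: GPU_ARCH is an illustrative variable; it defaults to the guide's gfx90a.
GPU_ARCH="${GPU_ARCH:-gfx90a}"
echo "RUSTFLAGS=\"$(device_rustflags "$GPU_ARCH")\""
```

The device build would then be invoked as `RUSTFLAGS="$(device_rustflags "$GPU_ARCH")" cargo +offload build -Zunstable-options -r -v --target amdgcn-amd-amdhsa -Zbuild-std=core`. The same architecture name also appears in the later commands on this page (`arch=gfx90a`, `--should-extract=gfx90a`, and the `bare.amdgcn.gfx90a.img*` cleanup), so it needs to be replaced there as well.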
+
+Now find the `<libname>.ll` file under the `target/amdgcn-amd-amdhsa` folder and copy it to a `device.ll` file (or adjust the file names below).
+If you work on an NVIDIA or Intel GPU, please adjust the names accordingly and open an issue to share your results (whether you succeed or fail).
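Locating and copying the device IR can be scripted. The sketch below uses a hypothetical helper, `copy_device_ir`. It filters by crate name because, with `--emit` set through `RUSTFLAGS` and `-Zbuild-std=core`, other crates such as `core` likely emit `.ll` files too (an assumption worth checking in your target directory).

```shell
# Hypothetical helper: copy the crate's device IR to a fixed file name.
# $1: device target dir, e.g. target/amdgcn-amd-amdhsa
# $2: library (crate) name, the <libname> from the text
# $3: destination, e.g. device.ll
copy_device_ir() {
    ir_file="$(find "$1" -name "$2*.ll" | head -n 1)"
    if [ -z "$ir_file" ]; then
        echo "no $2*.ll found under $1" >&2
        return 1
    fi
    cp "$ir_file" "$3"
}
```

For example, `copy_device_ir target/amdgcn-amd-amdhsa <libname> device.ll`, with `<libname>` replaced by your crate's name.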
+First we compile our .ll files (good for manual inspection) to .bc files and clean up leftover artifacts. The cleanup is important; otherwise, caching might interfere with subsequent runs.
+```
+opt lib.ll -o lib.bc
+opt device.ll -o device.bc
+rm *.o
+rm bare.amdgcn.gfx90a.img*
+```
+
+```
+"clang-offload-packager" "-o" "host.out" "--image=file=device.bc,triple=amdgcn-amd-amdhsa,arch=gfx90a,kind=openmp"
+
+"clang-21" "-cc1" "-triple" "x86_64-unknown-linux-gnu" "-S" "-save-temps=cwd" "-disable-free" "-clear-ast-before-backend" "-main-file-name" "lib.rs" "-mrelocation-model" "pic" "-pic-level" "2" "-pic-is-pie" "-mframe-pointer=all" "-fmath-errno" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-funwind-tables=2" "-target-cpu" "x86-64" "-tune-cpu" "generic" "-resource-dir" "/<ABSOLUTE_PATH_TO>/rust/build/x86_64-unknown-linux-gnu/llvm/lib/clang/21" "-ferror-limit" "19" "-fopenmp" "-fopenmp-offload-mandatory" "-fgnuc-version=4.2.1" "-fskip-odr-check-in-gmf" "-fembed-offload-object=host.out" "-fopenmp-targets=amdgcn-amd-amdhsa" "-faddrsig" "-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o" "host.s" "-x" "ir" "lib.bc"
+
+"clang-21" "-cc1as" "-triple" "x86_64-unknown-linux-gnu" "-filetype" "obj" "-main-file-name" "lib.rs" "-target-cpu" "x86-64" "-mrelocation-model" "pic" "-o" "host.o" "host.s"
+
+"clang-linker-wrapper" "--should-extract=gfx90a" "--device-compiler=amdgcn-amd-amdhsa=-g" "--device-compiler=amdgcn-amd-amdhsa=-save-temps=cwd" "--device-linker=amdgcn-amd-amdhsa=-lompdevice" "--host-triple=x86_64-unknown-linux-gnu" "--save-temps" "--linker-path=/ABSOLUTE_PATH_TO/rust/build/x86_64-unknown-linux-gnu/lld/bin/ld.lld" "--hash-style=gnu" "--eh-frame-hdr" "-m" "elf_x86_64" "-pie" "-dynamic-linker" "/lib64/ld-linux-x86-64.so.2" "-o" "bare" "/lib/../lib64/Scrt1.o" "/lib/../lib64/crti.o" "/ABSOLUTE_PATH_TO/crtbeginS.o" "-L/ABSOLUTE_PATH_TO/rust/build/x86_64-unknown-linux-gnu/llvm/bin/../lib/x86_64-unknown-linux-gnu" "-L/ABSOLUTE_PATH_TO/rust/build/x86_64-unknown-linux-gnu/llvm/lib/clang/21/lib/x86_64-unknown-linux-gnu" "-L/lib/../lib64" "-L/usr/lib64" "-L/lib" "-L/usr/lib" "host.o" "-lstdc++" "-lm" "-lomp" "-lomptarget" "-L/ABSOLUTE_PATH_TO/rust/build/x86_64-unknown-linux-gnu/llvm/lib" "-lgcc_s" "-lgcc" "-lpthread" "-lc" "-lgcc_s" "-lgcc" "/ABSOLUTE_PATH_TO/crtendS.o" "/lib/../lib64/crtn.o"
+```
+
+Especially for the last command, I recommend not fixing the paths by hand, but rather regenerating them by copying a bare-mode OpenMP example and compiling it with your clang. By adding `-###` to your clang invocation, you can see the individual steps.
+```
+myclang++ -fuse-ld=lld -O3 -fopenmp -fopenmp-offload-mandatory --offload-arch=gfx90a omp_bare.cpp -o main -###
+```
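The `-###` listing is printed to stderr, mixed with version and target information. A small sketch for keeping only the tool invocations follows. `extract_steps` and `steps.txt` are hypothetical names, and the filter assumes each invocation line starts with a quoted program path, as in the commands shown earlier on this page.

```shell
# Hypothetical helper: keep only the quoted tool invocations from captured -### output.
# $1: a file holding what clang printed for the -### invocation
extract_steps() {
    grep -E '^[[:space:]]*"' "$1"
}

# Example (not run here): capture stderr first, then filter it.
#   myclang++ ... omp_bare.cpp -o main -### 2> steps.txt
#   extract_steps steps.txt
```

Each remaining line is one step (packager, `-cc1`, `-cc1as`, linker wrapper) that can be adapted to the `lib.bc` and `device.bc` inputs.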
+
+In the final step, you can now run your binary:
+
+```
+./main
+The first element is zero 0.000000
+The first element is NOT zero 21.000000
+The second element is 0.000000
+```
+
+To receive more information about the memory transfer, you can enable info printing with:
+```
+LIBOMPTARGET_INFO=-1 ./main
+```
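`-1` turns on every category of info output. `LIBOMPTARGET_INFO` is a bit field, so a narrower selection is possible. The sketch below relies on bit values as I understand them from the LLVM OpenMP runtime documentation, so treat them as an assumption and verify against your LLVM version. The `INFO_*` variable names are illustrative.

```shell
# Assumed bit values (check the LLVM OpenMP runtime documentation for your version).
INFO_KERNEL_ARGS=$((0x01))    # print data arguments on entering a device kernel
INFO_DATA_TRANSFER=$((0x20))  # report copies to and from the device
INFO_FLAGS=$((INFO_KERNEL_ARGS | INFO_DATA_TRANSFER))
echo "LIBOMPTARGET_INFO=$INFO_FLAGS"
# Run the binary only if it exists in the current directory.
if [ -x ./main ]; then
    LIBOMPTARGET_INFO="$INFO_FLAGS" ./main
fi
```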
