CUDA 12.8.1 and LLVM 18.1.8 #197


Open

realJohnDee opened this issue Apr 13, 2025 · 17 comments

@realJohnDee

CUDA 12.8 Update 1 Release Notes state:
libNVVM - Compilation of compute capabilities compute_100 and greater (Blackwell and future architectures) uses an updated NVVM IR dialect, based on LLVM 18.1.8 IR (the “modern” dialect)

As I have a Blackwell card, I'll be attempting to implement support for LLVM 18.1.8 with the new pass manager in rustc_codegen_nvvm/rustc_llvm_wrapper/PassWrapper.cpp

But my question is: would you like to preserve compatibility with very old LLVM versions? For example, by guarding code with #if LLVM_VERSION_LT(18, 1).
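For context, a guard along these lines is what I have in mind. This is only a sketch: the macro names and definitions are my assumption, modeled on the conventions in rustc's llvm-wrapper, with LLVM_VERSION_MAJOR/MINOR coming from llvm-config.h.

```cpp
#include "llvm/Config/llvm-config.h" // LLVM_VERSION_MAJOR / LLVM_VERSION_MINOR

// Version-comparison macros; names assumed, modeled on rustc's llvm-wrapper.
#define LLVM_VERSION_GE(major, minor)                                        \
  (LLVM_VERSION_MAJOR > (major) ||                                           \
   (LLVM_VERSION_MAJOR == (major) && LLVM_VERSION_MINOR >= (minor)))
#define LLVM_VERSION_LT(major, minor) (!LLVM_VERSION_GE((major), (minor)))

#if LLVM_VERSION_LT(18, 1)
// legacy pass manager path (LLVM 7 era)
#else
// new pass manager path (LLVM 18+)
#endif
```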

The passes with the new PassManager in LLVM 18 are vastly different, and it would complicate things to preserve compatibility with very old LLVM versions; the purpose of doing so is questionable, as most people will download the latest CUDA toolkit.
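To give a concrete idea of what the new pass manager side looks like, here is a minimal, untested sketch of driving it through PassBuilder in LLVM 18 (default O3 pipeline only; the real wrapper would also need to thread through target-specific and NVVM-related passes):

```cpp
#include "llvm/Analysis/CGSCCPassManager.h"
#include "llvm/Analysis/LoopAnalysisManager.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Target/TargetMachine.h"

// Minimal new-pass-manager (LLVM 18) setup: build the default O3 pipeline and run it.
static void optimizeModule(llvm::Module &M, llvm::TargetMachine *TM) {
  llvm::LoopAnalysisManager LAM;
  llvm::FunctionAnalysisManager FAM;
  llvm::CGSCCAnalysisManager CGAM;
  llvm::ModuleAnalysisManager MAM;

  llvm::PassBuilder PB(TM);
  PB.registerModuleAnalyses(MAM);
  PB.registerCGSCCAnalyses(CGAM);
  PB.registerFunctionAnalyses(FAM);
  PB.registerLoopAnalyses(LAM);
  PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

  llvm::ModulePassManager MPM =
      PB.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O3);
  MPM.run(M, MAM);
}
```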

In addition, LLVM aims to preserve backwards IR compatibility:
https://llvm.org/docs/DeveloperPolicy.html#ir-backwards-compatibility

In my experimentation I'm deleting a lot of backwards-compatibility code, so I thought I'd ask whether the effort will be worth it, or whether the Rust-CUDA project will focus on LLVM 7 going forward.

Thank you, Rust CUDA is exactly what we need.

@LegNeato
Contributor

Awesome! Yes, we'll want to keep llvm 7 so that current cards continue to work. Blackwell+ only is cutting out way too many devices. I am pretty sure LLVM 18.1.8 cannot generate both, right?

@jorge-ortega
Collaborator

Supporting Blackwell is a far more difficult task than just updating the pass wrapper. TBH, I have no idea what the pass wrapper currently does, as I haven't really figured out how the nvvm codegen uses the externs from the pass wrapper.

LLVM's IR backwards compatibility does not apply to libNVVM. The NVVM docs explicitly call out that the modern dialect, based on LLVM 18, is only for Blackwell and later architectures. At the moment, I don't see how we can do this without dynamically linking to the LLVM version needed or having two different backends: one that statically links to LLVM 7 and the other to LLVM 18. Both would require reworking how the backend calls LLVM APIs, in addition to accounting for the differences between opaque pointers in 18 and typed pointers in 7.

@realJohnDee
Author

Awesome! Yes, we'll want to keep llvm 7 so that current cards continue to work. Blackwell+ only is cutting out way too many devices. I am pretty sure LLVM 18.1.8 cannot generate both, right?

I don't know; compiler infrastructure is way over my head, with many new concepts for me. From my understanding, the TargetMachine and pass manager (PassBuilder) in LLVM generate architecture-specific code via the "-mcpu=sm_XYZ" flag, not just for sm_100 (Blackwell), meaning that a later LLVM can generate code for all previous architectures.
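For illustration, here is roughly how I picture the architecture being selected through the C++ API. This is only a sketch of my understanding (untested; feature string, options, and error handling kept minimal):

```cpp
#include "llvm/MC/TargetRegistry.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Target/TargetMachine.h"
#include <string>

// Sketch: build an NVPTX TargetMachine for a given compute capability,
// e.g. smArch = "sm_86" or "sm_120".
llvm::TargetMachine *makeNvptxTargetMachine(const std::string &smArch) {
  llvm::InitializeAllTargetInfos();
  llvm::InitializeAllTargets();
  llvm::InitializeAllTargetMCs();
  llvm::InitializeAllAsmPrinters();

  const std::string triple = "nvptx64-nvidia-cuda";
  std::string error;
  const llvm::Target *target = llvm::TargetRegistry::lookupTarget(triple, error);
  if (!target)
    return nullptr;

  llvm::TargetOptions options;
  return target->createTargetMachine(triple, /*CPU=*/smArch, /*Features=*/"",
                                     options, llvm::Reloc::Static);
}
```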

I have no idea what the pass wrapper currently does, as I haven't really figured out how the nvvm codegen uses the externs from the pass wrapper.

I don't know either, I just started looking at it.

I don't see how we can do this without dynamically linking to the LLVM version needed or having two different backends

Ah I see, so the IR dialect changed from 2.0 (LLVM 7) to the current 4.0 (LLVM 18) without backward compatibility, and this changed the behavior of pointers, requiring new FFI bindings.

But from what I understand, CUDA 12.8.1 and LLVM 18.1.8 do not need to generate IR 2.0 to be backward compatible; they generate IR 4.0 tailored for legacy architectures, because the resulting PTX code will be able to run on all devices.

"CUDA 12.8.1 standardizes on NVVM IR 4.0, using the NVPTX backend to adapt IR for older GPUs", this occurs through the pass manager that has NVPTX passes.

So we could deal with opaque pointers in the FFI because we'd be targeting IR 4.0, which the NVPTX passes in the LLVM 18 pass manager would then optimize down to older PTX instructions.

In essence, rust-cuda could standardize on CUDA 12, use LLVM 18 with opaque pointers, and let the pass manager generate the correct PTX based on the target architecture (TargetMachine.h, -mcpu=sm_XYZ).

But correct me if I don't understand this correctly. We need to investigate further for a definitive answer.

@LegNeato
Contributor

Ok, that was not my understanding but perhaps I am wrong. I did previously miss that point about older GPUs!

@jorge-ortega
Collaborator

Ah I see, so the IR dialect changed from 2.0 (LLVM 7) to the current 4.0 (LLVM 18) without backward compatibility, and this changed the behavior of pointers, requiring new FFI bindings.

NVVM IR is LLVM IR with some tweaks. NVVM in CUDA 12.8 did not change or remove the old dialect. The old LLVM 7-based dialect is still accepted, which is how the codegen still supports CUDA 12.8 today. For Blackwell specifically, a new dialect based on LLVM 18 was introduced.

libNVVM doesn't really care how you generate the IR. You could give it handwritten text IR, but the obvious choice is to use LLVM's IRBuilder APIs to generate it. This is what the codegen does today to produce the IR it passes to NVVM (along with the target compute architecture), which then generates PTX. From what I understand, the current pass manager only modifies the IR generated by the codegen for some optimizations before passing it to NVVM.
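Roughly, the flow looks like this from libNVVM's side. This is just a sketch using the libNVVM C API (the real codegen goes through FFI bindings, and every nvvmResult should be checked):

```cpp
#include <nvvm.h>
#include <string>
#include <vector>

// Sketch of the libNVVM flow: hand it IR plus an arch option, get PTX back.
std::string irToPtx(const std::string &irText,
                    const char *archOption /* e.g. "-arch=compute_70" */) {
  nvvmProgram prog;
  nvvmCreateProgram(&prog);
  nvvmAddModuleToProgram(prog, irText.c_str(), irText.size(), "module");

  const char *options[] = {archOption};
  nvvmCompileProgram(prog, 1, options); // error handling omitted

  size_t ptxSize = 0;
  nvvmGetCompiledResultSize(prog, &ptxSize);
  std::vector<char> ptx(ptxSize);
  nvvmGetCompiledResult(prog, ptx.data());

  nvvmDestroyProgram(&prog);
  return std::string(ptx.data());
}
```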

To do the same for Blackwell support, we would use the LLVM 18 IRBuilder APIs, and that's the only reason we would need new FFI bindings. The difference between opaque and typed pointers is due to changes in the IR between 7 and 18. It only affects where the codegen needs to insert pointer casts when generating LLVM 7 IR that would not be needed in 18.
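As a small illustration of that pointer difference (a sketch, not code from the codegen):

```cpp
#include "llvm/IR/IRBuilder.h"

// With opaque pointers (LLVM 18), a load only states the result type; the
// pointer itself is the untyped `ptr` type and needs no cast.
llvm::Value *loadI32(llvm::IRBuilder<> &B, llvm::Value *anyPtr) {
  // Under LLVM 7's typed pointers the codegen would first have to insert
  // something like: anyPtr = B.CreateBitCast(anyPtr, i32PtrTy);
  return B.CreateLoad(B.getInt32Ty(), anyPtr);
}
```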

It is worth noting that LLVM supports PTX as a target, which is another way we could generate PTX. This would mean we drop libNVVM and use LLVM's NVPTX backend directly, wiring up the passes to generate PTX for any compute target. This is the only way I see where we wouldn't need both LLVM 7 and 18.
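A sketch of what that direct route could look like (untested; the NVPTX backend emits PTX as its "assembly" output):

```cpp
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"

// Sketch: ask an nvptx64 TargetMachine to emit assembly, which for NVPTX is the PTX text.
bool emitPtx(llvm::Module &M, llvm::TargetMachine &TM, llvm::raw_pwrite_stream &out) {
  llvm::legacy::PassManager PM;
  if (TM.addPassesToEmitFile(PM, out, /*DwoOut=*/nullptr,
                             llvm::CodeGenFileType::AssemblyFile))
    return false; // the target can't emit this file type
  PM.run(M);
  return true;
}
```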

But from what I understand, CUDA 12.8.1 and LLVM 18.1.8 do not need to generate IR 2.0 to be backward compatible; they generate IR 4.0 tailored for legacy architectures, because the resulting PTX code will be able to run on all devices.

Yes, but only generating the modern dialect would mean we would only be able to support CUDA 12.8.

"CUDA 12.8.1 standardizes on NVVM IR 4.0, using the NVPTX backend to adapt IR for older GPUs", this occurs through the pass manager that has NVPTX passes.

I don't find this in the CUDA release notes. Could you provide the reference?

Happy to chat further about any design work for this and to answer any questions to help you get familiar with how the codegen works today.

@adamcavendish
Contributor

To complement the story, we would also need to add a huge number of instructions to our manually added IR to support TensorCore so we can support recent hardware. The TensorCore instructions are not CUDA at all lol.

@realJohnDee
Author

To complement the story, we would also need to add a huge number of instructions to our manually added IR to support TensorCore so we can support recent hardware. The TensorCore instructions are not CUDA at all

Happy to chat further about any design work for this and to answer any questions to help you get familiar with how the codegen works today.

I'm sorry for my confusion; I didn't understand that NVVM IR differed from LLVM IR. I was thinking in LLVM terms, where the latest dialect of the intermediate representation is compiled to machine code for different architectures, but this doesn't seem to be the case: libNVVM uses a proprietary IR that is only based on LLVM IR and breaks backwards compatibility, in contrast to the LLVM philosophy.

So libNVVM does need to be given the older version of the proprietary NVVM IR to work with pre-Blackwell cards, as it's not standard LLVM IR. Did I get this right?

That makes the effort to support Blackwell too big, as I thought we could just use standard LLVM APIs and then link to libNVVM.

I guess you can close this issue if there isn't anything else to sort out.

@LegNeato
Contributor

Yeah, the key is:

"CUDA 12.8.1 standardizes on NVVM IR 4.0, using the NVPTX backend to adapt IR for older GPUs", this occurs through the pass manager that has NVPTX passes."

None of us have seen this, and it would change the version story here. Where was this found? Or was it an AI hallucination? (no worries if so!)

@realJohnDee
Author

Yep, something along the lines of Deep Research :)

The release notes do not explicitly list a reference for "CUDA 12.8.1 standardizes on NVVM IR 4.0, using the NVPTX backend to adapt IR for older GPUs," as mentioned in the discussion. However, the behavior is inferred from the dialect support and PTX ISA updates, such as PTX 8.7 adding support for sm_120 (likely Blackwell), with new instructions like extensions to mma and cvt (PTX ISA 8.7 documentation).

@jorge-ortega
Collaborator

It is worth noting that LLVM supports PTX as a target, which is another way we could generate PTX. This would mean we drop libNVVM and use LLVM's NVPTX backend directly, wiring up the passes to generate PTX for any compute target. This is the only way I see where we wouldn't need both LLVM 7 and 18.

I realized that this is likely how mainline rustc supports the PTX target today. I don't know if it supports compute_100, though.

@realJohnDee
Author

realJohnDee commented Apr 15, 2025

https://github.com/llvm/llvm-project/blob/f46cea5b42ed4d05fd11dc1e693ddc6153769cde/llvm/lib/Target/NVPTX/NVPTX.td#L80

NVPTX supported processors.

This file lists sm_20 through sm_120/sm_120a, so all of them. sm_100 should be the server platforms, while sm_120 is Blackwell.

It looks like LLVM uses standard LLVM IR while libNVVM uses NVVM IR with extensions. This may simplify things further, as rustc already has an LLVM IR generation backend.

I had difficulty figuring out how to use the NVPTX backend because LLVM 18.1.8 doesn't seem to install the NVPTX headers for users, so I was copying the NVPTX headers into rust-cuda; that may have changed in later versions, if one can decouple LLVM from libNVVM.

So is there still hope for even using the latest LLVM that Rust uses? I'm not sure what avoiding libNVVM's optimizations would mean for performance.

@adamcavendish
Contributor

It is worth noting that LLVM supports PTX as a target, which is another way we could generate PTX. This would mean we drop libNVVM and use LLVM's NVPTX backend directly, wiring up the passes to generate PTX for any compute target. This is the only way I see where we wouldn't need both LLVM 7 and 18.

I realized that this is likely how mainline rustc supports the PTX target today. I don't know if it supports compute_100, though.

I think that "LLVM supports PTX as a target", it is used for allowing clang to compile and build CUDA files.

@realJohnDee
Author

realJohnDee commented Apr 15, 2025

I just compiled LLVM 20.1.2 to check if it exposes the headers to the user.

user@host /storage/src/misc/llvm $ find llvm-sources -type f -name NVPTX.h
llvm-sources/clang/lib/Basic/Targets/NVPTX.h
llvm-sources/llvm/lib/Target/NVPTX/NVPTX.h
user@host /storage/src/misc/llvm $ find llvm-install/include/ -type f | grep NVPTX
llvm-install/include/llvm/IR/IntrinsicsNVPTX.h
llvm-install/include/llvm/Support/NVPTXAddrSpace.h

It still doesn't install NVPTX.h, which is required to add the NVPTX passes to the pass manager.

user@host /storage/src/misc/llvm $ llvm-install/bin/opt --print-passes | grep -i nvptx

It shows that the NVPTX passes aren't registered by default, and access to NVPTX.h is required to add them. So I copied the target's headers into rust-cuda to be able to use the passes and link to LLVM. This is where I dropped the project, as I thought it would be impossible to implement with an LLVM other than 7.

I think that "LLVM supports PTX as a target", it is used for allowing clang to compile and build CUDA files.

Do you think one can still integrate with its IR infrastructure? The NVPTX.td, as posted earlier, seems to indicate that it supports all architectures and we may be able to add custom IR where NVVM has extensions.

@adamcavendish
Contributor

adamcavendish commented Apr 15, 2025


I think that "LLVM supports PTX as a target", it is used for allowing clang to compile and build CUDA files.

Do you think one can still integrate with its IR infrastructure? The NVPTX.td, as posted earlier, seems to indicate that it supports all architectures and we may be able to add custom IR where NVVM has extensions.

clang handles a large portion of CUDA code (by detecting .cu files or via the -x cuda driver option) while rustc doesn't. If the CUDA driver mode is enabled, clang is allowed to target the nvptx64-nvidia-cuda triple. The host C++ compiler is then used after the host code and device code are separated by either nvcc or clang.

Reference: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#the-cuda-compilation-trajectory

As indicated by the compilation trajectory, the IR work should be part of the frontend, not the backend. The best approach would be modifying the rustc frontend to support the various CUDA IR extensions, gated by something like __device__ function attributes.

@realJohnDee
Author

realJohnDee commented Apr 15, 2025

Yep, I understand it a bit better now. I wrote a benchmark and compared nvcc with the latest clang trunk.

/opt/cuda/bin/nvcc -std=c++20 --expt-relaxed-constexpr --expt-extended-lambda -Xcompiler -Wall,-Werror -O3 --use_fast_math -arch=sm_120 -I/opt/cuda/include -Iincludes -x cu -c benchmark/benchmark.cpp -o target/release/.objects/benchmark/benchmark.o
/opt/cuda/bin/nvcc -std=c++20 --expt-relaxed-constexpr --expt-extended-lambda -Xcompiler -Wall,-Werror -O3 --use_fast_math -arch=sm_120 -I/opt/cuda/include -Iincludes -L/opt/cuda/lib64 -Xlinker --no-undefined -lcudart -lnvidia-ml -o target/release/benchmark.bin target/release/.objects/benchmark/benchmark.o

/storage/src/misc/llvm/llvm-install/bin/clang++ -std=c++20 --cuda-path=/opt/cuda --cuda-gpu-arch=sm_120 -I/opt/cuda/include -Iincludes -O3 -ffast-math -c benchmark/benchmark.cu -o target/release/.objects/benchmark/benchmark.o
/storage/src/misc/llvm/llvm-install/bin/clang++ -std=c++20 --cuda-path=/opt/cuda --cuda-gpu-arch=sm_120 -L/opt/cuda/lib64 -lcudart -lnvidia-ml target/release/.objects/benchmark/benchmark.o -o target/release/benchmark.bin

Here are my findings. They show that the NVPTX backend is well supported up to and including Blackwell; it is both more and less optimized than nvcc depending on usage. So just using plain LLVM IR, as Rust already does, appears viable.

BF16 GFLOPS:
Clang: ~49.6% higher mean (211.8 vs. 141.5), with higher peaks (1015 vs. 303).
Interpretation: Clang’s NVPTX backend may generate better PTX for bfloat16 WMMA in some cases, possibly due to different register allocation or instruction scheduling. However, high variability (stddev 110 vs. 30) suggests inconsistent performance, likely from stream interactions or cache effects.

FP16 GFLOPS:
nvcc: ~163.8% higher mean (1017.5 vs. 385.6), despite similar peaks (~1025).
Interpretation: nvcc optimizes FP16 WMMA more consistently, leveraging Blackwell’s Tensor Cores better. Clang’s lower average and high variance (stddev 350 vs. 70) indicate potential issues with FP16 intrinsics or warp scheduling, possibly tied to partial CUDA 12.8 support.

Mem GOPS:
Equal: ~281.2 for both (stddev ~0.2–0.3).
Interpretation: Memory kernel performance is insensitive to compiler, as it’s bandwidth-bound (HBM3, 448 GB/s). Both generate similar PTX for global/shared memory accesses.

Power Draw:
Clang: ~7.3% higher (233.0W vs. 217.2W).
Interpretation: Clang’s codegen may use more registers or issue denser instructions, increasing power slightly. Neither hits 300W TDP, suggesting room for more stress (e.g., larger N, INT8 kernel).

Temperature:
Clang: ~4.5% higher (67.9°C vs. 65.0°C).
Interpretation: Reflects higher power draw, but both are safe (<90°C).

Now, I had never used CUDA before yesterday, so your input would be valuable on whether to proceed and how, perhaps by looking at ways of integrating this into the codegen backend somehow.

I noted that LLVM 19 didn't support sm_100; I had to compile LLVM 20, which did, and later trunk to test whether a bug was due to an unsupported LLVM, but it works overall.

@LegNeato
Contributor

FWIW, NVIDIA mentioned to me that, long-term, it is likely better to target NVVM rather than PTX directly.

@LegNeato
Contributor

But really, both should be supported as just different targets (kinda how it is now, where we support NVVM and the rustc backend only does PTX).
