CUDA black magic

Found in Triton.

FP8 is ~100 TFLOPS faster when the kernel name has "cutlass" in it.

You can reproduce it by running the commands below; there is no need to build Triton. The only difference between the gluon_attention and cutlass_gluon_attention PTX files is their function names.

wget https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvcc/linux-x86_64/cuda_nvcc-linux-x86_64-12.8.93-archive.tar.xz
tar -xf cuda_nvcc-linux-x86_64-12.8.93-archive.tar.xz

git clone https://github.com/OpenMLIR/cuda-magic
cd cuda-magic
cuda_nvcc-linux-x86_64-12.8.93-archive/bin/ptxas -lineinfo -v --gpu-name=sm_100a triton_cache/gluon_attention/attention_kernel.ptx -o gluon_attention.cubin
cuda_nvcc-linux-x86_64-12.8.93-archive/bin/ptxas -lineinfo -v --gpu-name=sm_100a triton_cache/cutlass_gluon_attention/attention_kernel.ptx -o cutlass_gluon_attention.cubin
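To check for yourself that the two PTX inputs differ only in function names, you can diff them line by line. The following is a minimal Python sketch and is not part of the repository: the helper name `changed_lines` is hypothetical, and the script assumes it is run from the cuda-magic checkout so that the two triton_cache paths used above exist.

```python
import difflib
from pathlib import Path


def changed_lines(a_text: str, b_text: str) -> list[str]:
    """Return the lines that differ, prefixed with '-' (first text) or '+' (second text)."""
    a = a_text.splitlines()
    b = b_text.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical runs are not interesting here
        out.extend("-" + line for line in a[i1:i2])
        out.extend("+" + line for line in b[j1:j2])
    return out


if __name__ == "__main__":
    # Paths taken from the ptxas commands above, relative to the checkout.
    plain = Path("triton_cache/gluon_attention/attention_kernel.ptx")
    renamed = Path("triton_cache/cutlass_gluon_attention/attention_kernel.ptx")
    if plain.exists() and renamed.exists():
        for line in changed_lines(plain.read_text(), renamed.read_text()):
            print(line)
    else:
        print("PTX files not found; run this from the cuda-magic checkout")
```

If the claim holds, every printed pair of lines should be identical apart from the kernel identifier.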

You can use ls -lh to compare the sizes of the two resulting .cubin files.
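Because ls -lh rounds sizes, a byte-exact comparison can be more convincing. The sketch below is an illustration only: the helper name `size_and_digest` is hypothetical, and it assumes the two .cubin files produced by the ptxas commands above are in the current directory.

```python
import hashlib
from pathlib import Path


def size_and_digest(path):
    """Return (size in bytes, SHA-256 hex digest) for the file at `path`."""
    data = Path(path).read_bytes()
    return len(data), hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    # Output names taken from the -o arguments of the ptxas commands above.
    for name in ("gluon_attention.cubin", "cutlass_gluon_attention.cubin"):
        if Path(name).exists():
            size, digest = size_and_digest(name)
            print(f"{name}: {size} bytes, sha256 {digest[:16]}")
        else:
            print(f"{name}: not found (run the ptxas commands first)")
```

The kernel name is embedded in the cubin, so the digests should differ in any case; the size gap is likely the more informative number.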
