I have not worked with CUDA.WMMA before, but perhaps the CUDA extension could define a method that uses the tensor cores, instead of relying on mma!, which I assume only runs on the regular CUDA cores.
I did not see any mention of MMA/WMMA in KA.jl.