For legacy reasons (OpenMP was the first backend), launcher code generation is awkward and buggy for device kernels: CUDA, HIP, OpenCL, Metal.
The problems are:
- Historically, launcher code is derived from the generated backend code and inherits all the internal details that make sense only on the device side (user functions, intrinsics, internal data structures);
- This makes it hard to write an abstract implementation, so abstractions end up leaking into the launcher;
- This legacy also forces the new transpiler to add dirty workarounds that complicate the code, with no benefit to the user or to a clean, maintainable codebase.
For these reasons, we recommend that the OCCA community get rid of this legacy and generate clean, minimal launcher code that just runs the device kernel.
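
To make "clean, minimal launcher" concrete, here is a rough sketch of what such generated host code could look like. The `Dim3` and `DeviceKernel` types, the `addVectorsLauncher` name, and the tile size of 16 are all hypothetical stand-ins for illustration, not the real OCCA types or the transpiler's actual output. The point is only that the launcher computes the launch dimensions and forwards the arguments, with no device-side functions, intrinsics, or internal data structures in sight.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for a launch-dimension type (not the real OCCA type).
struct Dim3 {
  std::size_t x = 1, y = 1, z = 1;
};

// Hypothetical handle to an already compiled device kernel.
// A real backend would enqueue the kernel on a stream or queue; here the
// `enqueue` callback stands in for that step so the sketch is self-contained.
struct DeviceKernel {
  Dim3 outer, inner;
  std::function<void(const Dim3&, const Dim3&, int,
                     const float*, const float*, float*)> enqueue;

  void setRunDims(const Dim3& o, const Dim3& i) {
    outer = o;
    inner = i;
  }

  void run(int n, const float* a, const float* b, float* ab) const {
    enqueue(outer, inner, n, a, b, ab);
  }
};

// Sketch of a minimal generated launcher: compute the launch dimensions,
// set them on the kernel handle, and forward the arguments. Nothing else.
inline void addVectorsLauncher(DeviceKernel& kernel, int entries,
                               const float* a, const float* b, float* ab) {
  assert(entries >= 0);
  const std::size_t tile = 16;  // assumed inner (work-group) size
  Dim3 outer, inner;
  // Round up so that every entry is covered by some outer block.
  outer.x = (static_cast<std::size_t>(entries) + tile - 1) / tile;
  inner.x = tile;
  kernel.setRunDims(outer, inner);
  kernel.run(entries, a, b, ab);
}
```

With a launcher of this shape, everything backend-specific stays behind the kernel handle, so the same launcher structure could in principle serve CUDA, HIP, OpenCL, and Metal.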