Description
Automatically determine which native features a kernel actually needs and skip unused C++ headers during JIT compilation, for both the CPU (Clang) and CUDA (NVRTC) paths.
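One way to picture the idea is the minimal sketch below. It is not Warp's actual implementation. It assumes the code generator already knows which builtins a kernel calls, and it maps them to per-feature guard macros emitted ahead of builtin.h. The `FEATURE_BUILTINS` table, the `WP_ENABLE_*` macro names, and both helper functions are hypothetical and exist only for illustration.

```python
# Hypothetical table: which builtins imply which native feature header.
# The grouping is illustrative and not taken from Warp's source.
FEATURE_BUILTINS = {
    "mesh": {"mesh_query_point", "mesh_query_ray"},
    "volume": {"volume_sample", "volume_lookup"},
    "noise": {"noise", "pnoise", "curlnoise"},
    "tile": {"tile_load", "tile_store", "tile_matmul"},
}


def required_features(used_builtins):
    """Return the sorted list of features needed by the given builtin names."""
    used = set(used_builtins)
    return sorted(
        feature
        for feature, names in FEATURE_BUILTINS.items()
        if used & names  # at least one builtin of this feature is referenced
    )


def header_prelude(used_builtins):
    """Build the preprocessor prelude placed before the builtin.h include.

    Each feature gets a hypothetical WP_ENABLE_<FEATURE> macro set to 1 or 0,
    so that builtin.h could guard the matching include with #if.
    """
    needed = set(required_features(used_builtins))
    lines = []
    for feature in sorted(FEATURE_BUILTINS):
        flag = 1 if feature in needed else 0
        lines.append(f"#define WP_ENABLE_{feature.upper()} {flag}")
    lines.append('#include "builtin.h"')
    return "\n".join(lines)


if __name__ == "__main__":
    # A kernel that only performs a mesh query enables only the mesh header.
    print(header_prelude(["add", "mesh_query_point"]))
```

Under this scheme a scalar-only kernel would set every guard to 0, so the compiler would never parse the corresponding headers. The same prelude could be passed to both the Clang and NVRTC paths, since it relies only on the preprocessor.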
Context
While investigating bfloat16 support for Warp, the new headers were found to increase CPU cold-compile times by ~60% for unrelated kernels, including kernels that never touch bfloat16. The cause is that builtin.h unconditionally includes all native headers (mesh, volume, tile, noise, mat, float16 adjoint instantiations, etc.), so every kernel pays the full parsing cost regardless of what it actually uses. For example, a simple scalar assignment kernel takes ~1.6s to cold-compile on CPU, and the vast majority of that time is spent parsing headers rather than compiling the kernel itself.
As more features are added to Warp, this problem will only get worse. Each new header inclusion raises the compile-time floor for every kernel.