
Copilot AI commented Nov 16, 2025

Implements debug logging wrappers for all CUDA, cuBLAS, and cuDNN API calls and kernel launches used in src/neural/backends/cuda, to aid in debugging GPU operations.

Changes

New file: src/neural/backends/cuda/cuda_wrapper.h

  • 60+ inline wrapper functions covering:
    • CUDA Runtime API (cudaMalloc, cudaMemcpy, cudaStreamCreate*, cudaEvent*, etc.)
    • cuBLAS API (cublasSgemm, cublasHgemm, cublasGemmStridedBatchedEx, etc.)
    • cuDNN API (cudnnConvolutionForward, cudnnSetTensor4dDescriptor, etc.)
  • CUDA_KERNEL_LAUNCH_LOG macro for logging all kernel launches
  • Template-based wrappers for pointer parameters to avoid ambiguous overload errors
  • All wrappers in lczero::cudnn_backend namespace
  • Each logs function name, input parameters, and return values via LOGFILE
  • Kernel launch logging includes kernel name, grid/block dimensions, shared memory, and stream
  • Controlled by CUDA_WRAPPER_DEBUG compile-time flag (default: 0 for zero overhead)

Modified: src/neural/backends/cuda/cuda_common.h

  • Added #include "cuda_wrapper.h"

Modified: src/neural/backends/cuda/common_kernels.cu

  • Added logging before all 34 kernel launches

Modified: src/neural/backends/cuda/fp16_kernels.cu

  • Added logging before all 17 kernel launches

Implementation

Wrappers rely on C++ name lookup: unqualified calls inside lczero::cudnn_backend resolve to the namespace-scoped wrappers first, and each wrapper invokes the global API with explicit qualification (e.g. ::cudaMalloc) to avoid calling itself recursively.

Template Design: Wrapper functions use C++ templates for pointer parameters (e.g., template <typename T> cudaError_t cudaMalloc(T** devPtr, size_t size)) to:

  • Avoid ambiguous overload errors with void* and const void*
  • Support proper type deduction from call sites
  • Maintain compatibility with all CUDA API pointer types (float*, half*, int*, etc.)
  • Forward correctly to the original CUDA API functions

Kernel launch logging is added before each kernel launch using the CUDA_KERNEL_LAUNCH_LOG macro, which logs the kernel name and execution configuration.

Example output when CUDA_WRAPPER_DEBUG=1:

[CUDA_WRAPPER] cudaMalloc(devPtr=0x7ffe1234, size=1024)
[CUDA_WRAPPER] cudaMalloc -> no error, *devPtr=0x7f8b00000000
[CUDA_WRAPPER] Kernel launch: addVectors_kernel grid=(4,1,1) block=(256,1,1) smem=0 stream=0x...
[CUDA_WRAPPER] cublasSgemm(handle=0x..., m=64, n=64, k=128, ...)
[CUDA_WRAPPER] cublasSgemm -> 0

All existing CUDA API calls and kernel launches in the backend route transparently through the wrappers, with no changes required at call sites.


Original prompt

Implement debug print wrapper functions for all used cuda and cublass calls in
src/neural/backends/cuda. Create a new header called cuda_wrapper.h. Ensure that
each wrapper function logs the function name, input parameters, and output
parameters. Wrappers should use the same signatures as the original functions
but be in lczero::cudnn_backend namespace. Commit changes to a new branch called cuda-backend-graph-debug



Copilot AI self-assigned this Nov 16, 2025
Copilot AI changed the title [WIP] Add debug print wrapper functions for CUDA and cuBLAS calls Add debug wrapper functions for CUDA/cuBLAS/cuDNN API calls Nov 16, 2025
Copilot AI requested a review from Menkib64 November 16, 2025 08:22
@Menkib64
Owner

@copilot Add logging to all kernel launches in src/neural/backends/cuda. Logging should be conditionally enabled using CUDA_WRAPPER_DEBUG define.

- Added CUDA_KERNEL_LAUNCH_LOG macro to cuda_wrapper.h
- Macro logs kernel name, grid dimensions, block dimensions, shared memory, and stream
- Added logging before all 51 kernel launches in common_kernels.cu and fp16_kernels.cu
- Logging is conditionally enabled via CUDA_WRAPPER_DEBUG define (default: disabled)

Co-authored-by: Menkib64 <[email protected]>
Author

Copilot AI commented Nov 16, 2025

@copilot Add logging to all kernel launches in src/neural/backends/cuda. Logging should be conditionally enabled using CUDA_WRAPPER_DEBUG define.

Added kernel launch logging for all 51 CUDA kernel launches in src/neural/backends/cuda (34 in common_kernels.cu + 17 in fp16_kernels.cu).

The logging is controlled by the CUDA_WRAPPER_DEBUG define and logs:

  • Kernel name
  • Grid dimensions (x, y, z)
  • Block dimensions (x, y, z)
  • Shared memory size
  • Stream pointer

Commit: 10250a5

Copilot AI changed the title Add debug wrapper functions for CUDA/cuBLAS/cuDNN API calls Add debug wrapper functions for CUDA/cuBLAS/cuDNN API calls and kernel launches Nov 16, 2025
@Menkib64
Owner

@copilot Make all wrappers use template types to fix ambiguous overload compiler errors.

- Converted cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMemset to use templates
- Converted cudaFuncSetAttribute to use template for function pointer
- Converted cublasGemmStridedBatchedEx to use templates for data pointers
- Converted cudnnConvolutionForward, cudnnConvolutionBiasActivationForward to use templates
- Converted cudnnActivationForward, cudnnAddTensor to use templates
- Templates allow proper type deduction and avoid ambiguous overload errors
- All templates forward to the original CUDA API with appropriate casts where needed

Co-authored-by: Menkib64 <[email protected]>
Author

Copilot AI commented Nov 16, 2025

@copilot Make all wrappers use template types to fix ambiguous overload compiler errors.

Converted all pointer-based wrappers to use template types to avoid ambiguous overload errors.

Changed wrappers:

  • CUDA Runtime: cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMemset, cudaFuncSetAttribute
  • cuBLAS: cublasGemmStridedBatchedEx
  • cuDNN: cudnnConvolutionForward, cudnnConvolutionBiasActivationForward, cudnnActivationForward, cudnnAddTensor

Templates enable proper type deduction from call sites and forward correctly to the original CUDA API functions.

Commit: acf5cfa
