
Copilot AI commented Nov 16, 2025

Implements debug logging wrappers for all CUDA, cuBLAS, and cuDNN API calls and kernel launches used in src/neural/backends/cuda, to aid in debugging GPU operations.

Changes

New file: src/neural/backends/cuda/cuda_wrapper.h

  • 60+ inline wrapper functions covering:
    • CUDA Runtime API (cudaMalloc, cudaMemcpy, cudaStreamCreate*, cudaEvent*, etc.)
    • cuBLAS API (cublasSgemm, cublasHgemm, cublasGemmStridedBatchedEx, etc.)
    • cuDNN API (cudnnConvolutionForward, cudnnSetTensor4dDescriptor, etc.)
  • CUDA_KERNEL_LAUNCH_LOG macro for logging all kernel launches
  • Template-based wrappers for pointer parameters to avoid ambiguous overload errors
  • All wrappers in lczero::cudnn_backend namespace
  • Each logs function name, input parameters, and return values via LOGFILE
  • Kernel launch logging includes kernel name, grid/block dimensions, shared memory, and stream
  • Controlled by CUDA_WRAPPER_DEBUG compile-time flag (default: 0 for zero overhead)

Modified: src/neural/backends/cuda/cuda_common.h

  • Added #include "cuda_wrapper.h"

Modified: src/neural/backends/cuda/common_kernels.cu

  • Added logging before all 34 kernel launches

Modified: src/neural/backends/cuda/fp16_kernels.cu

  • Added logging before all 17 kernel launches

Implementation

Wrappers rely on C++ name lookup: unqualified calls inside lczero::cudnn_backend resolve to the namespace-scoped wrappers first, and each wrapper invokes the global API with explicit qualification (e.g. ::cudaMalloc) to avoid calling itself recursively.

Template Design: Wrapper functions use C++ templates for pointer parameters (e.g., template <typename T> cudaError_t cudaMalloc(T** devPtr, size_t size)) to:

  • Avoid ambiguous overload errors with void* and const void*
  • Support proper type deduction from call sites
  • Maintain compatibility with all CUDA API pointer types (float*, half*, int*, etc.)
  • Forward correctly to the original CUDA API functions

Kernel launch logging is added before each kernel launch using the CUDA_KERNEL_LAUNCH_LOG macro, which logs the kernel name and execution configuration.

Example output when CUDA_WRAPPER_DEBUG=1:

[CUDA_WRAPPER] cudaMalloc(devPtr=0x7ffe1234, size=1024)
[CUDA_WRAPPER] cudaMalloc -> no error, *devPtr=0x7f8b00000000
[CUDA_WRAPPER] Kernel launch: addVectors_kernel grid=(4,1,1) block=(256,1,1) smem=0 stream=0x...
[CUDA_WRAPPER] cublasSgemm(handle=0x..., m=64, n=64, k=128, ...)
[CUDA_WRAPPER] cublasSgemm -> 0

All existing CUDA API calls and kernel launches in the backend route transparently through the wrappers, with no changes required at call sites.


Original prompt

Implement debug print wrapper functions for all used cuda and cublass calls in
src/neural/backends/cuda. Create a new header called cuda_wrapper.h. Ensure that
each wrapper function logs the function name, input parameters, and output
parameters. Wrappers should use the same signatures as the original functions
but be in lczero::cudnn_backend namespace. Commit changes to a new branch called cuda-backend-graph-debug



Copilot AI self-assigned this Nov 16, 2025
Copilot AI changed the title [WIP] Add debug print wrapper functions for CUDA and cuBLAS calls Add debug wrapper functions for CUDA/cuBLAS/cuDNN API calls Nov 16, 2025
Copilot AI requested a review from Menkib64 November 16, 2025 08:22
@Menkib64
Owner

@copilot Add logging to all kernel launches in src/neural/backends/cuda. Logging should be conditionally enabled using CUDA_WRAPPER_DEBUG define.

- Added CUDA_KERNEL_LAUNCH_LOG macro to cuda_wrapper.h
- Macro logs kernel name, grid dimensions, block dimensions, shared memory, and stream
- Added logging before all 51 kernel launches in common_kernels.cu and fp16_kernels.cu
- Logging is conditionally enabled via CUDA_WRAPPER_DEBUG define (default: disabled)

Co-authored-by: Menkib64 <[email protected]>
Author

Copilot AI commented Nov 16, 2025

@copilot Add logging to all kernel launches in src/neural/backends/cuda. Logging should be conditionally enabled using CUDA_WRAPPER_DEBUG define.

Added kernel launch logging for all 51 CUDA kernel launches in src/neural/backends/cuda (34 in common_kernels.cu + 17 in fp16_kernels.cu).

The logging is controlled by the CUDA_WRAPPER_DEBUG define and logs:

  • Kernel name
  • Grid dimensions (x, y, z)
  • Block dimensions (x, y, z)
  • Shared memory size
  • Stream pointer

Commit: 10250a5

Copilot AI changed the title Add debug wrapper functions for CUDA/cuBLAS/cuDNN API calls Add debug wrapper functions for CUDA/cuBLAS/cuDNN API calls and kernel launches Nov 16, 2025
@Menkib64
Owner

@copilot Make all wrappers use template types to fix ambiguous overload compiler errors.

- Converted cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMemset to use templates
- Converted cudaFuncSetAttribute to use template for function pointer
- Converted cublasGemmStridedBatchedEx to use templates for data pointers
- Converted cudnnConvolutionForward, cudnnConvolutionBiasActivationForward to use templates
- Converted cudnnActivationForward, cudnnAddTensor to use templates
- Templates allow proper type deduction and avoid ambiguous overload errors
- All templates forward to the original CUDA API with appropriate casts where needed

Co-authored-by: Menkib64 <[email protected]>
Author

Copilot AI commented Nov 16, 2025

@copilot Make all wrappers use template types to fix ambiguous overload compiler errors.

Converted all pointer-based wrappers to use template types to avoid ambiguous overload errors.

Changed wrappers:

  • CUDA Runtime: cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMemset, cudaFuncSetAttribute
  • cuBLAS: cublasGemmStridedBatchedEx
  • cuDNN: cudnnConvolutionForward, cudnnConvolutionBiasActivationForward, cudnnActivationForward, cudnnAddTensor

Templates enable proper type deduction from call sites and forward correctly to the original CUDA API functions.

Commit: acf5cfa
