Conversation

@ContradNamiseb

No description provided.

ContradNamiseb and others added 30 commits July 6, 2025 11:49
…imit was set to 0, moving 'set node_limit to 4000000000 if it wasn't initialized' from stopper.h to common.cc (#2056)

fix for go nodes 0
OpenCL extensions can make incompatible changes, which would require
careful version checks in the C++ bindings. Unfortunately, the C++
bindings do not maintain backward compatibility.

Many systems ship matching C++ and C bindings that are known to work
together, so we can use the system header if it exists. The CL directory
is the standard location; macOS uses a custom OpenCL directory, but it
may only include the C headers.
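A minimal sketch of that header selection, assuming a C++17 toolchain; the exact include paths and the bundled fallback name are assumptions, not the project's actual build logic.

```cpp
// Prefer a matching system C++ binding when it exists; macOS only ships the
// C headers in its custom OpenCL directory. The fallback header is hypothetical.
#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120

#if defined(__has_include)
#  if __has_include(<CL/opencl.hpp>)
#    define HAS_SYSTEM_CL_HPP 1
#  endif
#endif

#if defined(HAS_SYSTEM_CL_HPP)
#  include <CL/opencl.hpp>           // System C++ binding from the standard CL/ directory.
#elif defined(__APPLE__)
#  include <OpenCL/cl.h>             // Apple's OpenCL framework provides only C headers.
#else
#  include "third_party/opencl.hpp"  // Hypothetical bundled copy of the C++ binding.
#endif
```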
* Move bitboard expansion to gpu.

* Fix optimized gpu code path for bitboard expansion.

* Fix broadcast to work correctly.

* Remove unneeded code.

* Fix warnings.

* Minor fixes

* Remove forwardEvalLegacy() and other alternate codepaths except the bitwiseANDWithPrimaryTensor

* Remove unused variables and memory.
Also adds onnxruntime version to cache prefix and removes random name from onnx model.
* Move bitboard expansion to gpu.

* Fix optimized gpu code path for bitboard expansion.

* Fix broadcast to work correctly.

* Remove unneeded code.

* Fix warnings.

* Minor nit.

* Minor fixes

* Debug stuff

* Move attention policy promo offset calculation to gpu.

* Move attention and convolution policy mapping to gpu.

* Remove policymapping from legacy codepath.

* Remove forwardEvalLegacy() and other alternate codepaths except the bitwiseANDWithPrimaryTensor

* Remove unused variables and memory.

* Update change
ROCm version 7 is under development and will include hipBLAS 3. The new
API uses different enums for the datatype and compute type. We can update
our code to the new API, with macros mapping the new compute types to the
old values for version 2 support.
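A hypothetical sketch of that macro-mapping idea: the code is written against the hipBLAS 3 compute-type names, and a shim maps them back to the old enum values when building against hipBLAS 2. The version macro and enum names here are assumptions about the hipBLAS headers.

```cpp
#include <hipblas/hipblas.h>

// When building against hipBLAS 2, map the new compute-type names onto the
// old datatype enum so the rest of the code can use the hipBLAS 3 names.
// (The version macro and enum names are assumptions.)
#if HIPBLAS_VERSION_MAJOR < 3
#define HIPBLAS_COMPUTE_32F HIPBLAS_R_32F
#define HIPBLAS_COMPUTE_16F HIPBLAS_R_16F
#endif
```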
-add auto threads and batching
-add 'show platforms' to see the platforms supported by each device
-add enhancements to the device info output
…ct usage (#2235)

Co-authored-by: mooskagh <[email protected]>
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Menkib64 and others added 30 commits November 1, 2025 21:07
* reuse onnx input/output tensor memory

* bind cuda mem

* print some cuda device info

* use fixed device addresses in preparation for cuda graph

* Allow concurrent compute and DMA memory transfer

* Delay network evaluation until previous download has completed

* Add expandPlanes cuda kernel to onnx

* Remove extra waits from multi step evaluation when using onnx-trt

* Let Ort::IoBinding manage tensor object lifetime (see the sketch after this list)

* Improve onnx-trt optimiser for fixed size inputs

* Add GPU and board ID print to onnx-trt

* Check if CUDA version supports PCI information.

* Add warnings if CUDA support isn't enabled.

* Always optimise to the largest batch sizes.

Very small batches require a separate optimisation; optimising for batch
size 1 costs too much performance for small sizes. Adding a special
optimisation for very small batches won't be a simple change and should
be left for a future change.
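Picking up the Ort::IoBinding item referenced above, here is a minimal sketch of reusing pre-allocated input/output tensors across evaluations, assuming the ONNX Runtime C++ API; the tensor names, shapes, and batch size are placeholders rather than the backend's actual values.

```cpp
#include <onnxruntime_cxx_api.h>

#include <array>
#include <cstdint>
#include <vector>

void RunBound(Ort::Session& session) {
  Ort::MemoryInfo cpu_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

  // Fixed buffers that stay alive for the lifetime of the binding, so the
  // same addresses are reused on every evaluation.
  std::vector<float> input(256 * 112 * 64);
  std::vector<float> policy(256 * 1858);
  std::array<int64_t, 4> in_shape{256, 112, 8, 8};
  std::array<int64_t, 2> out_shape{256, 1858};

  Ort::Value in_tensor = Ort::Value::CreateTensor<float>(
      cpu_info, input.data(), input.size(), in_shape.data(), in_shape.size());
  Ort::Value out_tensor = Ort::Value::CreateTensor<float>(
      cpu_info, policy.data(), policy.size(), out_shape.data(), out_shape.size());

  Ort::IoBinding binding(session);
  binding.BindInput("/input/planes", in_tensor);    // Placeholder tensor names.
  binding.BindOutput("/output/policy", out_tensor);

  session.Run(Ort::RunOptions{nullptr}, binding);   // Results land in `policy`.
}
```

Keeping buffer addresses fixed like this is in the spirit of the "use fixed device addresses in preparation for cuda graph" item above.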
* Add basic NVTX tracing support

This adds a few useful basic annotations to Nsight Systems profiles,
which helps compare CPU execution speed to GPU speed. Only a few
annotations are added, at the most likely suspects for causing issues
(a minimal example follows this list).

* Add basic Perfetto support

* Add a few more useful trace scopes to classic search
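As a minimal example of the kind of NVTX annotation described above, assuming the classic nvToolsExt API; the RAII wrapper name is hypothetical.

```cpp
#include <nvToolsExt.h>

// RAII scope that shows up as a named range in Nsight Systems timelines.
class NvtxScope {
 public:
  explicit NvtxScope(const char* name) { nvtxRangePushA(name); }
  ~NvtxScope() { nvtxRangePop(); }
};

void EvaluateBatch() {
  NvtxScope scope("EvaluateBatch");  // Range covers the whole evaluation.
  // ... enqueue GPU work and wait for results ...
}
```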
* also avoid using direct mish operator
* Fix transposition kernel missing stream

* Add separate download and upload streams to cuda

* Add graph capture support to cuda backend

* Capture graphs when the network is constructed (a small capture sketch follows this list)

* Use CPU fp16 conversion in cuda backend

* Add option to disable graphs in cuda backend

* Fix windows narrowing conversion errors

* Add missing stream arguments to cuda kernels

* Make it easier to catch errors inside graph capture

* Remove external events from cudnn graphs

* Add debug symbols to CUDA objects

* Add missing type conversions to GetP and GetM

* Fix is_same type detection

* Use nvcc in cudnn library path

* Only use external events when CUDA >= 11.1

* Only use cudaGraphUpload when CUDA is at least 11.1

* No need to wait for upload when CUDA < 11.1

* Always use CPU for cuda datatype conversions

* Use GPU to generate offset pointers

* Remove duplicated expandPlanes.
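As referenced in the graph-capture item above, here is a minimal sketch of stream-capture-based graph replay; `EnqueueNetworkEval` is a hypothetical stand-in for enqueuing one full network evaluation, and this is illustrative rather than the backend's actual capture logic.

```cpp
#include <cuda_runtime.h>

void EnqueueNetworkEval(cudaStream_t stream);  // Hypothetical: enqueues one full eval.

cudaGraphExec_t CaptureEval(cudaStream_t stream) {
  cudaGraph_t graph = nullptr;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
  EnqueueNetworkEval(stream);              // Work is recorded, not executed.
  cudaStreamEndCapture(stream, &graph);

  cudaGraphExec_t exec = nullptr;
  cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12 signature; older toolkits differ.
  cudaGraphDestroy(graph);
  return exec;
}

// Replaying the captured evaluation later is a single call:
//   cudaGraphLaunch(exec, stream);
```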
* fix onnx locking

* trt_bf16_enable first defined in onnxruntime 1.23

* move lock outside loop and simplify
* Add threads and basic statistics to backendbench

* Fix statistics calculation problems

* Only sort values which were written
* Also add a fallback bit_cast implementation for gcc 10 (see the sketch below)
* add fp16 conversion test
* use bit_cast in fp16 conversions
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: mooskagh <[email protected]>
Co-authored-by: borg323 <[email protected]>
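A sketch of the gcc 10 fallback mentioned above: std::bit_cast is C++20 (available in gcc 11+), so older compilers get a memcpy-based substitute; the helper at the end is illustrative.

```cpp
#include <cstdint>
#include <cstring>
#if __has_include(<version>)
#include <version>
#endif

#if defined(__cpp_lib_bit_cast)
#include <bit>
using std::bit_cast;
#else
// Pre-C++20 fallback: memcpy is the well-defined way to reinterpret bits.
template <typename To, typename From>
To bit_cast(const From& from) {
  static_assert(sizeof(To) == sizeof(From), "bit_cast requires equal sizes");
  To to;
  std::memcpy(&to, &from, sizeof(To));
  return to;
}
#endif

// Example use in an fp32 <-> fp16 conversion helper: grab the raw float bits.
inline uint32_t FloatBits(float f) { return bit_cast<uint32_t>(f); }
```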
cudaSetDevice may block waiting for an unknown driver resource even when
the GPU doesn't change. We can use cudaGetDevice to check whether the GPU
has changed before calling cudaSetDevice.

Fixes random GPU idle periods when using demux on Windows.
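A minimal sketch of that check (CUDA runtime API):

```cpp
#include <cuda_runtime.h>

void EnsureDevice(int gpu_id) {
  int current = -1;
  cudaGetDevice(&current);
  if (current != gpu_id) {
    cudaSetDevice(gpu_id);  // Only pay the potentially blocking call on a real switch.
  }
}
```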
* Reduce lock contention in demux
* Add preferred batch step to cuda and xla backends
* Fix slow performance when the last backend gets only a partial split
* Add onnx-migraphx

MIGraphX supports ONNX model compilation. The ONNX EP has poor handling
of variable batch sizes, which causes major slowdowns. The workaround is
to use fixed batch sizes, which requires user configuration to get the
best performance.
1. Removes the need for `cpu_provider_factory.h`, as the CPU execution provider is always available.
2. Fixes the onnx include path to work with various installation options or directly with the build top-level include directory (instead of needing the correct subdir).
3. Moves onnx configuration includes to `onnx_conf.h`.
4. Uses the new `AppendExecutionProvider()` for DML (see the sketch after this list).
5. While doing the above, I noticed DML has a `performance_preference` attribute that can be set to `high_performance`.
6. Windows binaries for onnx-trt are built with cuda.
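A minimal sketch of items 4 and 5, assuming the generic `Ort::SessionOptions::AppendExecutionProvider()` overload that takes provider options as string key/value pairs:

```cpp
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions MakeDmlOptions() {
  Ort::SessionOptions options;
  // Register the DML execution provider and prefer the high-performance GPU.
  options.AppendExecutionProvider(
      "DML", {{"performance_preference", "high_performance"}});
  return options;
}
```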
Using fp16 for the first layernorm stage (as DML does) is OK, except for some networks with ReLU^2 FFN activation, where the following layernorm overflowed.
Also fixes a bug when the FFN activation is different from the default one.
Extensive testing shows the alt_mish expansion has acceptable performance in both fp16 and bf16, with the main issue that it goes to zero faster for negative inputs. The worst fp16 error was at -11.093750, where the returned value was 0 instead of -0.000168702, with the bf16 version very close to the direct calculation (in bf16).
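For illustration, one algebraically equivalent way to expand mish without tanh/log is shown below; whether this matches the repository's alt_mish exactly is an assumption.

```cpp
#include <cmath>

// Direct form: mish(x) = x * tanh(ln(1 + e^x)).
float MishDirect(float x) { return x * std::tanh(std::log1p(std::exp(x))); }

// Equivalent expansion that avoids tanh/log: with n = (1 + e^x)^2,
// mish(x) = x * (n - 1) / (n + 1).  (Hypothetical alt_mish stand-in.)
float MishAlt(float x) {
  float e = std::exp(x);
  float n = (1.0f + e) * (1.0f + e);
  return x * (n - 1.0f) / (n + 1.0f);
}
```

In reduced precision, 1 + e^x rounds to 1 once e^x is small enough, so n - 1 becomes exactly 0 and the result collapses to 0, which is consistent with the worst-case fp16 value quoted above.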
Co-authored-by: Ankan Banerjee <[email protected]>
Co-authored-by: borg323 <[email protected]>
g++ generates bogus warnings about too large allocations when using LTO.
* build with cutlass by default
* build.cmd doesn't set cutlass_include any more
* disables cutlass build when no suitable cuda arch available
* Use strongly typed onnx-trt graphs

Strongly typed networks prevent TensorRT from making bad type conversions
when building engines, making it less likely that onnx-trt builds bad
engines on Windows.

A strongly typed network also requires fewer configurations to test,
which reduces build times (a sketch of the TensorRT-level flag follows below).

* Control quantization with optimize option
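For context, a sketch of what "strongly typed" means at the TensorRT API level; the onnx-trt backend reaches this through execution-provider options rather than this direct builder call (an assumption here), and the flag name assumes TensorRT 9 or newer.

```cpp
#include <NvInfer.h>

nvinfer1::INetworkDefinition* CreateStronglyTypedNetwork(nvinfer1::IBuilder& builder) {
  // With a strongly typed network, tensor types come from the ONNX model and
  // TensorRT may not reassign them while building the engine.
  const auto flags =
      1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);
  return builder.createNetworkV2(flags);
}
```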