Conversation

@ContradNamiseb

No description provided.

borg323 and others added 30 commits October 28, 2024 23:29
* update circleci
* update appveyor android builds
* update appveyor to vs2019
* build with c++20
* fix for vs2019 breakage
* update circleci meson for macos
The src directory is re-organized as follows:
1. The benchmark and lc0ctl directories are merged into tools.
2. All backends are moved into neural/backends.
3. The neural/onnx and neural/xla directories remain but only contain the conversion functions and not any backend code.
4. The neural/tables directory contains some generic information that used to be in neural/shared (activation functions, policy maps etc.) that are used by both backends and the onnx converter.
5. The rest of neural/shared is moved to neural/backends/shared/.
6. The rescorer is moved into the trainingdata directory.
* Generate AUTHORS file.

* Address review comment.
* Atomic vector

* Lockless MemCache

* Lockless threadsafe wrapper.
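The lockless pieces above are only named here; as a rough illustration of the "atomic vector" idea (a fixed-capacity buffer whose size is advanced atomically so writers need no mutex), here is a minimal hedged sketch. The class and member names are made up for illustration, and the sketch omits the publication handshake real code needs before readers may consume a freshly written slot.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Illustrative only: a fixed-capacity, append-only vector where PushBack
// claims a slot with an atomic fetch_add instead of taking a lock.
template <typename T, std::size_t kCapacity>
class AtomicVectorSketch {
 public:
  // Claims a slot and stores the value; returns the index, or nullopt if full.
  std::optional<std::size_t> PushBack(const T& value) {
    std::size_t idx = size_.fetch_add(1, std::memory_order_acq_rel);
    if (idx >= kCapacity) {
      size_.fetch_sub(1, std::memory_order_acq_rel);
      return std::nullopt;
    }
    data_[idx] = value;  // NOTE: real code must publish this write before readers use the slot.
    return idx;
  }

  std::size_t size() const {
    return std::min(size_.load(std::memory_order_acquire), kCapacity);
  }
  const T& operator[](std::size_t i) const { return data_[i]; }

 private:
  std::array<T, kCapacity> data_{};
  std::atomic<std::size_t> size_{0};
};
```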

* Bugfix

* Build fix.

* Fixing elo-gaining bug.
Menkib64 and others added 30 commits November 1, 2025 21:07
* reuse onnx input/output tensor memory

* bind cuda mem

* print some cuda device info

* use fixed device addresses in preparation for cuda graph

* Allow concurrent compute and DMA memory transfer

* Delay network evaluation until previous download has completed
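As a rough sketch of how concurrent compute and DMA transfer can be expressed (separate CUDA streams plus an event, so the compute stream waits only on its own upload rather than on all prior work), assuming placeholder stream, event, and buffer names rather than the backend's actual ones:

```cpp
#include <cuda_runtime.h>

// Illustrative only: enqueue the next batch's upload on a dedicated stream
// while the compute stream may still be busy with the previous batch.
void EnqueueEval(const float* host_in, float* dev_in, size_t bytes,
                 cudaStream_t upload_stream, cudaStream_t compute_stream,
                 cudaEvent_t upload_done) {
  // Asynchronous host-to-device copy on the upload stream.
  cudaMemcpyAsync(dev_in, host_in, bytes, cudaMemcpyHostToDevice, upload_stream);
  cudaEventRecord(upload_done, upload_stream);
  // The compute stream waits only for this transfer to complete.
  cudaStreamWaitEvent(compute_stream, upload_done, 0);
  // LaunchNetworkKernels(compute_stream);  // placeholder for the backend's kernels
}
```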

* Add expandPlanes cuda kernel to onnx

* Remove extra waits from multi step evaluation when using onnx-trt

* Let Ort::IoBinding manage tensor object lifetime
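A minimal hedged sketch of the Ort::IoBinding pattern these commits move to: pre-allocated buffers are wrapped once as tensors, bound by name, and reused across Run() calls. The tensor names and shapes below are placeholders (112 input planes and 1858 policy outputs are just the usual Lc0 sizes used for illustration).

```cpp
#include <onnxruntime_cxx_api.h>
#include <vector>

// Illustrative only: bind fixed input/output buffers so repeated evaluations
// reuse the same tensor memory instead of allocating per call.
void RunWithIoBinding(Ort::Session& session, Ort::MemoryInfo& mem_info,
                      float* input_buf, float* policy_buf, int batch) {
  std::vector<int64_t> in_shape{batch, 112, 8, 8};
  std::vector<int64_t> out_shape{batch, 1858};

  Ort::Value input = Ort::Value::CreateTensor<float>(
      mem_info, input_buf, static_cast<size_t>(batch) * 112 * 8 * 8,
      in_shape.data(), in_shape.size());
  Ort::Value policy = Ort::Value::CreateTensor<float>(
      mem_info, policy_buf, static_cast<size_t>(batch) * 1858,
      out_shape.data(), out_shape.size());

  Ort::IoBinding binding(session);
  binding.BindInput("/input/planes", input);    // placeholder tensor names
  binding.BindOutput("/output/policy", policy);

  // The binding keeps referencing the same addresses, so repeated Run()
  // calls avoid re-binding or copying tensors.
  session.Run(Ort::RunOptions{}, binding);
}
```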

* Improve onnx-trt optimiser for fixed size inputs

* Add GPU and board ID print to onnx-trt

* Check if the CUDA version supports PCI information.

* Add warnings if CUDA support isn't enabled.
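For the GPU/board ID print and the PCI-information check, a hedged sketch using the standard CUDA runtime calls (the output format is illustrative, not what the backend actually prints):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative only: print name, compute capability and PCI bus ID of a GPU.
void PrintCudaDeviceInfo(int device) {
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
    std::fprintf(stderr, "Warning: CUDA device query failed for GPU %d.\n", device);
    return;
  }
  char bus_id[32] = {};
  // Only meaningful when the CUDA runtime exposes PCI information.
  cudaDeviceGetPCIBusId(bus_id, static_cast<int>(sizeof(bus_id)), device);
  std::printf("GPU %d: %s, SM %d.%d, PCI %s\n", device, prop.name,
              prop.major, prop.minor, bus_id);
}
```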

* Always optimise to the largest batch sizes.

Very small batches require a separate optimisation: optimising for batch size 1 costs too much performance at larger sizes. Adding a special optimisation for very small batches isn't a simple change and should be left for a future change.
* Add basic NVTX tracing support

This adds a few useful basic annotations to Nsight Systems profiles. It helps compare CPU execution speed to GPU speed. Only the most likely suspects for causing issues are annotated.
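A minimal sketch of an NVTX range annotation of the kind being added; the scope name is illustrative, and the header is the nvtx3 one shipped with recent CUDA toolkits (older toolkits use `nvToolsExt.h`):

```cpp
#include <nvtx3/nvToolsExt.h>

// Illustrative only: a named range that shows up on this thread's row in
// Nsight Systems, making CPU-side cost visible next to the GPU timeline.
void EvaluateBatch() {
  nvtxRangePushA("EvaluateBatch");  // open a named range
  // ... enqueue GPU work here ...
  nvtxRangePop();                   // close the most recent range
}
```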

* Add basic Perfetto support

* Add a few more useful trace scopes to classic search
* Also avoid using the direct mish operator
* Fix transposition kernel missing stream

* Add separate download and upload streams to cuda

* Add graph capture support to cuda backend

* Capture graphs when the network is constructed
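A hedged sketch of the capture-then-replay pattern the graph commits describe: record the backend's kernel launches into a CUDA graph once at construction time, then replay it per evaluation. The enqueue call is a placeholder, and `cudaGraphInstantiateWithFlags` assumes CUDA 11.4+ (older toolkits use the 5-argument `cudaGraphInstantiate`).

```cpp
#include <cuda_runtime.h>

// Illustrative only: capture whatever the backend enqueues on `stream` into
// an executable graph that can later be replayed with cudaGraphLaunch.
cudaGraphExec_t CaptureEvalGraph(cudaStream_t stream) {
  cudaGraph_t graph = nullptr;
  cudaGraphExec_t exec = nullptr;

  cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
  // EnqueueNetworkKernels(stream);  // placeholder: the backend's kernel launches
  cudaStreamEndCapture(stream, &graph);

  cudaGraphInstantiateWithFlags(&exec, graph, 0);  // CUDA 11.4+
  cudaGraphDestroy(graph);
  return exec;
}

// Each evaluation then replays the captured work with a single call:
//   cudaGraphLaunch(exec, stream);
```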

* Use CPU fp16 conversion in cuda backend

* Add option to disable graphs in cuda backend

* Fix windows narrowing conversion errors

* Add missing stream arguments to cuda kernels

* Make it easier to detect errors inside graph capture

* Remove external events from cudnn graphs

* Add debug symbols to CUDA objects

* Add missing type conversions to GetP and GetM

* Fix is_same type detection

* Use nvcc in cudnn library path

* Only use external events when CUDA >= 11.1

* Only use cudaGraphUpload when CUDA is at least 11.1

* No need to wait for upload when CUDA < 11.1
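A small sketch of how the CUDA >= 11.1 guards can look at compile time, assuming a hypothetical helper name; `CUDART_VERSION` is 11010 for CUDA 11.1:

```cpp
#include <cuda_runtime.h>

// Illustrative only: prefetch the executable graph to the device when the
// toolkit supports it, so the first cudaGraphLaunch doesn't pay the upload.
void UploadGraphIfSupported(cudaGraphExec_t exec, cudaStream_t stream) {
#if CUDART_VERSION >= 11010
  cudaGraphUpload(exec, stream);
#else
  (void)exec;
  (void)stream;  // Older toolkits: nothing to do, launch uploads implicitly.
#endif
}
```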

* Always use CPU for cuda datatype conversions

* Use GPU to generate offset pointers

* Remove duplicated expandPlanes.
* fix onnx locking

* trt_bf16_enable first defined in onnxruntime 1.23

* move lock outside loop and simplify
* Add threads and basic statistics to backendbench

* Fix statistics calculation problems

* Only sort values which were written
* Also add fallback bit_cast implementation for gcc 10
* add fp16 conversion test
* use bit_cast in fp16 conversions
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: mooskagh <[email protected]>
Co-authored-by: borg323 <[email protected]>
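A hedged sketch of a `std::bit_cast` wrapper with a `memcpy` fallback for toolchains like gcc 10 whose standard library lacks it; the helper names are illustrative, not the ones used in the tree:

```cpp
#include <cstdint>
#include <cstring>
#include <version>
#if defined(__cpp_lib_bit_cast)
#include <bit>
#endif

// Illustrative only: bit-preserving cast with a portable fallback.
template <typename To, typename From>
To BitCastCompat(const From& from) {
#if defined(__cpp_lib_bit_cast)
  return std::bit_cast<To>(from);
#else
  static_assert(sizeof(To) == sizeof(From), "size mismatch");
  To to;
  std::memcpy(&to, &from, sizeof(To));  // no UB, compiles to a register move
  return to;
#endif
}

// Example: reinterpret the raw bits of a float, e.g. for fp16 conversions.
inline std::uint32_t FloatToBits(float f) { return BitCastCompat<std::uint32_t>(f); }
```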
cudaSetDevice may block waiting for an unknown driver resource even when the GPU doesn't change. We can use cudaGetDevice to check whether the GPU has changed before calling cudaSetDevice.

Fixes random GPU idle periods when using demux on Windows.
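A minimal sketch of the fix described above, assuming a hypothetical helper name:

```cpp
#include <cuda_runtime.h>

// Illustrative only: skip the potentially blocking cudaSetDevice call when
// the calling thread is already bound to the wanted GPU.
void EnsureDevice(int wanted) {
  int current = -1;
  cudaGetDevice(&current);  // cheap query of the thread's current GPU
  if (current != wanted) {
    cudaSetDevice(wanted);
  }
}
```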
* Reduce lock contention in demux
* Add preferred batch step to cuda and xla backends
* Fix slow performance when the last backend gets only a partial split
* Add onnx-migraphx

MIGraphX supports compiling ONNX models. The ONNX EP has poor handling of variable batch sizes, which causes major slowdowns. The workaround is to use fixed batch sizes, which requires user configuration to get the best performance.
1. Removes the need for `cpu_provider_factory.h` as the cpu execution provider is always available.
2. Fixes the onnx include path to work with various installation options or directly with the build's top-level include directory (instead of needing the correct subdir).
3. Moves onnx configuration includes to `onnx_conf.h`.
4. Uses the new `AppendExecutionProvider()` for DML (see the sketch after this list).
5. During the above I noticed DML has a `performance_preference` attribute that can be set to `high_performance`.
6. Windows binaries for onnx-trt are built with cuda.
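As a hedged sketch of items 4 and 5, registering the DML provider through the generic `AppendExecutionProvider()` call with the `performance_preference` option; the `device_id` value is just an example:

```cpp
#include <onnxruntime_cxx_api.h>
#include <string>
#include <unordered_map>

// Illustrative only: session options with the DML execution provider
// registered via the generic provider API.
Ort::SessionOptions MakeDmlSessionOptions() {
  Ort::SessionOptions options;
  std::unordered_map<std::string, std::string> dml_options{
      {"device_id", "0"},                              // example device
      {"performance_preference", "high_performance"},  // attribute from item 5
  };
  options.AppendExecutionProvider("DML", dml_options);
  return options;
}
```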
Using fp16 for the layernorm first stage (as DML does) is OK, except for some networks with ReLU^2 FFN activation, where the following layernorm overflowed.
Also fixes a bug when the FFN activation is different from the default one.
Extensive testing shows the alt_mish expansion has acceptable performance in both fp16 and bf16, with the main issue being that it goes to zero faster for negative inputs. The worst fp16 error was at -11.093750, where the returned value was 0 instead of -0.000168702, with the bf16 version very close to the direct calculation (in bf16).
Co-authored-by: Ankan Banerjee <[email protected]>
Co-authored-by: borg323 <[email protected]>
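For reference, a hedged sketch of the standard mish definition the expansion approximates, mish(x) = x * tanh(softplus(x)); this is not the alt_mish expansion itself, only the reference formula the error figures above are measured against (at x = -11.093750 it gives roughly -0.000168702):

```cpp
#include <cmath>

// Reference mish: softplus(x) = log(1 + e^x); log1p keeps precision when
// e^x is tiny, i.e. for strongly negative inputs.
inline float MishReference(float x) {
  return x * std::tanh(std::log1p(std::exp(x)));
}
```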
g++ generates bogus warnings about too large allocations when using LTO.
* build with cutlass by default
* build.cmd doesn't set cutlass_include any more
* disables cutlass build when no suitable cuda arch is available
* Use strongly typed onnx-trt graphs

Strongly typed networks prevent TensorRT from making bad type conversions when building engines, making it less likely that onnx-trt builds bad engines on Windows.

A strongly typed network also requires fewer different configurations to test, which reduces build times.
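Purely as an illustration of what a strongly typed network means at the TensorRT C++ API level (the actual change configures this through the onnxruntime TensorRT provider rather than by calling TensorRT directly), a hedged sketch assuming a TensorRT 10-style API:

```cpp
#include <NvInfer.h>

// Illustrative only: with kSTRONGLY_TYPED, tensor types come from the ONNX
// graph and the builder may not insert its own precision conversions.
nvinfer1::INetworkDefinition* CreateStronglyTypedNetwork(nvinfer1::IBuilder& builder) {
  const auto flags =
      1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);
  return builder.createNetworkV2(flags);
}
```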

* Control quantization with optimize option