Conversation

@ContradNamiseb

No description provided.

borg323 and others added 30 commits October 28, 2024 23:29
* update circleci
* update appveyor android builds
* update appveyor to vs2019
* build with c++20
* fix for vs2019 breakage
* update circleci meson for macos
The src directory is re-organized as follows:
1. The benchmark and lc0ctl directories are merged into tools.
2. All backends are moved into neural/backends.
3. The neural/onnx and neural/xla directories remain but only contain the conversion functions and not any backend code.
4. The neural/tables directory contains some generic information that used to be in neural/shared (activation functions, policy maps etc.) that are used by both backends and the onnx converter.
5. The rest of neural/shared is moved to neural/backends/shared/.
6. The rescorer is moved into the trainingdata directory.
* Generate AUTHORS file.

* Address review comment.
* Atomic vector

* Lockless MemCache

* Lockless threadsafe wrapper.
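The lockless pieces above are only named here; as a rough illustration of the "atomic vector" idea (a fixed-capacity buffer whose size is advanced atomically so writers need no mutex), here is a minimal hedged sketch. The class and member names are made up for illustration, and the sketch omits the publication handshake real code needs before readers may consume a freshly written slot.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Illustrative only: a fixed-capacity, append-only vector where PushBack
// claims a slot with an atomic fetch_add instead of taking a lock.
template <typename T, std::size_t kCapacity>
class AtomicVectorSketch {
 public:
  // Claims a slot and stores the value; returns the index, or nullopt if full.
  std::optional<std::size_t> PushBack(const T& value) {
    std::size_t idx = size_.fetch_add(1, std::memory_order_acq_rel);
    if (idx >= kCapacity) {
      size_.fetch_sub(1, std::memory_order_acq_rel);
      return std::nullopt;
    }
    data_[idx] = value;  // NOTE: real code must publish this write before readers use the slot.
    return idx;
  }

  std::size_t size() const {
    return std::min(size_.load(std::memory_order_acquire), kCapacity);
  }
  const T& operator[](std::size_t i) const { return data_[i]; }

 private:
  std::array<T, kCapacity> data_{};
  std::atomic<std::size_t> size_{0};
};
```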

* Bugfix

* Build fix.

* Fixing elo-gaining bug.
Menkib64 and others added 30 commits November 1, 2025 21:07
* reuse onnx input/output tensor memory

* bind cuda mem

* print some cuda device info

* use fixed device addresses in preparation for cuda graph

* Allow concurrent compute and DMA memory transfer

* Delay network evaluation until previous download has completed
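As a rough sketch of how concurrent compute and DMA transfer can be expressed (separate CUDA streams plus an event, so the compute stream waits only on its own upload rather than on all prior work), assuming placeholder stream, event, and buffer names rather than the backend's actual ones:

```cpp
#include <cuda_runtime.h>

// Illustrative only: enqueue the next batch's upload on a dedicated stream
// while the compute stream may still be busy with the previous batch.
void EnqueueEval(const float* host_in, float* dev_in, size_t bytes,
                 cudaStream_t upload_stream, cudaStream_t compute_stream,
                 cudaEvent_t upload_done) {
  // Asynchronous host-to-device copy on the upload stream.
  cudaMemcpyAsync(dev_in, host_in, bytes, cudaMemcpyHostToDevice, upload_stream);
  cudaEventRecord(upload_done, upload_stream);
  // The compute stream waits only for this transfer to complete.
  cudaStreamWaitEvent(compute_stream, upload_done, 0);
  // LaunchNetworkKernels(compute_stream);  // placeholder for the backend's kernels
}
```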

* Add expandPlanes cuda kernel to onnx

* Remove extra waits from multi step evaluation when using onnx-trt

* Let Ort::IoBinding manage tensor object lifetime
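A minimal hedged sketch of the Ort::IoBinding pattern these commits move to: pre-allocated buffers are wrapped once as tensors, bound by name, and reused across Run() calls. The tensor names and shapes below are placeholders (112 input planes and 1858 policy outputs are just the usual Lc0 sizes used for illustration).

```cpp
#include <onnxruntime_cxx_api.h>
#include <vector>

// Illustrative only: bind fixed input/output buffers so repeated evaluations
// reuse the same tensor memory instead of allocating per call.
void RunWithIoBinding(Ort::Session& session, Ort::MemoryInfo& mem_info,
                      float* input_buf, float* policy_buf, int batch) {
  std::vector<int64_t> in_shape{batch, 112, 8, 8};
  std::vector<int64_t> out_shape{batch, 1858};

  Ort::Value input = Ort::Value::CreateTensor<float>(
      mem_info, input_buf, static_cast<size_t>(batch) * 112 * 8 * 8,
      in_shape.data(), in_shape.size());
  Ort::Value policy = Ort::Value::CreateTensor<float>(
      mem_info, policy_buf, static_cast<size_t>(batch) * 1858,
      out_shape.data(), out_shape.size());

  Ort::IoBinding binding(session);
  binding.BindInput("/input/planes", input);    // placeholder tensor names
  binding.BindOutput("/output/policy", policy);

  // The binding keeps referencing the same addresses, so repeated Run()
  // calls avoid re-binding or copying tensors.
  session.Run(Ort::RunOptions{}, binding);
}
```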

* Improve onnx-trt optimiser for fixed size inputs

* Add GPU and board ID print to onnx-trt

* Check if the CUDA version supports PCI information.

* Add warnings if CUDA support isn't enabled.
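For the GPU/board ID print and the PCI-information check, a hedged sketch using the standard CUDA runtime calls (the output format is illustrative, not what the backend actually prints):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative only: print name, compute capability and PCI bus ID of a GPU.
void PrintCudaDeviceInfo(int device) {
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
    std::fprintf(stderr, "Warning: CUDA device query failed for GPU %d.\n", device);
    return;
  }
  char bus_id[32] = {};
  // Only meaningful when the CUDA runtime exposes PCI information.
  cudaDeviceGetPCIBusId(bus_id, static_cast<int>(sizeof(bus_id)), device);
  std::printf("GPU %d: %s, SM %d.%d, PCI %s\n", device, prop.name,
              prop.major, prop.minor, bus_id);
}
```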

* Always optimise to the largest batch sizes.

Very small batches require a separate optimisation: optimising for batch size 1 costs too much performance at larger sizes. Adding a special optimisation for very small batches isn't a simple change and should be left for a future change.
* Add basic NVTX tracing support

This adds a few useful basic annotations to Nsight Systems profiles. It helps compare CPU execution speed to GPU speed. Only the most likely suspects for causing issues are annotated.
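A minimal sketch of an NVTX range annotation of the kind being added; the scope name is illustrative, and the header is the nvtx3 one shipped with recent CUDA toolkits (older toolkits use `nvToolsExt.h`):

```cpp
#include <nvtx3/nvToolsExt.h>

// Illustrative only: a named range that shows up on this thread's row in
// Nsight Systems, making CPU-side cost visible next to the GPU timeline.
void EvaluateBatch() {
  nvtxRangePushA("EvaluateBatch");  // open a named range
  // ... enqueue GPU work here ...
  nvtxRangePop();                   // close the most recent range
}
```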

* Add basic Perfetto support

* Add a few more useful trace scopes to classic search
* Also avoid using the direct mish operator
* Fix transposition kernel missing stream

* Add separate download and upload streams to cuda

* Add graph capture support to cuda backend

* Capture graphs when the network is constructed
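A hedged sketch of the capture-then-replay pattern the graph commits describe: record the backend's kernel launches into a CUDA graph once at construction time, then replay it per evaluation. The enqueue call is a placeholder, and `cudaGraphInstantiateWithFlags` assumes CUDA 11.4+ (older toolkits use the 5-argument `cudaGraphInstantiate`).

```cpp
#include <cuda_runtime.h>

// Illustrative only: capture whatever the backend enqueues on `stream` into
// an executable graph that can later be replayed with cudaGraphLaunch.
cudaGraphExec_t CaptureEvalGraph(cudaStream_t stream) {
  cudaGraph_t graph = nullptr;
  cudaGraphExec_t exec = nullptr;

  cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
  // EnqueueNetworkKernels(stream);  // placeholder: the backend's kernel launches
  cudaStreamEndCapture(stream, &graph);

  cudaGraphInstantiateWithFlags(&exec, graph, 0);  // CUDA 11.4+
  cudaGraphDestroy(graph);
  return exec;
}

// Each evaluation then replays the captured work with a single call:
//   cudaGraphLaunch(exec, stream);
```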

* Use CPU fp16 conversion in cuda backend

* Add option to disable graphs in cuda backend

* Fix windows narrowing conversion errors

* Add missing stream arguments to cuda kernels

* Make it easier to detect errors inside graph capture

* Remove external events from cudnn graphs

* Add debug symbols to CUDA objects

* Add missing type conversions to GetP and GetM

* Fix is_same type detection

* Use nvcc in cudnn library path

* Only use external events when CUDA >= 11.1

* Only use cudaGraphUpload when CUDA is at least 11.1

* No need to wait for upload when CUDA < 11.1
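A small sketch of how the CUDA >= 11.1 guards can look at compile time, assuming a hypothetical helper name; `CUDART_VERSION` is 11010 for CUDA 11.1:

```cpp
#include <cuda_runtime.h>

// Illustrative only: prefetch the executable graph to the device when the
// toolkit supports it, so the first cudaGraphLaunch doesn't pay the upload.
void UploadGraphIfSupported(cudaGraphExec_t exec, cudaStream_t stream) {
#if CUDART_VERSION >= 11010
  cudaGraphUpload(exec, stream);
#else
  (void)exec;
  (void)stream;  // Older toolkits: nothing to do, launch uploads implicitly.
#endif
}
```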

* Always use CPU for cuda datatype conversions

* Use GPU to generate offset pointers

* Remove duplicated expandPlanes.
* fix onnx locking

* trt_bf16_enable first defined in onnxruntime 1.23

* move lock outside loop and simplify
* Add threads and basic statistics to backendbench

* Fix statistics calculation problems

* Only sort values which were written
* Also add fallback bit_cast implementation for gcc 10
* add fp16 conversion test
* use bit_cast in fp16 conversions
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: mooskagh <[email protected]>
Co-authored-by: borg323 <[email protected]>
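A hedged sketch of a `std::bit_cast` wrapper with a `memcpy` fallback for toolchains like gcc 10 whose standard library lacks it; the helper names are illustrative, not the ones used in the tree:

```cpp
#include <cstdint>
#include <cstring>
#include <version>
#if defined(__cpp_lib_bit_cast)
#include <bit>
#endif

// Illustrative only: bit-preserving cast with a portable fallback.
template <typename To, typename From>
To BitCastCompat(const From& from) {
#if defined(__cpp_lib_bit_cast)
  return std::bit_cast<To>(from);
#else
  static_assert(sizeof(To) == sizeof(From), "size mismatch");
  To to;
  std::memcpy(&to, &from, sizeof(To));  // no UB, compiles to a register move
  return to;
#endif
}

// Example: reinterpret the raw bits of a float, e.g. for fp16 conversions.
inline std::uint32_t FloatToBits(float f) { return BitCastCompat<std::uint32_t>(f); }
```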
cudaSetDevice may block waiting for an unknown driver resource even when the GPU doesn't change. We can use cudaGetDevice to check whether the GPU has changed before calling cudaSetDevice.

Fixes random GPU idle periods when using demux on Windows.
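A minimal sketch of the fix described above, assuming a hypothetical helper name:

```cpp
#include <cuda_runtime.h>

// Illustrative only: skip the potentially blocking cudaSetDevice call when
// the calling thread is already bound to the wanted GPU.
void EnsureDevice(int wanted) {
  int current = -1;
  cudaGetDevice(&current);  // cheap query of the thread's current GPU
  if (current != wanted) {
    cudaSetDevice(wanted);
  }
}
```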
* Reduce lock contention in demux
* Add preferred batch step to cuda and xla backends
* Fix slow performance when the last backend gets only a partial split
* Add onnx-migraphx

MIGraphX supports compiling ONNX models. The ONNX EP has poor handling of variable batch sizes, which causes major slowdowns. The workaround is to use fixed batch sizes, which requires user configuration to get the best performance.
1. Removes the need for `cpu_provider_factory.h` as the cpu execution provider is always available.
2. Fixes the onnx include path to work with various installation options or directly with the build's top-level include directory (instead of needing the correct subdir).
3. Moves onnx configuration includes to `onnx_conf.h`.
4. Uses the new `AppendExecutionProvider()` for DML (see the sketch after this list).
5. During the above I noticed DML has a `performance_preference` attribute that can be set to `high_performance`.
6. Windows binaries for onnx-trt are built with cuda.
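As a hedged sketch of items 4 and 5, registering the DML provider through the generic `AppendExecutionProvider()` call with the `performance_preference` option; the `device_id` value is just an example:

```cpp
#include <onnxruntime_cxx_api.h>
#include <string>
#include <unordered_map>

// Illustrative only: session options with the DML execution provider
// registered via the generic provider API.
Ort::SessionOptions MakeDmlSessionOptions() {
  Ort::SessionOptions options;
  std::unordered_map<std::string, std::string> dml_options{
      {"device_id", "0"},                              // example device
      {"performance_preference", "high_performance"},  // attribute from item 5
  };
  options.AppendExecutionProvider("DML", dml_options);
  return options;
}
```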
Using fp16 for the layernorm first stage (as DML does) is OK, except for some networks with ReLU^2 FFN activation, where the following layernorm overflowed.
Also fixes a bug when the FFN activation is different from the default one.
Extensive testing shows the alt_mish expansion has acceptable performance in both fp16 and bf16, with the main issue being that it goes to zero faster for negative inputs. The worst fp16 error was at -11.093750, where the returned value was 0 instead of -0.000168702, with the bf16 version very close to the direct calculation (in bf16).
Co-authored-by: Ankan Banerjee <[email protected]>
Co-authored-by: borg323 <[email protected]>
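For reference, a hedged sketch of the standard mish definition the expansion approximates, mish(x) = x * tanh(softplus(x)); this is not the alt_mish expansion itself, only the reference formula the error figures above are measured against (at x = -11.093750 it gives roughly -0.000168702):

```cpp
#include <cmath>

// Reference mish: softplus(x) = log(1 + e^x); log1p keeps precision when
// e^x is tiny, i.e. for strongly negative inputs.
inline float MishReference(float x) {
  return x * std::tanh(std::log1p(std::exp(x)));
}
```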
g++ generates bogus warnings about too large allocations when using LTO.
* build with cutlass by default
* build.cmd doesn't set cutlass_include any more
* disables cutlass build when no suitable cuda arch is available
* Use strongly typed onnx-trt graphs

Strongly typed networks prevent TensorRT from making bad type conversions when building engines, making it less likely that onnx-trt builds bad engines on Windows.

A strongly typed network also requires fewer different configurations to test, which reduces build times.
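Purely as an illustration of what a strongly typed network means at the TensorRT C++ API level (the actual change configures this through the onnxruntime TensorRT provider rather than by calling TensorRT directly), a hedged sketch assuming a TensorRT 10-style API:

```cpp
#include <NvInfer.h>

// Illustrative only: with kSTRONGLY_TYPED, tensor types come from the ONNX
// graph and the builder may not insert its own precision conversions.
nvinfer1::INetworkDefinition* CreateStronglyTypedNetwork(nvinfer1::IBuilder& builder) {
  const auto flags =
      1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);
  return builder.createNetworkV2(flags);
}
```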

* Control quantization with optimize option