Conversation

@ContradNamiseb

No description provided.

ContradNamiseb and others added 30 commits July 6, 2025 11:49
…imit was set to 0, moving 'set node_limit to 4000000000 if it wasn't initialized' from stopper.h to common.cc (#2056)

fix for go nodes 0
OpenCL extensions can make incompatible changes, which would require
careful version checks in the C++ bindings. Unfortunately, the C++
bindings do not maintain backward compatibility.

Many systems ship matching C++ and C bindings that are known to work
together, so we can use the system header if it exists. The CL directory
is the standard location; macOS uses a custom OpenCL directory, but it
may only include the C headers.
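A minimal sketch of that header selection, assuming a C++17 toolchain; the exact include paths and the bundled fallback name are assumptions, not the project's actual build logic.

```cpp
// Prefer a matching system C++ binding when it exists; macOS only ships the
// C headers in its custom OpenCL directory. The fallback header is hypothetical.
#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120

#if defined(__has_include)
#  if __has_include(<CL/opencl.hpp>)
#    define HAS_SYSTEM_CL_HPP 1
#  endif
#endif

#if defined(HAS_SYSTEM_CL_HPP)
#  include <CL/opencl.hpp>           // System C++ binding from the standard CL/ directory.
#elif defined(__APPLE__)
#  include <OpenCL/cl.h>             // Apple's OpenCL framework provides only C headers.
#else
#  include "third_party/opencl.hpp"  // Hypothetical bundled copy of the C++ binding.
#endif
```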
* Move bitboard expansion to gpu.

* Fix optimized gpu code path for bitboard expansion.

* Fix broadcast to work correctly.

* Remove unneeded code.

* Fix warnings.

* Minor fixes

* Remove forwardEvalLegacy() and other alternate codepaths except the bitwiseANDWithPrimaryTensor

* Remove unused variables and memory.
Also adds onnxruntime version to cache prefix and removes random name from onnx model.
* Move bitboard expansion to gpu.

* Fix optimized gpu code path for bitboard expansion.

* Fix broadcast to work correctly.

* Remove unneeded code.

* Fix warnings.

* Minor nit.

* Minor fixes

* Debug stuff

* Move attention policy promo offset calculation to gpu.

* Move attention and convolution policy mapping to gpu.

* Remove policymapping from legacy codepath.

* Remove forwardEvalLegacy() and other alternate codepaths except the bitwiseANDWithPrimaryTensor

* Remove unused variables and memory.

* Update change
ROCm version 7 is under development and will include hipBLAS 3. The new
API uses different enums for the datatype and compute type. We can update
our code to the new API, with macros mapping the new compute types to the
old values for version 2 support.
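A hypothetical sketch of that macro-mapping idea: the code is written against the hipBLAS 3 compute-type names, and a shim maps them back to the old enum values when building against hipBLAS 2. The version macro and enum names here are assumptions about the hipBLAS headers.

```cpp
#include <hipblas/hipblas.h>

// When building against hipBLAS 2, map the new compute-type names onto the
// old datatype enum so the rest of the code can use the hipBLAS 3 names.
// (The version macro and enum names are assumptions.)
#if HIPBLAS_VERSION_MAJOR < 3
#define HIPBLAS_COMPUTE_32F HIPBLAS_R_32F
#define HIPBLAS_COMPUTE_16F HIPBLAS_R_16F
#endif
```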
-add auto threads and batching
-add 'show platforms' to see the platforms supported by each device
-add enhancements to the device info output
…ct usage (#2235)

Co-authored-by: mooskagh <[email protected]>
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Menkib64 and others added 30 commits November 1, 2025 21:07
* reuse onnx input/output tensor memory

* bind cuda mem

* print some cuda device info

* use fixed device addresses in preparation for cuda graph

* Allow concurrent compute and DMA memory transfer

* Delay network evaluation until previous download has completed

* Add expandPlanes cuda kernel to onnx

* Remove extra waits from multi step evaluation when using onnx-trt

* Let Ort::IoBinding manage tensor object lifetime (see the sketch after this list)

* Improve onnx-trt optimiser for fixed size inputs

* Add GPU and board ID print to onnx-trt

* Check if CUDA version supports PCI information.

* Add warnings if CUDA support isn't enabled.

* Always optimise to the largest batch sizes.

Very small batches require a separate optimisation; optimising for batch
size 1 costs too much performance for small sizes. Adding a special
optimisation for very small batches won't be a simple change and should
be left for a future change.
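Picking up the Ort::IoBinding item referenced above, here is a minimal sketch of reusing pre-allocated input/output tensors across evaluations, assuming the ONNX Runtime C++ API; the tensor names, shapes, and batch size are placeholders rather than the backend's actual values.

```cpp
#include <onnxruntime_cxx_api.h>

#include <array>
#include <cstdint>
#include <vector>

void RunBound(Ort::Session& session) {
  Ort::MemoryInfo cpu_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

  // Fixed buffers that stay alive for the lifetime of the binding, so the
  // same addresses are reused on every evaluation.
  std::vector<float> input(256 * 112 * 64);
  std::vector<float> policy(256 * 1858);
  std::array<int64_t, 4> in_shape{256, 112, 8, 8};
  std::array<int64_t, 2> out_shape{256, 1858};

  Ort::Value in_tensor = Ort::Value::CreateTensor<float>(
      cpu_info, input.data(), input.size(), in_shape.data(), in_shape.size());
  Ort::Value out_tensor = Ort::Value::CreateTensor<float>(
      cpu_info, policy.data(), policy.size(), out_shape.data(), out_shape.size());

  Ort::IoBinding binding(session);
  binding.BindInput("/input/planes", in_tensor);    // Placeholder tensor names.
  binding.BindOutput("/output/policy", out_tensor);

  session.Run(Ort::RunOptions{nullptr}, binding);   // Results land in `policy`.
}
```

Keeping buffer addresses fixed like this is in the spirit of the "use fixed device addresses in preparation for cuda graph" item above.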
* Add basic NVTX tracing support

This adds a few useful basic annotations to Nsight Systems profiles,
which helps compare CPU execution speed to GPU speed. Only a few
annotations are added, at the most likely suspects for causing issues
(a minimal example follows this list).

* Add basic Perfetto support

* Add a few more useful trace scopes to classic search
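As a minimal example of the kind of NVTX annotation described above, assuming the classic nvToolsExt API; the RAII wrapper name is hypothetical.

```cpp
#include <nvToolsExt.h>

// RAII scope that shows up as a named range in Nsight Systems timelines.
class NvtxScope {
 public:
  explicit NvtxScope(const char* name) { nvtxRangePushA(name); }
  ~NvtxScope() { nvtxRangePop(); }
};

void EvaluateBatch() {
  NvtxScope scope("EvaluateBatch");  // Range covers the whole evaluation.
  // ... enqueue GPU work and wait for results ...
}
```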
* also avoid using direct mish operator
* Fix transposition kernel missing stream

* Add separate download and upload streams to cuda

* Add graph capture support to cuda backend

* Capture graphs when the network is constructed (a small capture sketch follows this list)

* Use CPU fp16 conversion in cuda backend

* Add option to disable graphs in cuda backend

* Fix windows narrowing conversion errors

* Add missing stream arguments to cuda kernels

* Make it easier to catch errors inside graph capture

* Remove external events from cudnn graphs

* Add debug symbols to CUDA objects

* Add missing type conversions to GetP and GetM

* Fix is_same type detection

* Use nvcc in cudnn library path

* Only use external events when CUDA >= 11.1

* Only use cudaGraphUpload when CUDA is at least 11.1

* No need to wait for upload when CUDA < 11.1

* Always use CPU for cuda datatype conversions

* Use GPU to generate offset pointers

* Remove duplicated expandPlanes.
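As referenced in the graph-capture item above, here is a minimal sketch of stream-capture-based graph replay; `EnqueueNetworkEval` is a hypothetical stand-in for enqueuing one full network evaluation, and this is illustrative rather than the backend's actual capture logic.

```cpp
#include <cuda_runtime.h>

void EnqueueNetworkEval(cudaStream_t stream);  // Hypothetical: enqueues one full eval.

cudaGraphExec_t CaptureEval(cudaStream_t stream) {
  cudaGraph_t graph = nullptr;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
  EnqueueNetworkEval(stream);              // Work is recorded, not executed.
  cudaStreamEndCapture(stream, &graph);

  cudaGraphExec_t exec = nullptr;
  cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12 signature; older toolkits differ.
  cudaGraphDestroy(graph);
  return exec;
}

// Replaying the captured evaluation later is a single call:
//   cudaGraphLaunch(exec, stream);
```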
* fix onnx locking

* trt_bf16_enable first defined in onnxruntime 1.23

* move lock outside loop and simplify
* Add threads and basic statistics to backendbench

* Fix statistics calculation problems

* Only sort values which were written
* Also add a fallback bit_cast implementation for gcc 10 (see the sketch below)
* add fp16 conversion test
* use bit_cast in fp16 conversions
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: mooskagh <[email protected]>
Co-authored-by: borg323 <[email protected]>
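A sketch of the gcc 10 fallback mentioned above: std::bit_cast is C++20 (available in gcc 11+), so older compilers get a memcpy-based substitute; the helper at the end is illustrative.

```cpp
#include <cstdint>
#include <cstring>
#if __has_include(<version>)
#include <version>
#endif

#if defined(__cpp_lib_bit_cast)
#include <bit>
using std::bit_cast;
#else
// Pre-C++20 fallback: memcpy is the well-defined way to reinterpret bits.
template <typename To, typename From>
To bit_cast(const From& from) {
  static_assert(sizeof(To) == sizeof(From), "bit_cast requires equal sizes");
  To to;
  std::memcpy(&to, &from, sizeof(To));
  return to;
}
#endif

// Example use in an fp32 <-> fp16 conversion helper: grab the raw float bits.
inline uint32_t FloatBits(float f) { return bit_cast<uint32_t>(f); }
```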
cudaSetDevice may block waiting for an unknown driver resource even when
the GPU doesn't change. We can use cudaGetDevice to check whether the GPU
has changed before calling cudaSetDevice.

Fixes random GPU idle periods when using demux on Windows.
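A minimal sketch of that check (CUDA runtime API):

```cpp
#include <cuda_runtime.h>

void EnsureDevice(int gpu_id) {
  int current = -1;
  cudaGetDevice(&current);
  if (current != gpu_id) {
    cudaSetDevice(gpu_id);  // Only pay the potentially blocking call on a real switch.
  }
}
```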
* Reduce lock contention in demux
* Add preferred batch step to cuda and xla backends
* Fix slow performance when the last backend gets only a partial split
* Add onnx-migraphx

MIGraphX supports ONNX model compilation. The ONNX EP has poor handling
of variable batch sizes, which causes major slowdowns. The workaround is
to use fixed batch sizes, which requires user configuration to get the
best performance.
1. Removes the need for `cpu_provider_factory.h`, as the CPU execution provider is always available.
2. Fixes the onnx include path to work with various installation options or directly with the build top-level include directory (instead of needing the correct subdir).
3. Moves onnx configuration includes to `onnx_conf.h`.
4. Uses the new `AppendExecutionProvider()` for DML (see the sketch after this list).
5. While doing the above, I noticed DML has a `performance_preference` attribute that can be set to `high_performance`.
6. Windows binaries for onnx-trt are built with cuda.
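A minimal sketch of items 4 and 5, assuming the generic `Ort::SessionOptions::AppendExecutionProvider()` overload that takes provider options as string key/value pairs:

```cpp
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions MakeDmlOptions() {
  Ort::SessionOptions options;
  // Register the DML execution provider and prefer the high-performance GPU.
  options.AppendExecutionProvider(
      "DML", {{"performance_preference", "high_performance"}});
  return options;
}
```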
Using fp16 for the first layernorm stage (as DML does) is OK, except for some networks with ReLU^2 FFN activation, where the following layernorm overflowed.
Also fixes a bug when the FFN activation is different from the default one.
Extensive testing shows the alt_mish expansion has acceptable performance in both fp16 and bf16, with the main issue that it goes to zero faster for negative inputs. The worst fp16 error was at -11.093750, where the returned value was 0 instead of -0.000168702, with the bf16 version very close to the direct calculation (in bf16).
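For illustration, one algebraically equivalent way to expand mish without tanh/log is shown below; whether this matches the repository's alt_mish exactly is an assumption.

```cpp
#include <cmath>

// Direct form: mish(x) = x * tanh(ln(1 + e^x)).
float MishDirect(float x) { return x * std::tanh(std::log1p(std::exp(x))); }

// Equivalent expansion that avoids tanh/log: with n = (1 + e^x)^2,
// mish(x) = x * (n - 1) / (n + 1).  (Hypothetical alt_mish stand-in.)
float MishAlt(float x) {
  float e = std::exp(x);
  float n = (1.0f + e) * (1.0f + e);
  return x * (n - 1.0f) / (n + 1.0f);
}
```

In reduced precision, 1 + e^x rounds to 1 once e^x is small enough, so n - 1 becomes exactly 0 and the result collapses to 0, which is consistent with the worst-case fp16 value quoted above.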
Co-authored-by: Ankan Banerjee <[email protected]>
Co-authored-by: borg323 <[email protected]>
g++ generates bogus warnings about too large allocations when using LTO.
* build with cutlass by default
* build.cmd doesn't set cutlass_include any more
* disables cutlass build when no suitable cuda arch available
* Use strongly typed onnx-trt graphs

Strongly typed networks prevent TensorRT from making bad type conversions
when building engines, making it less likely that onnx-trt builds bad
engines on Windows.

A strongly typed network also requires fewer configurations to test,
which reduces build times (a sketch of the TensorRT-level flag follows below).

* Control quantization with optimize option
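For context, a sketch of what "strongly typed" means at the TensorRT API level; the onnx-trt backend reaches this through execution-provider options rather than this direct builder call (an assumption here), and the flag name assumes TensorRT 9 or newer.

```cpp
#include <NvInfer.h>

nvinfer1::INetworkDefinition* CreateStronglyTypedNetwork(nvinfer1::IBuilder& builder) {
  // With a strongly typed network, tensor types come from the ONNX model and
  // TensorRT may not reassign them while building the engine.
  const auto flags =
      1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);
  return builder.createNetworkV2(flags);
}
```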