forked from LeelaChessZero/lc0
From master #1
Open: ContradNamiseb wants to merge 212 commits into Bonan14:master from LeelaChessZero:master
Conversation
* Update circleci.
* Update appveyor android builds.
* Update appveyor to vs2019.
* Build with C++20.
* Fix for vs2019 breakage.
* Update circleci meson for macos.
The src directory is re-organized as follows:
1. The benchmark and lc0ctl directories are merged into tools.
2. All backends are moved into neural/backends.
3. The neural/onnx and neural/xla directories remain, but contain only the conversion functions and no backend code.
4. The neural/tables directory contains generic information that used to be in neural/shared (activation functions, policy maps etc.) and is used by both the backends and the onnx converter.
5. The rest of neural/shared is moved to neural/backends/shared/.
6. The rescorer is moved into the trainingdata directory.
* Generate AUTHORS file.
* Address review comment.
* Atomic vector (a sketch follows this list).
* Lockless MemCache.
* Lockless threadsafe wrapper.
* Bugfix.
* Build fix.
* Fixing elo-gaining bug.
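The PR text doesn't show the data structures themselves; as a rough illustration of the technique, a fixed-capacity "atomic vector" can support lock-free concurrent appends with a single atomic size counter. This is a minimal sketch under that assumption, with hypothetical names, not lc0's actual code:

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>

// Fixed-capacity vector with lock-free concurrent appends: writers reserve a
// slot by bumping an atomic counter. Readers must only scan after writers
// have finished and synchronized (e.g. after a batch is fully gathered).
template <typename T, std::size_t kCapacity>
class AtomicVector {
 public:
  // May be called from many threads at once; returns false when full.
  bool push_back(const T& value) {
    std::size_t idx = size_.fetch_add(1, std::memory_order_relaxed);
    if (idx >= kCapacity) {
      size_.fetch_sub(1, std::memory_order_relaxed);
      return false;
    }
    data_[idx] = value;
    return true;
  }

  std::size_t size() const {
    return std::min(size_.load(std::memory_order_acquire), kCapacity);
  }
  const T& operator[](std::size_t i) const { return data_[i]; }

 private:
  std::array<T, kCapacity> data_{};
  std::atomic<std::size_t> size_{0};
};
```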
* Reuse onnx input/output tensor memory.
* Bind cuda memory.
* Print some cuda device info.
* Use fixed device addresses in preparation for cuda graphs.
* Allow concurrent compute and DMA memory transfer.
* Delay network evaluation until the previous download has completed.
* Add expandPlanes cuda kernel to onnx.
* Remove extra waits from multi-step evaluation when using onnx-trt.
* Let Ort::IoBinding manage tensor object lifetime (a sketch of the binding setup follows this list).
* Improve the onnx-trt optimiser for fixed size inputs.
* Add GPU and board ID print to onnx-trt.
* Check if the CUDA version supports PCI information.
* Add warnings if CUDA support isn't enabled.
* Always optimise for the largest batch sizes. Very small batches would require a separate optimisation: optimising for batch size 1 as well costs too much performance, and a special optimisation for very small batches won't be a simple change, so it is left for a future change.
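Binding fixed device addresses through `Ort::IoBinding` is what keeps buffer locations stable across `Run()` calls, which the later CUDA graph work relies on. A minimal sketch of the idea; the tensor names, shape, and pre-allocated `dev_ptr` are illustrative assumptions, not lc0's actual graph:

```cpp
#include <onnxruntime_cxx_api.h>

#include <array>
#include <cstddef>
#include <cstdint>

// `session` is an Ort::Session for the model; `dev_ptr` is a fixed CUDA
// buffer allocated once at startup (e.g. with cudaMalloc).
void BindFixedDeviceMemory(Ort::Session& session, float* dev_ptr,
                           size_t element_count) {
  Ort::MemoryInfo cuda_mem("Cuda", OrtDeviceAllocator, /*device_id=*/0,
                           OrtMemTypeDefault);
  std::array<int64_t, 4> shape{1, 112, 8, 8};  // hypothetical input shape

  // Wrap the pre-allocated device buffer: ORT reads the input from this
  // exact address on every Run(), so no per-call allocation happens.
  Ort::Value input = Ort::Value::CreateTensor<float>(
      cuda_mem, dev_ptr, element_count, shape.data(), shape.size());

  Ort::IoBinding binding(session);
  binding.BindInput("input_planes", input);  // hypothetical tensor name
  binding.BindOutput("policy", cuda_mem);    // let ORT place output on device
  session.Run(Ort::RunOptions{}, binding);
}
```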
* Add basic NVTX tracing support. This adds a few useful basic annotations to Nsight Systems profiles, which helps compare CPU execution speed to GPU speed. Only the most likely suspects for causing issues are annotated (a sketch of the pattern follows this list).
* Add basic Perfetto support.
* Add a few more useful trace scopes to classic search.
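NVTX annotations are just paired push/pop calls around interesting scopes; Nsight Systems then shows them on the timeline next to the GPU work. A minimal sketch of the pattern; the RAII helper and scope name are illustrative, not lc0's actual tracing API:

```cpp
#include <nvToolsExt.h>  // NVTX header shipped with the CUDA toolkit

// Tiny RAII wrapper so a scope shows up as a named range in Nsight Systems.
class NvtxScope {
 public:
  explicit NvtxScope(const char* name) { nvtxRangePushA(name); }
  ~NvtxScope() { nvtxRangePop(); }
};

void EvaluateBatch() {
  NvtxScope trace("EvaluateBatch");  // visible on the CPU row of the profile
  // ... CPU-side work and kernel launches to compare against GPU activity ...
}
```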
* also avoid using direct mish operator
* Fix transposition kernel missing stream.
* Add separate download and upload streams to cuda.
* Add graph capture support to cuda backend (a sketch of the capture flow follows this list).
* Capture graphs when the network is constructed.
* Use CPU fp16 conversion in cuda backend.
* Add option to disable graphs in cuda backend.
* Fix windows narrowing conversion errors.
* Add missing stream arguments to cuda kernels.
* Make it easier to catch errors inside graph capture.
* Remove external events from cudnn graphs.
* Add debug symbols to CUDA objects.
* Add missing type conversions to GetP and GetM.
* Fix is_same type detection.
* Use nvcc in cudnn library path.
* Only use external events when CUDA >= 11.1.
* Only use cudaGraphUpload when CUDA is at least 11.1.
* No need to wait for upload when CUDA < 11.1.
* Always use CPU for cuda datatype conversions.
* Use GPU to generate offset pointers.
* Remove duplicated expandPlanes.
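Graph capture records the whole per-batch launch sequence once and then replays it with a single `cudaGraphLaunch`, removing per-kernel launch overhead. A minimal sketch of the capture/replay flow, assuming the network's work all runs on one stream (error handling omitted):

```cpp
#include <cuda_runtime.h>

// One-time capture of a batch's kernel launches, then cheap replay.
void CaptureAndReplay(cudaStream_t stream) {
  cudaGraph_t graph;
  cudaGraphExec_t graph_exec;

  // Record: enqueue the batch once while the stream is capturing. Nothing
  // executes yet; kernel launches and async copies are only recorded.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
  // ... enqueue the network's kernels and cudaMemcpyAsync calls on `stream`,
  //     using the fixed device addresses set up earlier ...
  cudaStreamEndCapture(stream, &graph);

  // Instantiate once (CUDA 12 signature; CUDA 11 takes extra error-reporting
  // arguments), then replay for every subsequent batch.
  cudaGraphInstantiate(&graph_exec, graph, /*flags=*/0);
  cudaGraphLaunch(graph_exec, stream);
  cudaStreamSynchronize(stream);
}
```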
* Fix onnx locking.
* trt_bf16_enable was first defined in onnxruntime 1.23.
* Move the lock outside the loop and simplify.
* Add threads and basic statistics to backendbench.
* Fix statistics calculation problems.
* Only sort values which were written.
* Also add fallback bit_cast implementation for gcc 10
* Add fp16 conversion test.
* Use bit_cast in fp16 conversions (a sketch of the fallback follows below).
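For reference, the usual shape of such a fallback is a `memcpy`-based `bit_cast` guarded by the feature-test macro, used e.g. to get at the bit pattern of a float during fp16 conversion. A sketch of the common pattern, not necessarily lc0's exact code:

```cpp
#include <cstdint>
#include <cstring>
#include <version>

#if defined(__cpp_lib_bit_cast)
#include <bit>
using std::bit_cast;
#else
// Fallback for toolchains (e.g. gcc 10) without std::bit_cast.
template <class To, class From>
To bit_cast(const From& from) {
  static_assert(sizeof(To) == sizeof(From), "sizes must match");
  To to;
  std::memcpy(&to, &from, sizeof(To));
  return to;
}
#endif

// Example use in an fp16 conversion: inspect the raw bits of a float
// without the undefined behavior of a reinterpret_cast or union pun.
uint32_t FloatBits(float f) { return bit_cast<uint32_t>(f); }
```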
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: mooskagh <[email protected]>
Co-authored-by: borg323 <[email protected]>
cudaSetDevice may block waiting for an unknown driver resource even when the GPU doesn't change. We can use cudaGetDevice to check whether the GPU has changed before calling cudaSetDevice. Fixes random GPU idle periods when using demux on Windows.
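In code the guard is a single comparison; a minimal sketch (the helper name is hypothetical):

```cpp
#include <cuda_runtime.h>

// cudaSetDevice can stall on driver resources even when it is a no-op,
// so only call it when the current device actually differs.
void SetDeviceIfChanged(int gpu_id) {
  int current = -1;
  cudaGetDevice(&current);
  if (current != gpu_id) cudaSetDevice(gpu_id);
}
```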
* Reduce lock contention in demux.
* Add preferred batch step to cuda and xla backends.
* Fix slow performance when the last backend gets only a partial split.
* Add onnx-migraphx. MIGraphX supports onnx model compilation. The onnx EP has poor handling of variable batch sizes, which causes major slowdowns. The workaround is to use fixed batch sizes, which requires user configuration to get the best performance.
1. Removes the need for `cpu_provider_factory.h`, as the cpu execution provider is always available.
2. Fixes the onnx include path to work with various installation options or directly with the build's top level include directory (instead of needing the correct subdir).
3. Moves onnx configuration includes to `onnx_conf.h`.
4. Uses the new `AppendExecutionProvider()` for DML (a sketch follows this list).
5. While doing the above, I noticed DML has a `performance_preference` attribute that can be set to `high_performance`.
6. Windows binaries for onnx-trt are built with cuda.
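With the generic `AppendExecutionProvider()` API in recent onnxruntime releases, registering DML with the preference set looks roughly like this. A sketch only; which options are accepted depends on the onnxruntime version:

```cpp
#include <onnxruntime_cxx_api.h>

#include <string>
#include <unordered_map>

Ort::SessionOptions MakeDmlSessionOptions() {
  Ort::SessionOptions opts;
  // Provider options are passed as string key/value pairs; DML exposes a
  // performance_preference knob that can be set to high_performance.
  std::unordered_map<std::string, std::string> dml_options{
      {"performance_preference", "high_performance"}};
  opts.AppendExecutionProvider("DML", dml_options);
  return opts;
}
```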
Using fp16 for the layernorm first stage (as DML does) is OK, except for some networks with ReLU^2 FFN activation, where the following layernorm overflowed. Also fixes a bug when the ffn activation is different from the default one.
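The overflow risk comes from the reduction itself: with ReLU^2 activations the inputs can be large, and fp16 tops out around 65504, so accumulating in half precision can saturate. A minimal sketch of the first (statistics) stage with float accumulation, assuming fp16 inputs; illustrative only, not the actual kernel:

```cpp
#include <cuda_fp16.h>

// First stage of layernorm: accumulate mean and variance in fp32 even when
// the activations are fp16, so large ReLU^2 outputs cannot overflow the
// half range during the reduction.
void LayerNormStats(const __half* x, int n, float* mean, float* var) {
  float sum = 0.f, sum_sq = 0.f;
  for (int i = 0; i < n; ++i) {
    float v = __half2float(x[i]);
    sum += v;
    sum_sq += v * v;
  }
  *mean = sum / n;
  *var = sum_sq / n - *mean * *mean;
}
```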
Extensive testing shows the alt_mish expansion has acceptable performance in both fp16 and bf16, with the main issue that it goes to zero faster for negative inputs. The worst fp16 error was at -11.093750, where the returned value was 0 instead of -0.000168702; the bf16 version was very close to the direct calculation (in bf16).
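For context, mish(x) = x * tanh(softplus(x)), and an expansion computes it from those primitives instead of a fused Mish operator. A minimal float sketch of the standard decomposition; the exact expansion the PR emits in the ONNX graph may differ:

```cpp
#include <cmath>

// mish(x) = x * tanh(softplus(x)), with softplus(x) = log(1 + e^x).
// In fp16/bf16, exp(x) underflows for large negative x, so the result
// reaches zero slightly earlier than the exact function -- matching the
// behavior described above (0 instead of -0.000168702 at x = -11.09375).
float AltMish(float x) {
  float softplus = std::log1p(std::exp(x));
  return x * std::tanh(softplus);
}
```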
Co-authored-by: Ankan Banerjee <[email protected]>
Co-authored-by: borg323 <[email protected]>
g++ generates bogus warnings about overly large allocations when using LTO.
* Build with cutlass by default.
* build.cmd doesn't set cutlass_include any more.
* Disable the cutlass build when no suitable cuda arch is available.
* Use strongly typed onnx-trt graphs. Strongly typed networks prevent TensorRT from making bad type conversions when building engines, which makes it less likely that onnx-trt builds bad engines on Windows. A strongly typed network also requires fewer configurations to test, which reduces build times (a sketch of the underlying TensorRT flag follows below).
* Control quantization with the optimize option.
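lc0 reaches TensorRT through the onnxruntime EP, but the underlying TensorRT concept is the `kSTRONGLY_TYPED` network creation flag (TensorRT >= 9), which forbids the builder from changing tensor types on its own. A rough sketch with the raw TensorRT C++ API; illustrative of the flag, not of how the onnx-trt EP is configured:

```cpp
#include <NvInfer.h>

#include <memory>

std::unique_ptr<nvinfer1::INetworkDefinition> CreateStronglyTypedNetwork(
    nvinfer1::IBuilder& builder) {
  // With kSTRONGLY_TYPED every tensor keeps the type stated in the model;
  // the builder may not insert its own fp32<->fp16 conversions, so it
  // cannot "optimize" its way into a badly converted engine.
  uint32_t flags = 1U << static_cast<uint32_t>(
                       nvinfer1::NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);
  return std::unique_ptr<nvinfer1::INetworkDefinition>(
      builder.createNetworkV2(flags));
}
```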