[ONNXRuntime] Replaced CUDA binaries with source build for x86_64-linux-gnu #7561
Conversation
Lots of things have changed since then. It's possible to write more generic recipes that don't hard-code versions and dependencies; grep for
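As a rough illustration, a more generic recipe can parameterize over CUDA versions rather than hard-coding one; the version list and dependency names below are placeholders, not what the Yggdrasil recipes actually use:

```julia
# Illustrative sketch only: expand the platform list over several CUDA versions
# instead of hard-coding a single one.  Versions and dependencies are placeholders.
using BinaryBuilder, Pkg

cuda_versions = [v"11.3", v"11.4"]   # assumed list of supported toolkits

platforms = [Platform("x86_64", "linux"; cuda = "$(cu.major).$(cu.minor)")
             for cu in cuda_versions]

dependencies = [
    Dependency(PackageSpec(name = "CUDNN_jll")),
    Dependency(PackageSpec(name = "TensorRT_jll")),
]
```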
I see. Would it make sense to re-build TensorRT v8.0.1 (and v8.0.3) using the new approach? (That might entail also re-building CUDNN v8.2.1 and v8.2.4.) The TensorRT v8.0.1 artifacts for CUDA 11.0-11.3 were essentially just re-packagings of the same CUDA (11.3) binaries. The approach would have to be adapted to also handle CUDA 10.2...
I don't think so; the changes don't affect the generated JLLs much, it's more about creating maintainable build recipes.
But for the latest CUDNN releases, the artifacts are built with e.g.
Edit: I was thinking that if TensorRT were re-built with a CUDA 11 platform (compatibility) spec (not limited to 11.0-11.3), then ONNXRuntime could be re-built using CUDA 11.4 (with the CUDA 11 platform spec) - as specified in the table at https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements - instead of using CUDA 11.3 to build ONNXRuntime.
Why isn't that OK? CUDNN binaries are only provided for 11.x and 12.x, so they are compatible across the entire release. Our platform augmentation uses semver-style matching, so it should pick up the correct binary. When building new artifacts, the new
It might be just fine... I just saw that the more recent CUDNN builds have only one artifact per CUDA platform major version, so I was wondering whether the re-packaging of the TensorRT 11.3 binary should be re-done - to get a single artifact rather than four artifacts that are all just copies of the same 11.3 binary. (edit: but I guess not?) I did also see the Yggdrasil CUDA platform augmentation, and that there is some semver-based matching in platforms/cuda.jl#L51, but the code/logic was a little hard to follow (probably due to the logic of
If it's just copies, then not. If it's a build, it might be safest to do so, unless you've verified that the binaries are forwards compatible.
I'm not completely sure what you are saying here... can you clarify? The CUDA 11.0-11.3 artifacts for the TensorRT v8.0.1 JLL are just copies of the same CUDA 11.3-binary, cf. https://github.com/JuliaPackaging/Yggdrasil/pull/4347/files#diff-0d6f0803279fc06a9c714fb026ee87dd7ab9150c559ee167e63b2fd34e3f5af1 |
I meant that if you're just copying the same files and tagging them with different CUDA tags, you shouldn't, and it's fine to only attach the
Right, in that case the TensorRT v8.0 JLLs are a little bogus, I believe...
Seems like this one will be easier to complete once CUDA 10.2 has been restored, cf. #7623. |
@maleadt So, I think I finally figured this out :-) It's pretty obvious by now, but anyway: the CUDA platform tags must match when using dependencies in BinaryBuilder. E.g. if building ONNXRuntime for CUDA 11.4 with CUDNN as a dependency, the CUDNN build used must have a CUDA 11.4 artifact - and similarly for TensorRT and so on, down the dependency chain.

So, given that binaries built for CUDA 11 are often compatible across all CUDA 11 versions (and similarly for CUDA 12, I assume), it would be preferable (as long as that binary compatibility holds) to provide only a single CUDA 11 artifact, e.g. for CUDA 11.0. This would mean that e.g. TensorRT v8.0.1 or v8.0.3, for which Nvidia provides a CUDA 11.3 binary, should be packaged as a CUDA 11.0 artifact (with no 11.1-11.3 copies of it). This sounds a little dodgy, but I hope it's true :-)

It would also mean that this package, ONNXRuntime, should target CUDA 11.0 (and use CUDNN + TensorRT for CUDA 11.0), but use the CUDA 11.4 SDK when building, to reflect the versions reportedly used: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements, https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#requirements. Right? :-)
I don't think this is correct. It generally holds that if you compile code for toolkit 11.0 it will work on 11.x (although there have been exceptions!); however, if you compile for a more recent version, it may use features that are not available on earlier minor releases. So if TensorRT is distributed for CUDA 11.3, I wouldn't expect it to fully work on 11.2 or below.
Ah yes - that makes sense. OK, then I guess the correct way is probably to build for CUDA 11.3 in this case (as this branch currently does), as that is the version that has artifacts for the build versions of CUDNN and TensorRT.
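In recipe terms, the constraint settled on above looks roughly like this; the version numbers are examples rather than pinned recipe contents:

```julia
# Sketch of the dependency constraint: targeting a CUDA 11.3 platform only
# works if every CUDA-dependent dependency has an artifact carrying the same
# CUDA tag.  Versions below are examples, not the recipe's actual pins.
using BinaryBuilder, Pkg

platforms = [Platform("x86_64", "linux"; cuda = "11.3")]

dependencies = [
    # Builds assumed to provide CUDA 11.3 artifacts; a CUDA 11.4 target would
    # instead require dependency builds with 11.4-tagged artifacts.
    Dependency(PackageSpec(name = "CUDNN_jll", version = v"8.2.4")),
    Dependency(PackageSpec(name = "TensorRT_jll", version = v"8.0.1")),
]
```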
Fails with: nvcc error : 'ptxas' died due to signal 11 (Invalid memory reference)
@maleadt I have separated the CUDA parts into a separate ONNXRuntime_CUDA recipe - not using the cuda+none stuff, but still using lazy artifacts for ONNXRuntime_CUDA. Could you check if things are OK, and could you resolve those of your comments above that are "resolved" (i.e. postponed for the future)? These recipes will need an update before being rolled into jw3126/ONNXRunTime.jl#19, as the Julia project is now using v1.20+, but I would still like to merge this v1.10 build first to get those parts done (including the Jetson CUDA 10.2 support).
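In outline, the split amounts to a plain ONNXRuntime recipe plus an ONNXRuntime_CUDA recipe whose build_tarballs call marks its artifacts as lazy; a rough sketch, where the placeholder sources/script/products stand in for the real recipe contents and the lazy_artifacts keyword is assumed to be the standard BinaryBuilder mechanism used here:

```julia
# Sketch of the CUDA-enabled recipe's entry point: same structure as the plain
# ONNXRuntime recipe, but with CUDA-tagged platforms and lazy artifacts so the
# large GPU binaries are only downloaded on demand.  Placeholders throughout.
using BinaryBuilder, Pkg

sources      = []                                      # same sources as the plain recipe
script       = raw"""echo "build steps go here" """    # placeholder build script
platforms    = [Platform("x86_64", "linux"; cuda = "11.3")]
products     = [LibraryProduct("libonnxruntime_providers_cuda", :libonnxruntime_providers_cuda)]
dependencies = [Dependency(PackageSpec(name = "CUDNN_jll", version = v"8.2.4"))]

build_tarballs(ARGS, "ONNXRuntime_CUDA", v"1.10.0", sources, script,
               platforms, products, dependencies;
               lazy_artifacts = true)   # mark the generated JLL's artifacts as lazy
```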
maleadt left a comment:
I'll defer to you on this, assuming you verified it all works.
@giordano The register job failed for ONNXRuntime_CUDA - any idea what happened between my last commit and the merged commit?
I tried to restart the job but it's still failing. No clue what the issue is. For future record:
OK, thanks. Yeah, I saw that. Attempting a rebuild in #10426 |
Replaced CUDA binaries with source build for x86_64-linux-gnu
Also,
Edit, 2025-01-29: Split the recipe into separate ONNXRuntime and ONNXRuntime_CUDA recipes
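For reference, a rough sketch of the shape of the source build, ignoring the ONNXRuntime/ONNXRuntime_CUDA split for brevity; the commit hash, CMake flags, and dependency versions are placeholders/assumptions rather than the recipe's actual contents:

```julia
# Sketch of a source-build recipe for ONNXRuntime with CUDA/TensorRT support.
# All specifics (commit, flags, versions) are illustrative placeholders.
using BinaryBuilder, Pkg

sources = [
    GitSource("https://github.com/microsoft/onnxruntime.git",
              "0000000000000000000000000000000000000000"),  # placeholder commit
]

script = raw"""
cd ${WORKSPACE}/srcdir/onnxruntime
cmake -B build -S cmake \
    -DCMAKE_INSTALL_PREFIX=${prefix} \
    -DCMAKE_TOOLCHAIN_FILE=${CMAKE_TARGET_TOOLCHAIN} \
    -DCMAKE_BUILD_TYPE=Release \
    -Donnxruntime_BUILD_SHARED_LIB=ON \
    -Donnxruntime_USE_CUDA=ON \
    -Donnxruntime_USE_TENSORRT=ON \
    -Donnxruntime_CUDNN_HOME=${prefix}
cmake --build build --parallel ${nproc}
cmake --install build
"""

platforms = [Platform("x86_64", "linux"; cuda = "11.3")]
products  = [LibraryProduct("libonnxruntime", :libonnxruntime)]

dependencies = [
    Dependency(PackageSpec(name = "CUDNN_jll", version = v"8.2.4")),
    Dependency(PackageSpec(name = "TensorRT_jll", version = v"8.0.1")),
    BuildDependency(PackageSpec(name = "CUDA_full_jll", version = v"11.3.1")),
]

build_tarballs(ARGS, "ONNXRuntime", v"1.10.0", sources, script,
               platforms, products, dependencies)
```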