
Conversation

@stemann
Contributor

@stemann stemann commented Oct 22, 2023

Replaced CUDA binaries with source build for x86_64-linux-gnu

Also,

  • Using CUDA platform selection (CUDA platform augmentation code block).
  • Updated the recipe with respect to the Torch recipe (in particular regarding the CUDA approach).
  • Reduced depth of git submodule update to 1.

Edit, 2025-01-29: Split the recipe into separate recipes: ONNXRuntime and ONNXRuntime_CUDA.

@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch 2 times, most recently from 9d27905 to 62f7177 Compare October 22, 2023 21:21
@stemann stemann marked this pull request as ready for review October 22, 2023 21:37
@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch from 62f7177 to 49e97cc Compare October 22, 2023 21:55
@stemann stemann marked this pull request as draft October 22, 2023 22:25
@maleadt
Member

maleadt commented Oct 23, 2023

Lots of things have changed since then. It's possible to write more generic recipes that don't hard-code versions and dependencies; grep for CUDA.supported_platforms in other recipes.
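
Roughly, the pattern looks like this (a minimal sketch, not an actual recipe; apart from CUDA.supported_platforms, the helper names and kwargs here are assumptions to be checked against platforms/cuda.jl and existing CUDA recipes):

```julia
# Hypothetical skeleton of a generic CUDA recipe; all names/versions are placeholders.
using BinaryBuilder, Pkg

# Yggdrasil's shared CUDA platform helpers; the relative path depends on where
# the recipe lives in the tree.
include(joinpath(@__DIR__, "..", "..", "..", "platforms", "cuda.jl"))

name = "ONNXRuntime"
version = v"1.10.0"
sources = [GitSource("https://github.com/microsoft/onnxruntime.git", "<commit-sha>")]
script = raw"""
cd ${WORKSPACE}/srcdir/onnxruntime
# ... configure and build against the CUDA SDK provided as a build dependency ...
"""
products = [LibraryProduct("libonnxruntime", :libonnxruntime)]

# One platform per supported CUDA version, instead of hard-coding version/dependency pairs.
for platform in CUDA.supported_platforms()
    # Assumed helper returning the CUDA SDK dependencies matching this platform's
    # cuda tag; check platforms/cuda.jl for the exact name and signature.
    dependencies = CUDA.required_dependencies(platform)
    build_tarballs(ARGS, name, version, sources, script, [platform], products, dependencies;
                   augment_platform_block=CUDA.augment, julia_compat="1.6")
end
```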

@stemann
Contributor Author

stemann commented Oct 24, 2023

I see. Would it make sense to re-build TensorRT v8.0.1 (and v8.0.3) using the new approach? (That might entail also re-building CUDNN v8.2.1 and v8.2.4)

The TensorRT v8.0.1 artifacts for CUDA 11.0-11.3 were essentially just re-packagings of the same CUDA 11.3 binaries.

The approach would have to be adapted to also handle CUDA 10.2...

@maleadt
Member

maleadt commented Oct 24, 2023

Would it make sense to re-build TensorRT v8.0.1 (and v8.0.3) using the new approach?

I don't think so; the changes don't affect the generated JLLs much, since it's more about creating maintainable build recipes.

@stemann
Contributor Author

stemann commented Oct 24, 2023

But for the latest CUDNN releases, the artifacts are built with e.g. cuda="11.0", and not with a minor-version-specific platform specification...

Edit: I was thinking that if TensorRT were re-built with a CUDA 11 platform (i.e. compatible across CUDA 11, not limited to 11.0-11.3), then ONNXRuntime could be re-built using CUDA 11.4 (with a CUDA 11 platform spec.) - as specified in the table on https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements - instead of using CUDA 11.3 to build ONNXRuntime.

@maleadt
Member

maleadt commented Oct 25, 2023

[CUDNN] artifacts are built with e.g. cuda="11.0", and not minor-version-specific platform specification

Why isn't that OK? CUDNN binaries are only provided for 11.x and 12.x, so they are compatible across the entire release. Our platform augmentation uses semver-style matching, so it should pick up the correct binary.

When building new artifacts, the new CUDA.supported_platforms will generate a build for every minor version, because the actual compatibility can vary on a case-by-case basis (binaries built for CUDA 11.0 aren't guaranteed to be compatible on all of 11.x, even though it's often the case). But that doesn't require the same of all dependencies, e.g., ONNX with a cuda=11.2 artifact should work just fine with CUDNN cuda=11.0.
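
As a toy illustration of that matching rule (a sketch of the intended semantics, not the code in platforms/cuda.jl): an artifact tagged with an older minor of the same major is accepted on a newer host, but not the other way around.

```julia
# Toy model of the semver-style cuda tag matching described above; the real
# logic lives in Yggdrasil's platforms/cuda.jl and Base.BinaryPlatforms.
artifact_matches_host(artifact::VersionNumber, host::VersionNumber) =
    artifact.major == host.major && artifact.minor <= host.minor

artifact_matches_host(v"11.0", v"11.4")  # true:  CUDNN built for 11.0 on a CUDA 11.4 host
artifact_matches_host(v"11.2", v"11.0")  # false: built for 11.2, not guaranteed on 11.0
artifact_matches_host(v"11.4", v"12.0")  # false: majors differ
```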

@stemann
Contributor Author

stemann commented Oct 25, 2023

It might be just fine... I just saw that the more recent CUDNN builds had only one artifact per CUDA-platform-major version, so I was wondering whether the re-packaging of the TensorRT 11.3-binary should be re-done - to get a single artifact rather than four artifacts - which are all just copies of the same 11.3-binary. (edit: but I guess not?)

I did also see the Yggdrasil CUDA platform augmentation, and that there was some semver-based matching in platforms/cuda.jl#L51, but the code/logic was a little hard to follow (probably due to the Base.BinaryPlatforms comparison strategies)...
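
As far as I can tell, the hook works roughly like this (a sketch from my reading of the Base.BinaryPlatforms docs, not the actual cuda.jl logic; in particular the four-argument comparator signature is my recollection):

```julia
using Base.BinaryPlatforms

# Custom tag comparator; Base.BinaryPlatforms calls it with the two tag values
# plus two booleans saying which side(s) installed the custom strategy.
function cuda_compare(a::String, b::String, a_requested::Bool, b_requested::Bool)
    va, vb = VersionNumber(a), VersionNumber(b)
    # Orient so that `host` is the side that requested the custom comparison.
    host, artifact = a_requested ? (va, vb) : (vb, va)
    return artifact.major == host.major && artifact.minor <= host.minor
end

host = Platform("x86_64", "linux"; cuda="11.4")
Base.BinaryPlatforms.set_compare_strategy!(host, "cuda", cuda_compare)

artifact = Platform("x86_64", "linux"; cuda="11.0")
platforms_match(host, artifact)  # true under this toy strategy
```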

@maleadt
Member

maleadt commented Oct 25, 2023

I was wondering whether the re-packaging of the TensorRT 11.3-binary should be re-done - to get a single artifact rather than four artifacts - which are all just copies of the same 11.3-binary

If it's just copies, then not. If it's a build, it might be safest to do so, unless you've verified that the binaries are forwards compatible.

@stemann
Contributor Author

stemann commented Oct 25, 2023

I was wondering whether the re-packaging of the TensorRT 11.3-binary should be re-done - to get a single artifact rather than four artifacts - which are all just copies of the same 11.3-binary

If it's just copies, then not. If it's a build, it might be safest to do so, unless you've verified that the binaries are forwards compatible.

I'm not completely sure what you are saying here... can you clarify?

The CUDA 11.0-11.3 artifacts for the TensorRT v8.0.1 JLL are just copies of the same CUDA 11.3-binary, cf. https://github.com/JuliaPackaging/Yggdrasil/pull/4347/files#diff-0d6f0803279fc06a9c714fb026ee87dd7ab9150c559ee167e63b2fd34e3f5af1

@maleadt
Member

maleadt commented Oct 25, 2023

I meant that if you're just copying the same files and tagging them with different CUDA tags, you shouldn't, and it's fine to only attach the 11.3 tag. Only if you actually build object files from sources using nvcc and the CUDA SDK, it may be safest to build for each CUDA toolkit version. But you're not doing that; I was just adding some context.
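
In recipe terms (a hypothetical illustration, not the actual TensorRT build_tarballs.jl), the difference is:

```julia
using Base.BinaryPlatforms

# What the re-packaging effectively did: four platforms, one identical tarball each.
copies = [Platform("x86_64", "linux"; cuda=v) for v in ("11.0", "11.1", "11.2", "11.3")]

# What suffices when the files are literal copies: tag only the version the
# binary was actually built against, and let the platform matching select it.
single = Platform("x86_64", "linux"; cuda="11.3")
```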

@stemann
Contributor Author

stemann commented Oct 25, 2023

Right, then in that case the TensorRT v8.0 JLLs are a little bogus, I believe...

@stemann
Contributor Author

stemann commented Nov 3, 2023

Seems like this one will be easier to complete once CUDA 10.2 has been restored, cf. #7623.

@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch 10 times, most recently from 664a71e to e31476d Compare December 18, 2023 18:45
@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch 3 times, most recently from 5e485f1 to 15360c6 Compare November 20, 2024 00:25
@stemann
Contributor Author

stemann commented Nov 20, 2024

@maleadt So, I think I finally figured this out :-)

It's pretty obvious by now, but anyway: the CUDA platform tag must match when using dependencies in BinaryBuilder. E.g., if building ONNXRuntime for CUDA 11.4 and using CUDNN as a dependency, the CUDNN build version used must have a CUDA 11.4 artifact. And similarly for TensorRT and so on, down the dependency chain.

So I guess, given that it is often the case that binaries built for CUDA 11 are compatible across all CUDA 11 versions (and similarly for CUDA 12, I assume), it should be preferable (as long as the binary compatibility holds) to only provide a single CUDA 11 platform artifact, e.g. for CUDA 11.0.

This would mean that e.g. TensorRT v8.0.1 or v8.0.3 for which Nvidia provides a CUDA 11.3 artifact, should be packaged as a CUDA 11.0 artifact (and no 11.1-11.3 copies of this). This sounds a little dodgy, but I hope it's true :-)

This would mean that this package, ONNXRuntime, should target CUDA 11.0 (and use CUDNN + TensorRT for CUDA 11.0), but use the CUDA 11.4 SDK when building (to reflect the versions reportedly used: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements, https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#requirements).

Right? :-)

@maleadt
Member

maleadt commented Nov 20, 2024

This would mean that e.g. TensorRT v8.0.1 or v8.0.3 for which Nvidia provides a CUDA 11.3 artifact, should be packaged as a CUDA 11.0 artifact (and no 11.1-11.3 copies of this).

I don't think this is correct. It generally holds that if you compile code for toolkit 11.0 it will work on 11.x (although there have been exceptions!); however, if you compile for a more recent version, it may use features that are not available on earlier minor releases. So if TensorRT is distributed for CUDA 11.3, I wouldn't expect it to fully work on 11.2 or below.

@stemann
Contributor Author

stemann commented Nov 20, 2024

Ah yes - that makes sense.

OK, then I guess the correct way is probably to build for CUDA 11.3 in this case (as this branch currently does) - as that is the version that has artifacts for the build versions of CUDNN and TensorRT.
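
Concretely, I imagine the dependency block along these lines (a sketch only; the version pins are illustrative, taken from the ONNX Runtime requirement tables, not necessarily what the recipe ends up with):

```julia
using BinaryBuilder, Pkg

platform = Platform("x86_64", "linux"; cuda="11.3")

dependencies = [
    # Build-time CUDA SDK matching the platform's cuda tag (version illustrative).
    BuildDependency(PackageSpec(name="CUDA_full_jll", version=v"11.3.1")),
    # Runtime dependencies: each must provide an artifact for cuda="11.3".
    Dependency(PackageSpec(name="CUDNN_jll", version=v"8.2.4")),
    Dependency(PackageSpec(name="TensorRT_jll", version=v"8.0.1")),
]
```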

@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch from 4baaa78 to f0d30aa Compare December 3, 2024 23:48
@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch from ea818a5 to 739a15e Compare December 4, 2024 00:34
Fails with: nvcc error   : 'ptxas' died due to signal 11 (Invalid memory reference)
@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch from 739a15e to 0d51868 Compare December 4, 2024 00:36
@stemann stemann marked this pull request as ready for review December 4, 2024 00:48
@stemann
Contributor Author

stemann commented Dec 4, 2024

@maleadt This should now also be up to current standards (for a combined CPU+CUDA JLL recipe) - similar to #9785.

@stemann stemann marked this pull request as draft January 10, 2025 09:20
@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch from d954dd6 to 64e29bd Compare January 29, 2025 15:10
@stemann
Contributor Author

stemann commented Jan 29, 2025

@maleadt I have separated the CUDA parts into a separate ONNXRuntime_CUDA recipe - not using the cuda+none stuff, but still using lazy artifacts for ONNXRuntime_CUDA.

Could you check whether things are OK, and could you resolve those of your comments above that are "resolved" (i.e. postponed for the future)?

These recipes will need an update before being rolled into jw3126/ONNXRunTime.jl#19, as the Julia project is now using v1.20+, but I would still like to merge this v1.10 recipe first to get those parts done (including the Jetson CUDA 10.2 support).
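
For context, the CUDA side of the split ends up shaped roughly like this (a sketch building on the skeleton earlier in this thread; kwargs as I understand the usual CUDA-enabled recipes, not the literal merged build_tarballs.jl):

```julia
# Hypothetical tail of ONNXRuntime_CUDA/build_tarballs.jl, assuming name, version,
# sources, script, products and dependencies are defined above as usual:
# CUDA-only platforms, platform augmentation for the cuda tag, and lazy artifacts
# so the large per-CUDA tarballs are only downloaded when actually selected.
for platform in CUDA.supported_platforms()
    build_tarballs(ARGS, "ONNXRuntime_CUDA", version, sources, script,
                   [platform], products, dependencies;
                   augment_platform_block=CUDA.augment,
                   lazy_artifacts=true,
                   julia_compat="1.6")
end
```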

@stemann stemann marked this pull request as ready for review January 29, 2025 15:29
Member

@maleadt maleadt left a comment


I'll defer to you on this, assuming you verified it all works.

@giordano giordano merged commit d68155d into JuliaPackaging:master Feb 3, 2025
14 checks passed
@stemann
Contributor Author

stemann commented Feb 3, 2025

@giordano The register job failed for ONNXRuntime_CUDA - any idea what happened between my last commit and the merged commit?

@stemann stemann deleted the feature/onnxruntime_cuda_source_build branch February 3, 2025 22:49
@giordano
Member

giordano commented Feb 3, 2025

I tried to restart the job but it's still failing. No clue what the issue is. For the record:

ERROR: LoadError: Incomplete JLL release!  Could not find tarball for aarch64-linux-gnu-cxx03-cuda+10.2-cuda_platform+jetson

@stemann
Contributor Author

stemann commented Feb 3, 2025

OK, thanks.

Yeah, I saw that.

Attempting a rebuild in #10426
