
Conversation

@stemann
Contributor

@stemann stemann commented Oct 22, 2023

Replaced CUDA binaries with source build for x86_64-linux-gnu

Also,

  • Using CUDA platform selection (CUDA platform augmentation code block).
  • Updated the recipe with respect to the Torch recipe (in particular regarding the CUDA approach).
  • Reduced depth of git submodule update to 1.

Edit, 2025-01-29: Split the recipe into separate recipes: ONNXRuntime and ONNXRuntime_CUDA.

@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch 2 times, most recently from 9d27905 to 62f7177 Compare October 22, 2023 21:21
@stemann stemann marked this pull request as ready for review October 22, 2023 21:37
@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch from 62f7177 to 49e97cc Compare October 22, 2023 21:55
@stemann stemann marked this pull request as draft October 22, 2023 22:25
@maleadt
Member

maleadt commented Oct 23, 2023

Lots of things have changed since then. It's possible to write more generic recipes that don't hard-code versions and dependencies; grep for CUDA.supported_platforms in other recipes.
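
Roughly, the pattern looks like this (a minimal sketch, not an actual recipe; apart from CUDA.supported_platforms, the helper names and kwargs here are assumptions to be checked against platforms/cuda.jl and existing CUDA recipes):

```julia
# Hypothetical skeleton of a generic CUDA recipe; all names/versions are placeholders.
using BinaryBuilder, Pkg

# Yggdrasil's shared CUDA platform helpers; the relative path depends on where
# the recipe lives in the tree.
include(joinpath(@__DIR__, "..", "..", "..", "platforms", "cuda.jl"))

name = "ONNXRuntime"
version = v"1.10.0"
sources = [GitSource("https://github.com/microsoft/onnxruntime.git", "<commit-sha>")]
script = raw"""
cd ${WORKSPACE}/srcdir/onnxruntime
# ... configure and build against the CUDA SDK provided as a build dependency ...
"""
products = [LibraryProduct("libonnxruntime", :libonnxruntime)]

# One platform per supported CUDA version, instead of hard-coding version/dependency pairs.
for platform in CUDA.supported_platforms()
    # Assumed helper returning the CUDA SDK dependencies matching this platform's
    # cuda tag; check platforms/cuda.jl for the exact name and signature.
    dependencies = CUDA.required_dependencies(platform)
    build_tarballs(ARGS, name, version, sources, script, [platform], products, dependencies;
                   augment_platform_block=CUDA.augment, julia_compat="1.6")
end
```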

@stemann
Contributor Author

stemann commented Oct 24, 2023

I see. Would it make sense to re-build TensorRT v8.0.1 (and v8.0.3) using the new approach? (That might entail also re-building CUDNN v8.2.1 and v8.2.4)

The TensorRT v8.0.1 artifacts for CUDA 11.0-11.3 were essentially just re-packagings of the same CUDA 11.3 binaries.

The approach would have to be adapted to also handle CUDA 10.2...

@maleadt
Member

maleadt commented Oct 24, 2023

Would it make sense to re-build TensorRT v8.0.1 (and v8.0.3) using the new approach?

I don't think so; the changes don't affect the generated JLLs much, since it's more about creating maintainable build recipes.

@stemann
Contributor Author

stemann commented Oct 24, 2023

But for the latest CUDNN releases, the artifacts are built with e.g. cuda="11.0", and not with a minor-version-specific platform specification...

Edit: I was thinking that if TensorRT were re-built with a CUDA 11 platform (i.e. compatible across CUDA 11, not limited to 11.0-11.3), then ONNXRuntime could be re-built using CUDA 11.4 (with a CUDA 11 platform spec.) - as specified in the table on https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements - instead of using CUDA 11.3 to build ONNXRuntime.

@maleadt
Member

maleadt commented Oct 25, 2023

[CUDNN] artifacts are built with e.g. cuda="11.0", and not minor-version-specific platform specification

Why isn't that OK? CUDNN binaries are only provided for 11.x and 12.x, so they are compatible across the entire release. Our platform augmentation uses semver-style matching, so it should pick up the correct binary.

When building new artifacts, the new CUDA.supported_platforms will generate a build for every minor version, because the actual compatibility can vary on a case-by-case basis (binaries built for CUDA 11.0 aren't guaranteed to be compatible on all of 11.x, even though it's often the case). But that doesn't require the same of all dependencies, e.g., ONNX with a cuda=11.2 artifact should work just fine with CUDNN cuda=11.0.
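
As a toy illustration of that matching rule (a sketch of the intended semantics, not the code in platforms/cuda.jl): an artifact tagged with an older minor of the same major is accepted on a newer host, but not the other way around.

```julia
# Toy model of the semver-style cuda tag matching described above; the real
# logic lives in Yggdrasil's platforms/cuda.jl and Base.BinaryPlatforms.
artifact_matches_host(artifact::VersionNumber, host::VersionNumber) =
    artifact.major == host.major && artifact.minor <= host.minor

artifact_matches_host(v"11.0", v"11.4")  # true:  CUDNN built for 11.0 on a CUDA 11.4 host
artifact_matches_host(v"11.2", v"11.0")  # false: built for 11.2, not guaranteed on 11.0
artifact_matches_host(v"11.4", v"12.0")  # false: majors differ
```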

@stemann
Contributor Author

stemann commented Oct 25, 2023

It might be just fine... I just saw that the more recent CUDNN builds had only one artifact per CUDA-platform-major version, so I was wondering whether the re-packaging of the TensorRT 11.3-binary should be re-done - to get a single artifact rather than four artifacts - which are all just copies of the same 11.3-binary. (edit: but I guess not?)

I did also see the Yggdrasil CUDA platform augmentation, and that there was some semver-based matching in platforms/cuda.jl#L51, but the code/logic was a little hard to follow (probably due to the Base.BinaryPlatforms comparison strategies)...
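
As far as I can tell, the hook works roughly like this (a sketch from my reading of the Base.BinaryPlatforms docs, not the actual cuda.jl logic; in particular the four-argument comparator signature is my recollection):

```julia
using Base.BinaryPlatforms

# Custom tag comparator; Base.BinaryPlatforms calls it with the two tag values
# plus two booleans saying which side(s) installed the custom strategy.
function cuda_compare(a::String, b::String, a_requested::Bool, b_requested::Bool)
    va, vb = VersionNumber(a), VersionNumber(b)
    # Orient so that `host` is the side that requested the custom comparison.
    host, artifact = a_requested ? (va, vb) : (vb, va)
    return artifact.major == host.major && artifact.minor <= host.minor
end

host = Platform("x86_64", "linux"; cuda="11.4")
Base.BinaryPlatforms.set_compare_strategy!(host, "cuda", cuda_compare)

artifact = Platform("x86_64", "linux"; cuda="11.0")
platforms_match(host, artifact)  # true under this toy strategy
```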

@maleadt
Member

maleadt commented Oct 25, 2023

I was wondering whether the re-packaging of the TensorRT 11.3-binary should be re-done - to get a single artifact rather than four artifacts - which are all just copies of the same 11.3-binary

If it's just copies, then not. If it's a build, it might be safest to do so, unless you've verified that the binaries are forwards compatible.

@stemann
Contributor Author

stemann commented Oct 25, 2023

I was wondering whether the re-packaging of the TensorRT 11.3-binary should be re-done - to get a single artifact rather than four artifacts - which are all just copies of the same 11.3-binary

If it's just copies, then not. If it's a build, it might be safest to do so, unless you've verified that the binaries are forwards compatible.

I'm not completely sure what you are saying here... can you clarify?

The CUDA 11.0-11.3 artifacts for the TensorRT v8.0.1 JLL are just copies of the same CUDA 11.3-binary, cf. https://github.com/JuliaPackaging/Yggdrasil/pull/4347/files#diff-0d6f0803279fc06a9c714fb026ee87dd7ab9150c559ee167e63b2fd34e3f5af1

@maleadt
Member

maleadt commented Oct 25, 2023

I meant that if you're just copying the same files and tagging them with different CUDA tags, you shouldn't, and it's fine to only attach the 11.3 tag. Only if you actually build object files from sources using nvcc and the CUDA SDK, it may be safest to build for each CUDA toolkit version. But you're not doing that; I was just adding some context.
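
In recipe terms (a hypothetical illustration, not the actual TensorRT build_tarballs.jl), the difference is:

```julia
using Base.BinaryPlatforms

# What the re-packaging effectively did: four platforms, one identical tarball each.
copies = [Platform("x86_64", "linux"; cuda=v) for v in ("11.0", "11.1", "11.2", "11.3")]

# What suffices when the files are literal copies: tag only the version the
# binary was actually built against, and let the platform matching select it.
single = Platform("x86_64", "linux"; cuda="11.3")
```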

@stemann
Contributor Author

stemann commented Oct 25, 2023

Right, then in that case the TensorRT v8.0 JLLs are a little bogus, I believe...

@stemann
Contributor Author

stemann commented Nov 3, 2023

Seems like this one will be easier to complete once CUDA 10.2 has been restored, cf. #7623.

@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch 10 times, most recently from 664a71e to e31476d Compare December 18, 2023 18:45
@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch 3 times, most recently from 5e485f1 to 15360c6 Compare November 20, 2024 00:25
@stemann
Contributor Author

stemann commented Nov 20, 2024

@maleadt So, I think I finally figured this out :-)

It's pretty obvious by now, but anyway: the CUDA platform tag must match when using dependencies in BinaryBuilder. E.g., if building ONNXRuntime for CUDA 11.4 and using CUDNN as a dependency, the CUDNN build version used must have a CUDA 11.4 artifact. And similarly for TensorRT and so on, down the dependency chain.

So I guess, given that it is often the case that binaries built for CUDA 11 are compatible across all CUDA 11 versions (and similarly for CUDA 12, I assume), it should be preferable (as long as the binary compatibility holds) to only provide a single CUDA 11 platform artifact, e.g. for CUDA 11.0.

This would mean that e.g. TensorRT v8.0.1 or v8.0.3 for which Nvidia provides a CUDA 11.3 artifact, should be packaged as a CUDA 11.0 artifact (and no 11.1-11.3 copies of this). This sounds a little dodgy, but I hope it's true :-)

This would mean that this package, ONNXRuntime, should target CUDA 11.0 (and use CUDNN + TensorRT for CUDA 11.0), but use the CUDA 11.4 SDK when building (to reflect the versions reportedly used: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements, https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#requirements).

Right? :-)

@maleadt
Member

maleadt commented Nov 20, 2024

This would mean that e.g. TensorRT v8.0.1 or v8.0.3 for which Nvidia provides a CUDA 11.3 artifact, should be packaged as a CUDA 11.0 artifact (and no 11.1-11.3 copies of this).

I don't think this is correct. It generally holds that if you compile code for toolkit 11.0 it will work on 11.x (although there have been exceptions!); however, if you compile for a more recent version, it may use features that are not available on earlier minor releases. So if TensorRT is distributed for CUDA 11.3, I wouldn't expect it to fully work on 11.2 or below.

@stemann
Contributor Author

stemann commented Nov 20, 2024

Ah yes - that makes sense.

OK, then I guess the correct way is probably to build for CUDA 11.3 in this case (as this branch currently does) - as that is the version that has artifacts for the build versions of CUDNN and TensorRT.
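
Concretely, I imagine the dependency block along these lines (a sketch only; the version pins are illustrative, taken from the ONNX Runtime requirement tables, not necessarily what the recipe ends up with):

```julia
using BinaryBuilder, Pkg

platform = Platform("x86_64", "linux"; cuda="11.3")

dependencies = [
    # Build-time CUDA SDK matching the platform's cuda tag (version illustrative).
    BuildDependency(PackageSpec(name="CUDA_full_jll", version=v"11.3.1")),
    # Runtime dependencies: each must provide an artifact for cuda="11.3".
    Dependency(PackageSpec(name="CUDNN_jll", version=v"8.2.4")),
    Dependency(PackageSpec(name="TensorRT_jll", version=v"8.0.1")),
]
```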

@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch from 4baaa78 to f0d30aa Compare December 3, 2024 23:48
@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch from ea818a5 to 739a15e Compare December 4, 2024 00:34
Fails with: nvcc error   : 'ptxas' died due to signal 11 (Invalid memory reference)
@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch from 739a15e to 0d51868 Compare December 4, 2024 00:36
@stemann stemann marked this pull request as ready for review December 4, 2024 00:48
@stemann
Contributor Author

stemann commented Dec 4, 2024

@maleadt This should now also be up to current standards (for a combined CPU+CUDA JLL recipe) - similar to #9785.

@stemann stemann marked this pull request as draft January 10, 2025 09:20
@stemann stemann force-pushed the feature/onnxruntime_cuda_source_build branch from d954dd6 to 64e29bd Compare January 29, 2025 15:10
@stemann
Contributor Author

stemann commented Jan 29, 2025

@maleadt I have separated the CUDA parts into a separate ONNXRuntime_CUDA recipe - not using the cuda+none stuff, but still using lazy artifacts for ONNXRuntime_CUDA.

Could you check whether things are OK, and could you resolve those of your comments above that are "resolved" (i.e. postponed for the future)?

These recipes will need an update before being rolled into jw3126/ONNXRunTime.jl#19, as the Julia project is now using v1.20+, but I would still like to merge this v1.10 recipe first to get those parts done (including the Jetson CUDA 10.2 support).
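
For context, the CUDA side of the split ends up shaped roughly like this (a sketch building on the skeleton earlier in this thread; kwargs as I understand the usual CUDA-enabled recipes, not the literal merged build_tarballs.jl):

```julia
# Hypothetical tail of ONNXRuntime_CUDA/build_tarballs.jl, assuming name, version,
# sources, script, products and dependencies are defined above as usual:
# CUDA-only platforms, platform augmentation for the cuda tag, and lazy artifacts
# so the large per-CUDA tarballs are only downloaded when actually selected.
for platform in CUDA.supported_platforms()
    build_tarballs(ARGS, "ONNXRuntime_CUDA", version, sources, script,
                   [platform], products, dependencies;
                   augment_platform_block=CUDA.augment,
                   lazy_artifacts=true,
                   julia_compat="1.6")
end
```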

@stemann stemann marked this pull request as ready for review January 29, 2025 15:29
Member

@maleadt maleadt left a comment


I'll defer to you on this, assuming you verified it all works.

@giordano giordano merged commit d68155d into JuliaPackaging:master Feb 3, 2025
14 checks passed
@stemann
Contributor Author

stemann commented Feb 3, 2025

@giordano The register job failed for ONNXRuntime_CUDA - any idea what happened between my last commit and the merged commit?

@stemann stemann deleted the feature/onnxruntime_cuda_source_build branch February 3, 2025 22:49
@giordano
Member

giordano commented Feb 3, 2025

I tried to restart the job but it's still failing. No clue what the issue is. For the record:

ERROR: LoadError: Incomplete JLL release!  Could not find tarball for aarch64-linux-gnu-cxx03-cuda+10.2-cuda_platform+jetson

@stemann
Contributor Author

stemann commented Feb 3, 2025

OK, thanks.

Yeah, I saw that.

Attempting a rebuild in #10426
