apptainer docker tensorflow container issue with 2.18 - skipping loading of GPU #217

Open
ashep29 opened this issue Jan 20, 2025 · 2 comments

ashep29 commented Jan 20, 2025

I pulled the image with apptainer pull docker://tensorflow/tensorflow:latest-gpu, but TensorFlow 2.18 skips registering the GPU and prints this message:

W0000 00:00:1736383795.205652 747392 gpu_device.cc:2344] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at [https://www.tensorflow.org/install/gpu] for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

The TensorFlow 2.17 container worked fine, but the 2.18 container appears to want cuDNN 9 (libcudnn.9.0) while the image only provides cuDNN 8 (libcudnn.8.0). The system's libcudnn is not mapped into the container, even though the system has both 8 and 9 installed. This looks like a bug in the container build.

Note: Installing 2.18 with the system Python (no containers) works fine, because the system has both cuDNN 8 and 9 installed.
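One way to confirm which cuDNN libraries the dynamic loader can actually reach is to try opening them directly, which is roughly what the "Cannot dlopen some GPU libraries" warning is complaining about. The following is a minimal sketch using only the Python standard library. The function names can_dlopen and cudnn_availability are made up for illustration, and the sonames libcudnn.so.8 and libcudnn.so.9 are assumed from the library names mentioned in this thread.

```python
import ctypes


def can_dlopen(soname: str) -> bool:
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be loaded.
        return False
    return True


def cudnn_availability(majors=(8, 9)) -> dict:
    """Map each assumed cuDNN soname to whether it can be loaded."""
    return {f"libcudnn.so.{m}": can_dlopen(f"libcudnn.so.{m}") for m in majors}


if __name__ == "__main__":
    for name, ok in cudnn_availability().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Running this once on the host and once inside the container (for example via apptainer exec --nv with the pulled image) should show whether libcudnn.so.9 is visible in each environment.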

@ngaywood

TensorFlow 2.18 tries to open libcudnn.so.9, which is missing from the Docker container.

This has also been reported here: "libcudnn.so.9 missing" (tensorflow/tensorflow#80538 (comment)).

@JoshCaughtFire

AFAIK, based on https://www.tensorflow.org/install/source#gpu, 2.18 requires cuDNN 9.3, but the current build (https://github.com/tensorflow/build/blob/master/tensorflow_runtime_dockerfiles/gpu.packages.txt) installs 8.9.6.50. As I understand it, that is the version last used with 2.13, which might also explain some of the stability issues I've seen when training on the Docker image with recent versions.
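The mismatch described here comes down to comparing major versions: cuDNN 9.x is required, while 8.x is installed. A small sketch of that comparison follows. The helper names cudnn_major and is_cudnn_mismatch are hypothetical, and the version strings are simply the ones quoted in this comment rather than values read from a real image.

```python
def cudnn_major(version: str) -> int:
    """Return the major component of a dotted version string such as '8.9.6.50'."""
    return int(version.strip().split(".")[0])


def is_cudnn_mismatch(required: str, installed: str) -> bool:
    """True when the required and installed cuDNN major versions differ."""
    return cudnn_major(required) != cudnn_major(installed)


if __name__ == "__main__":
    required = "9.3"        # cuDNN version listed for 2.18 in the install guide
    installed = "8.9.6.50"  # version pinned in gpu.packages.txt, per this comment
    if is_cudnn_mismatch(required, installed):
        print(f"cuDNN major mismatch: need {cudnn_major(required)}, "
              f"have {cudnn_major(installed)}")
```

If TensorFlow is importable, tf.sysconfig.get_build_info() may be a more direct source for the required version, since it reports the CUDA and cuDNN versions the wheel was built against. The exact value it returns for 2.18 is not confirmed here.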

Building a Docker image from the official image with that library installed seems to resolve it:

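# Official TensorFlow 2.18 GPU image (ships cuDNN 8).
FROM tensorflow/tensorflow:2.18.0-gpu

# Add the cuDNN 9.3 runtime that TensorFlow 2.18 expects.
RUN apt-get update && apt-get install -y --no-install-recommends libcudnn9-cuda-12=9.3.0.75-1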

Fixing the official image would be preferable, but this workaround has worked for my project. I haven't looked closely enough at how the image is built to know whether the fix is as simple as updating gpu.packages.txt, or whether the package list needs to change depending on the TensorFlow version being built. I think the image is built internally at Google, and I'm not sure what their setup is.
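To check whether a rebuilt image actually registers the GPU, one option is to ask TensorFlow which GPU devices it sees. This is a rough sketch: visible_gpus is a hypothetical helper name, and the import is guarded so the script also runs (and returns None) where TensorFlow is not installed.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow has registered, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list corresponds to the "Skipping registering GPU devices..." case.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif gpus:
        print("GPUs registered:", ", ".join(gpus))
    else:
        print("No GPUs registered; check the cuDNN libraries in the image")
```

An empty list from the patched image would suggest that some other library is still failing to load.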
