
NVD_GPU environment variable does not work on dual Nvidia GPU system #413

@quiet23

Hi,

my system has 2 Nvidia GPUs:

$ nvidia-smi -L
GPU 0: NVIDIA P104-100 (UUID: GPU-fb525077-4b23-625f-1edb-e1e6c559cce7)
GPU 1: NVIDIA GeForce GT 1030 (UUID: GPU-fd84cda2-8a73-aa66-4106-9504ad09cc6f)

The P104 has no HW decoder and no video outputs, so the only monitor is connected to the GT 1030.

I've installed nvidia-vaapi-driver version 0.0.14 according to the manual. vainfo succeeds but lists 0 entrypoints, even when pointed at the correct GPU with NVD_GPU:

$ NVD_LOG=1 vainfo
libva info: VA-API version 1.20.0
libva info: User environment variable requested driver 'nvidia'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      5582.086991989 [17151-17151] ../src/vabackend.c:2260       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver
      5582.087009673 [17151-17151] ../src/vabackend.c:2268       __vaDriverInit_1_0 Got DRM FD: 1 4
      5582.087020659 [17151-17151] ../src/vabackend.c:2280       __vaDriverInit_1_0 Now have 0 (0 max) instances
      5582.087024914 [17151-17151] ../src/vabackend.c:2307       __vaDriverInit_1_0 Selecting Direct backend
      5582.123241262 [17151-17151] ../src/direct/nv-driver.c: 305            init_nvdriver Initing nvdriver...
      5582.123290800 [17151-17151] ../src/direct/nv-driver.c: 323            init_nvdriver NVIDIA kernel driver version: 575.57.08, major version: 575, minor version: 57
      5582.123298413 [17151-17151] ../src/direct/nv-driver.c: 330            init_nvdriver Got dev info: 200 1 0 fe
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.20 (libva 2.12.0)
vainfo: Driver version: VA-API NVDEC driver [direct backend]
vainfo: Supported profile and entrypoints
      5582.265742460 [17151-17151] ../src/vabackend.c:2168              nvTerminate Terminating 0x63929e8c3070
      5582.265826639 [17151-17151] ../src/vabackend.c:2182              nvTerminate Now have 0 (0 max) instances
$ NVD_LOG=1 NVD_GPU=1 vainfo
libva info: VA-API version 1.20.0
libva info: User environment variable requested driver 'nvidia'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      5611.513418768 [17168-17168] ../src/vabackend.c:2260       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver
      5611.513429831 [17168-17168] ../src/vabackend.c:2268       __vaDriverInit_1_0 Got DRM FD: 1 -1
      5611.513434021 [17168-17168] ../src/vabackend.c:2280       __vaDriverInit_1_0 Now have 0 (0 max) instances
      5611.513438163 [17168-17168] ../src/vabackend.c:2307       __vaDriverInit_1_0 Selecting Direct backend
      5611.547755089 [17168-17168] ../src/direct/direct-export-buf.c: 107      direct_initExporter Searching for GPU: 0 1 128
      5611.547792056 [17168-17168] ../src/direct/direct-export-buf.c: 107      direct_initExporter Searching for GPU: 1 1 129
      5611.547802843 [17168-17168] ../src/direct/direct-export-buf.c: 129      direct_initExporter Found NVIDIA GPU 1 at /dev/dri/renderD129
      5611.547807709 [17168-17168] ../src/direct/nv-driver.c: 305            init_nvdriver Initing nvdriver...
      5611.547833969 [17168-17168] ../src/direct/nv-driver.c: 323            init_nvdriver NVIDIA kernel driver version: 575.57.08, major version: 575, minor version: 57
      5611.547840756 [17168-17168] ../src/direct/nv-driver.c: 330            init_nvdriver Got dev info: 200 1 0 fe
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.20 (libva 2.12.0)
vainfo: Driver version: VA-API NVDEC driver [direct backend]
vainfo: Supported profile and entrypoints
      5611.687465645 [17168-17168] ../src/vabackend.c:2168              nvTerminate Terminating 0x5ed447b4e070
      5611.687560134 [17168-17168] ../src/vabackend.c:2182              nvTerminate Now have 0 (0 max) instances

I've found a place in nv-driver.c that looks clearly wrong, because nv_rm_control is declared to return bool, yet its result is stored in an int and a non-zero (true) value is treated as failure:

    const int ret = nv_rm_control(context->nvctlFd, context->clientObject, context->clientObject, NV0000_CTRL_CMD_GPU_GET_UUID_FROM_GPU_ID, 0, sizeof(uuidParams), &uuidParams);
    if (ret) {
        return false;
    }

After changing it like this:

    const bool ret = nv_rm_control(context->nvctlFd, context->clientObject, context->clientObject, NV0000_CTRL_CMD_GPU_GET_UUID_FROM_GPU_ID, 0, sizeof(uuidParams), &uuidParams);
    if (!ret) {
        return false;
    }

the driver was able to list the entrypoints, even without NVD_GPU set:

$ NVD_LOG=1 vainfo
libva info: VA-API version 1.20.0
libva info: User environment variable requested driver 'nvidia'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      5933.271500838 [18020-18020] ../src/vabackend.c:2260       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver
      5933.271515169 [18020-18020] ../src/vabackend.c:2268       __vaDriverInit_1_0 Got DRM FD: 1 4
      5933.271528650 [18020-18020] ../src/vabackend.c:2280       __vaDriverInit_1_0 Now have 0 (0 max) instances
      5933.271535499 [18020-18020] ../src/vabackend.c:2307       __vaDriverInit_1_0 Selecting Direct backend
      5933.312221875 [18020-18020] ../src/direct/nv-driver.c: 305            init_nvdriver Initing nvdriver...
      5933.312270417 [18020-18020] ../src/direct/nv-driver.c: 323            init_nvdriver NVIDIA kernel driver version: 575.57.08, major version: 575, minor version: 57
      5933.312278119 [18020-18020] ../src/direct/nv-driver.c: 330            init_nvdriver Got dev info: 200 1 0 fe
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.20 (libva 2.12.0)
vainfo: Driver version: VA-API NVDEC driver [direct backend]
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            :	VAEntrypointVLD
      VAProfileMPEG2Main              :	VAEntrypointVLD
      VAProfileVC1Simple              :	VAEntrypointVLD
      VAProfileVC1Main                :	VAEntrypointVLD
      VAProfileVC1Advanced            :	VAEntrypointVLD
      VAProfileH264Main               :	VAEntrypointVLD
      VAProfileH264High               :	VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:	VAEntrypointVLD
      VAProfileHEVCMain               :	VAEntrypointVLD
      VAProfileVP9Profile0            :	VAEntrypointVLD
      VAProfileHEVCMain10             :	VAEntrypointVLD
      VAProfileHEVCMain12             :	VAEntrypointVLD
      VAProfileVP9Profile2            :	VAEntrypointVLD
      5933.441319854 [18020-18020] ../src/vabackend.c:2168              nvTerminate Terminating 0x59d10a544070
      5933.441411411 [18020-18020] ../src/vabackend.c:2182              nvTerminate Now have 0 (0 max) instances
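
The inverted return-value check in nv-driver.c can be reproduced in isolation. The following is only a minimal sketch: fake_control, query_inverted and query_checked are hypothetical names, and fake_control merely stands in for a helper that, like nv_rm_control, is declared to return bool with true meaning success.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a helper declared to return bool:
 * true on success, false on failure. */
static bool fake_control(bool succeed) {
    return succeed;
}

/* Inverted check: the bool is stored in an int and read like an
 * errno-style code (non-zero means failure), so this function
 * reports failure exactly when the call succeeded. */
static bool query_inverted(bool succeed) {
    const int ret = fake_control(succeed);
    if (ret) {
        return false;
    }
    return true;
}

/* Check that matches the declared return type. */
static bool query_checked(bool succeed) {
    const bool ok = fake_control(succeed);
    if (!ok) {
        return false;
    }
    return true;
}
```

With a succeeding call, query_inverted returns false while query_checked returns true, which is consistent with the UUID lookup being reported as failed before the change.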

Now, when I try to play a video in Firefox, the about:support page correctly lists the HW decoding capabilities of the GPU, but the log shows the error 'invalid device ordinal' (101):

$ MOZ_DISABLE_RDD_SANDBOX=1 NVD_LOG=1 firefox
libva info: VA-API version 1.20.0
libva info: User environment variable requested driver 'nvidia'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      6093.560121910 [18503-18533] ../src/vabackend.c:2260       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver
      6093.560130921 [18503-18533] ../src/vabackend.c:2268       __vaDriverInit_1_0 Got DRM FD: 1 26
      6093.560142661 [18503-18533] ../src/vabackend.c:2280       __vaDriverInit_1_0 Now have 0 (0 max) instances
      6093.560148185 [18503-18533] ../src/vabackend.c:2307       __vaDriverInit_1_0 Selecting Direct backend
      6093.626611700 [18503-18533] ../src/direct/nv-driver.c: 305            init_nvdriver Initing nvdriver...
      6093.626688161 [18503-18533] ../src/direct/nv-driver.c: 323            init_nvdriver NVIDIA kernel driver version: 575.57.08, major version: 575, minor version: 57
      6093.626705166 [18503-18533] ../src/direct/nv-driver.c: 330            init_nvdriver Got dev info: 200 1 0 fe
libva info: va_openDriver() returns 0
      6093.848971320 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e9060)
      6093.849013008 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e83e0)
      6093.849027056 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8fc0)
      6093.849038518 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8ca0)
      6093.849048740 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e80c0)
      6093.849058929 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8a20)
      6093.849069940 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e91a0)
      6093.849078532 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8160)
      6093.849086987 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e9240)
      6093.849096000 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8de0)
      6093.849104633 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e92e0)
      6093.849113084 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8f20)
      6093.849122041 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8660)
      6093.849130975 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8200)
      6093.849140053 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8e80)
      6093.849148689 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8b60)
      6093.849158149 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e9100)
      6093.849167167 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8340)
      6093.849176043 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8480)
      6093.849184821 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e8700)
      6093.849193501 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e85c0)
      6093.849202402 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e9380)
      6093.849210843 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e82a0)
      6093.849219440 [18503-18841] ../src/vabackend.c: 988        nvCreateSurfaces2 Creating surface 1920x1088, format 1 (0x7715dc4e88e0)
      6093.855416765 [18503-18841] ../src/vabackend.c:1051          nvCreateContext Creating context with 24 render targets, at 1920x1088
      6093.882865331 [18503-18841] ../src/vabackend.c:1134          nvCreateContext Creating decoder: 0x7715dc203000 for context id: 34
      6093.883274713 [18503-18844] ../src/vabackend.c: 403          resolveSurfaces [RT] Resolve thread for 0x7715d6c07000 started
      6093.885448279 [18503-18841] ../src/direct/direct-export-buf.c: 190 direct_allocateBackingImage Allocating BackingImages: 0x7715dca4fa50 1920x1088
      6093.886631203 [18503-18841] ../src/direct/direct-export-buf.c:  75           import_to_cuda CUDA ERROR 'invalid device ordinal' (101)

      6093.887047438 [18503-18841] ../src/direct/direct-export-buf.c: 330    direct_realiseSurface Unable to realise surface: 0x7715dc4e88e0 (0)
      6093.887231219 [18503-18841] ../src/vabackend.c:2147    nvExportSurfaceHandle Unable to export surface
      6093.888541902 [18503-18842] ../src/vabackend.c:1170         nvDestroyContext Destroying context: 34
      6093.888655329 [18503-18842] ../src/vabackend.c: 321           destroyContext Signaling resolve thread to exit
      6093.888706285 [18503-18842] ../src/vabackend.c: 327           destroyContext Waiting for resolve thread to exit
      6094.034130437 [18503-18844] ../src/direct/direct-export-buf.c: 190 direct_allocateBackingImage Allocating BackingImages: 0x7715dca4fd90 1920x1088
      6094.034950442 [18503-18844] ../src/direct/direct-export-buf.c:  75           import_to_cuda CUDA ERROR 'invalid device ordinal' (101)

      6094.035203983 [18503-18844] ../src/direct/direct-export-buf.c: 330    direct_realiseSurface Unable to realise surface: 0x7715dc4e88e0 (0)
      6094.111230459 [18503-18844] ../src/vabackend.c: 458          resolveSurfaces [RT] Resolve thread for 0x7715d6c07000 exiting
      6094.111279813 [18503-18842] ../src/vabackend.c: 329           destroyContext Finished waiting for resolve thread with 0
      6094.111402997 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface 0 (0x7715dc4e88e0)
      6094.111411050 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e82a0)
      6094.111414857 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e9380)
      6094.111418464 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e85c0)
      6094.111421853 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8700)
      6094.111425209 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8480)
      6094.111428587 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8340)
      6094.111432771 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e9100)
      6094.111437154 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8b60)
      6094.111441223 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8e80)
      6094.111445244 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8200)
      6094.111449276 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8660)
      6094.111453225 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8f20)
      6094.111457209 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e92e0)
      6094.111461242 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8de0)
      6094.111466864 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e9240)
      6094.111474748 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8160)
      6094.111481457 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e91a0)
      6094.111488247 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8a20)
      6094.111494665 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e80c0)
      6094.111501065 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8ca0)
      6094.111507339 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e8fc0)
      6094.111513839 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e83e0)
      6094.111520304 [18503-18842] ../src/vabackend.c:1023        nvDestroySurfaces Destroying surface -1 (0x7715dc4e9060)
      6094.111529986 [18503-18842] ../src/vabackend.c:2168              nvTerminate Terminating 0x77161ff36d40
      6094.111650697 [18503-18842] ../src/vabackend.c:2182              nvTerminate Now have 0 (0 max) instances

The error is returned by the cuExternalMemoryGetMappedMipmappedArray function after a successful call to cuImportExternalMemory. I can only guess that the memory allocated earlier by alloc_image (direct-export-buf.c:192), which is then used in the cuImportExternalMemory and cuExternalMemoryGetMappedMipmappedArray calls, was allocated on the wrong GPU. The alloc_image -> alloc_memory -> nv_alloc_object call chain uses the NV01_MEMORY_LOCAL_USER flag, which suggests the memory is allocated on the default Nvidia device. On my system that is the headless P104, and I was unable to change that.
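
One way to test this guess would be to check which physical GPU a given device actually is by comparing its UUID with the nvidia-smi -L output above. The following is only a sketch under assumptions: it assumes the 16 raw UUID bytes were already obtained elsewhere (for example via cuDeviceGetUuid from the CUDA driver API, which is not called here), and that printing those bytes in order as hex yields the string nvidia-smi shows. format_gpu_uuid and uuid_matches are hypothetical helper names, not part of the driver.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* "GPU-" + 32 hex digits + 4 dashes + terminating NUL = 41 bytes. */
#define GPU_UUID_STR_LEN 41

/* Format 16 raw UUID bytes in the style printed by nvidia-smi -L.
 * Assumption: the bytes are printed in order, without reordering. */
static void format_gpu_uuid(const unsigned char raw[16],
                            char out[GPU_UUID_STR_LEN]) {
    snprintf(out, GPU_UUID_STR_LEN,
             "GPU-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
             "%02x%02x%02x%02x%02x%02x",
             raw[0], raw[1], raw[2], raw[3], raw[4], raw[5],
             raw[6], raw[7], raw[8], raw[9], raw[10], raw[11],
             raw[12], raw[13], raw[14], raw[15]);
}

/* Return true if the raw UUID corresponds to the given nvidia-smi string. */
static bool uuid_matches(const unsigned char raw[16], const char *smi_uuid) {
    char buf[GPU_UUID_STR_LEN];
    format_gpu_uuid(raw, buf);
    return strcmp(buf, smi_uuid) == 0;
}
```

If the device used for the import turned out to match GPU-fb525077-4b23-625f-1edb-e1e6c559cce7 (the P104) rather than the GT 1030, that would support the wrong-GPU explanation.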
