Renderer GPU(vulkan/x11): stops rendering after window resize #11075

Closed
Green-Sky opened this issue Oct 5, 2024 · 20 comments

@Green-Sky
Contributor

Green-Sky commented Oct 5, 2024

With the following error log:

ERROR: Failed to acquire swapchain texture:
ERROR: Failed to acquire swapchain texture:
ERROR: Failed to acquire swapchain texture:

The rest of the application continues fine, but SDL events seem to stop.

on https://github.com/libsdl-org/SDL/releases/tag/preview-3.1.3
9dd8859 still worked fine

edit: this seems to have gradually degraded

  • 9dd8859 still worked fine
  • sometime after that, it started to print ERROR: Failed to acquire swapchain texture: while resizing and to look funny, but it still kept working afterwards
  • afdf325 breaks it totally

edit2: this is on a linux x11 NVIDIA device (555.58.02)

@Green-Sky
Contributor Author

Force quitting hung the whole X11 session for 1 second.

@Green-Sky
Contributor Author

$ git bisect good
afdf325fb4090e93a124519d1a3bc1fbe0ba9025 is the first bad commit
commit afdf325fb4090e93a124519d1a3bc1fbe0ba9025
Author: Evan Hemsley <[email protected]>
Date:   Mon Sep 30 10:23:19 2024 -0700

    GPU: Add swapchain dimension out params (#11003)

 include/SDL3/SDL_gpu.h          |  22 ++-
 src/dynapi/SDL_dynapi_procs.h   |   2 +-
 src/gpu/SDL_gpu.c               |  12 +-
 src/gpu/SDL_sysgpu.h            |   4 +-
 src/gpu/d3d11/SDL_gpu_d3d11.c   |  23 ++-
 src/gpu/d3d12/SDL_gpu_d3d12.c   |  26 ++-
 src/gpu/metal/SDL_gpu_metal.m   |  16 +-
 src/gpu/vulkan/SDL_gpu_vulkan.c | 492 +++++++++++++++++++++++------------------------
 src/render/gpu/SDL_render_gpu.c |  19 +-
 test/testgpu_simple_clear.c     |   2 +-
 test/testgpu_spinning_cube.c    |   6 +-
 11 files changed, 343 insertions(+), 281 deletions(-)

#11003

@flibitijibibo
Collaborator

A few things that'll help us diagnose:

  • Is there a specific test app that exhibits this behavior?
  • Does this also happen via Xwayland?
  • Do the Vulkan validation layers point to anything in particular? The SDL examples should enable them in debug mode, provided the system has them installed.

flibitijibibo added this to the 3.2.0 milestone on Oct 5, 2024

@Green-Sky
Contributor Author

Green-Sky commented Oct 5, 2024

> A few things that'll help us diagnose:
>
>   • Is there a specific test app that exhibits this behavior?

The test/testgpu_spinning_cube simply exits as soon as the first

Failed to acquire swapchain texture:

is encountered.
I checked and be401dd introduced this behavior. This seems to be intended, but I am not sure that what is reported is actually an error.

The test/testnative executable, however, exhibits my issue perfectly. Just resize it until it hangs the screen or stops rendering (but keeps running).

sdl3_renderer_gpu_vulkan_error1.mp4

(includes a lack of frames at the x11(?) freeze)

>   • Does this also happen via Xwayland?
>   • Do the Vulkan validation layers point to anything in particular? The SDL examples should enable them in debug mode, provided the system has them installed.

Not sure how to enable the validation layers, but I will keep trying.
On my x11-nvidia NixOS setup I am not comfortable switching to Wayland yet, however that is on my long-term todo list :)

@thatcosmonaut
Collaborator

Curious if this is possibly related to #9698

@meyraud705
Contributor

On an AMD card on X11 I get errors and sometimes a validation layer message when resizing, then application continues fine:

ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
VUID-VkSwapchainCreateInfoKHR-pNext-07781(ERROR / SPEC): msgNum: 1284057537 - Validation Error: [ VUID-VkSwapchainCreateInfoKHR-pNext-07781 ] | MessageID = 0x4c8929c1 | vkCreateSwapchainKHR(): pCreateInfo->imageExtent (width = 545, height = 462), which is outside the bounds returned by vkGetPhysicalDeviceSurfaceCapabilitiesKHR(): currentExtent = (width = 553, height = 468), minImageExtent = (width = 553, height = 468), maxImageExtent = (width = 553, height = 468). The Vulkan spec states: If a VkSwapchainPresentScalingCreateInfoEXT structure was not included in the pNext chain, or it is included and VkSwapchainPresentScalingCreateInfoEXT::scalingBehavior is zero then imageExtent must be between minImageExtent and maxImageExtent, inclusive, where minImageExtent and maxImageExtent are members of the VkSurfaceCapabilitiesKHR structure returned by vkGetPhysicalDeviceSurfaceCapabilitiesKHR for the surface (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-VkSwapchainCreateInfoKHR-pNext-07781)
    Objects: 0
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR

@flibitijibibo
Collaborator

I think we ended up removing the extent checks because we thought the window events covered it, but it seems X11 has other ideas - I think all we need to revert from the bad commits is the removal of min/max size checks and this will work again.

@thatcosmonaut
Collaborator

This may have been fixed by 6ae5666. Someone who can repro will have to confirm.

@Green-Sky
Contributor Author

@thatcosmonaut I did check yesterday, but no change.

@thatcosmonaut
Collaborator

@Green-Sky Could you try testing this PR: #11139

@Green-Sky
Contributor Author

@thatcosmonaut the pr does not change the behavior.

@KitsuneAlex

> @thatcosmonaut the pr does not change the behavior.

Can confirm, issue is persisting for me on PopOS 22.04/Kernel 6.9.3-76060903-generic/X11/NVIDIA 560.35.03

flibitijibibo changed the title from "Renderer GPU(vulkan): stops rendering after window resize" to "Renderer GPU(vulkan/x11): stops rendering after window resize" on Jan 9, 2025

@flibitijibibo
Collaborator

We may need additional help with this one as I'm pretty sure all of us are on Wayland systems at this point, and I haven't seen this with Xwayland or Wayland in my own testing of FNA's swapchains. If any X-perts want to volunteer we'd really like to reassign this so cosmonaut can focus on threading and fragment storage writes.

flibitijibibo added the "help wanted" (Extra attention is needed) label on Jan 9, 2025

@kg
Contributor

kg commented Jan 21, 2025

I can reproduce with spinning cube in my Debian VM (which I don't think is using Wayland). After a few resizes it segfaults and my compositor seems to restart (screen goes black and the journalctl log shows a bunch of XCB errors + a bunch of hardware info dumps from kwin_x11).

testnative is fine though no matter how many times I resize it.

@kg
Contributor

kg commented Jan 21, 2025

valgrind shows some errors:

==127318== Invalid read of size 8
==127318==    at 0x4A10D7D: VULKAN_INTERNAL_DefragmentMemory (SDL_gpu_vulkan.c:10641)
==127318==    by 0x4A10D7D: VULKAN_Submit (SDL_gpu_vulkan.c:10569)
==127318==    by 0x10EEB1: Render (testgpu_spinning_cube.c:457)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:677)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:666)
==127318==    by 0x10DDC4: main (testgpu_spinning_cube.c:745)
==127318==  Address 0x5cf1480 is 16 bytes after a block of size 16 alloc'd
==127318==    at 0x48407B4: malloc (vg_replace_malloc.c:381)
==127318==    by 0x4929F92: SDL_malloc_REAL (SDL_malloc.c:6452)
==127318==    by 0x4A0BDDB: VULKAN_INTERNAL_CreateUniformBuffer (SDL_gpu_vulkan.c:6802)
==127318==    by 0x4A0BDDB: VULKAN_CreateDevice (SDL_gpu_vulkan.c:11681)
==127318==    by 0x48CAAE4: SDL_CreateGPUDeviceWithProperties_REAL (SDL_gpu.c:529)
==127318==    by 0x48CAB50: SDL_CreateGPUDevice_REAL (SDL_gpu.c:507)
==127318==    by 0x10E5BF: init_render_state (testgpu_spinning_cube.c:514)
==127318==    by 0x10DDA0: main (testgpu_spinning_cube.c:734)
==127318== 
==127318== Invalid read of size 1
==127318==    at 0x4846782: strlen (vg_replace_strmem.c:494)
==127318==    by 0x121B61BE: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so)
==127318==    by 0x579587B: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318==    by 0x577F577: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318==    by 0x4A072DE: VULKAN_INTERNAL_CreateBuffer (SDL_gpu_vulkan.c:4166)
==127318==    by 0x4A10D95: VULKAN_INTERNAL_DefragmentMemory (SDL_gpu_vulkan.c:10641)
==127318==    by 0x4A10D95: VULKAN_Submit (SDL_gpu_vulkan.c:10569)
==127318==    by 0x10EEB1: Render (testgpu_spinning_cube.c:457)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:677)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:666)
==127318==    by 0x10DDC4: main (testgpu_spinning_cube.c:745)
==127318==  Address 0x81 is not stack'd, malloc'd or (recently) free'd
==127318== 
==127318== 
==127318== Process terminating with default action of signal 11 (SIGSEGV)
==127318==  Access not within mapped region at address 0x81
==127318==    at 0x4846782: strlen (vg_replace_strmem.c:494)
==127318==    by 0x121B61BE: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so)
==127318==    by 0x579587B: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318==    by 0x577F577: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318==    by 0x4A072DE: VULKAN_INTERNAL_CreateBuffer (SDL_gpu_vulkan.c:4166)
==127318==    by 0x4A10D95: VULKAN_INTERNAL_DefragmentMemory (SDL_gpu_vulkan.c:10641)
==127318==    by 0x4A10D95: VULKAN_Submit (SDL_gpu_vulkan.c:10569)
==127318==    by 0x10EEB1: Render (testgpu_spinning_cube.c:457)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:677)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:666)
==127318==    by 0x10DDC4: main (testgpu_spinning_cube.c:745)
==127318==  If you believe this happened as a result of a stack
==127318==  overflow in your program's main thread (unlikely but
==127318==  possible), you can try to increase the size of the
==127318==  main thread stack using the --main-stacksize= flag.
==127318==  The main thread stack size used in this run was 8388608.
==127318== 
==127318== HEAP SUMMARY:
==127318==     in use at exit: 108,576,693 bytes in 9,778 blocks
==127318==   total heap usage: 47,601 allocs, 37,823 frees, 189,414,856 bytes allocated
==127318== 
==127318== LEAK SUMMARY:
==127318==    definitely lost: 48 bytes in 1 blocks
==127318==    indirectly lost: 0 bytes in 0 blocks
==127318==      possibly lost: 814,704 bytes in 2,347 blocks
==127318==    still reachable: 107,761,941 bytes in 7,430 blocks
==127318==         suppressed: 0 bytes in 0 blocks
==127318== Rerun with --leak-check=full to see details of leaked memory
==127318== 
==127318== For lists of detected and suppressed errors, rerun with: -s
==127318== ERROR SUMMARY: 6 errors from 4 contexts (suppressed: 0 from 0)

@Green-Sky
Contributor Author

Green-Sky commented Jan 21, 2025

> testnative is fine though no matter how many times I resize it.

Yes, SDL_Render no longer defaults to the SDL_GPU backend.

@kg
Contributor

kg commented Jan 21, 2025

I worked with some of the devs on discord to dig in a little further. A few discoveries:

  • There's a // little hack for defrag which uses ->container to smuggle a VulkanUniformBuffer *, which is technically maybe almost sort of safe except not really. Replacing that with a proper implementation gets to a segfault inside of defragmentation.
  • The segfault inside of defragmentation is here, inside of DefragmentMemory:
    [image not preserved]
    Based on prodding it with gdb, it seems like the container is a bad pointer while other fields, e.g. the buffer, are valid. I suspect this is related to how the window's containers are managed specially vs other containers; perhaps it's a dangling pointer to containers from the window before the resize that are no longer a live allocation. I've handed this off to the others to dig in further.

Thread 1 "testgpu_spinnin" received signal SIGSEGV, Segmentation fault.
VULKAN_INTERNAL_DefragmentMemory (renderer=0x55555568f8a0) at /home/kate/Projects/SDL/src/gpu/vulkan/SDL_gpu_vulkan.c:10648
10648               newBuffer = VULKAN_INTERNAL_CreateBuffer(
(gdb) info locals
allocation = 0x555555847e80
currentRegion = 0x555555a95280
newBuffer = 0x7ffff7ee8d7d <VULKAN_INTERNAL_PerformPendingDestroys+1527>
newTexture = 0x555cdd84
bufferCopy = {srcOffset = 140737488346280, dstOffset = 140737352865038, size = 140737488344816}
imageCopy = {srcSubresource = {aspectMask = 42893, mipLevel = 0, baseArrayLayer = 4113016360, layerCount = 0}, srcOffset = {x = -10656, y = 32767, z = -135368401}, dstSubresource = {
    aspectMask = 32767, mipLevel = 1434737952, baseArrayLayer = 21845, layerCount = 1432942752}, dstOffset = {x = 21845, y = -9048, z = 32767}, extent = {width = 4159477006, height = 32767, 
    depth = 1434689552}}
commandBuffer = 0x555555b889e0
srcSubresource = 0x0
dstSubresource = 0x555555a46840
i = 0
subresourceIndex = 4294967295
__func__ = "VULKAN_INTERNAL_DefragmentMemory"

(gdb) p $_siginfo._sifields._sigfault.si_addr
$4 = (void *) 0x555000fc09dc
(gdb) print *currentRegion->vulkanBuffer->container
Cannot access memory at address 0x555000fc09bc
(gdb) print currentRegion->vulkanBuffer->buffer
$5 = (VkBuffer) 0x5555558375f0
(gdb) print *currentRegion->vulkanBuffer->buffer
$6 = <incomplete type>

icculus pushed a commit that referenced this issue Jan 21, 2025
@icculus
Collaborator

icculus commented Jan 21, 2025

Okay, I've pushed @kg's fix, which seems to resolve the memory access issues, and then one on top of it which fixes swapchain texture acquisition over here for me. Please retest the latest in main asap if you were having problems with this, it's our last bug before shipping 3.2.0! :)

icculus pushed a commit that referenced this issue Jan 21, 2025
@icculus
Collaborator

icculus commented Jan 21, 2025

@kg had some extra difficulties she resolved--we think they were exposed by running in a virtual machine--plus some other good fixes, in that last commit.

@Green-Sky
Contributor Author

I can confirm, the spinning cube no longer crashes or hangs when resizing on latest master 🥳.

I double checked with asan enabled.

I also bisected and 6d5815d was the one that fixed it for me.
