SDL3 GPU WebGPU Backend #12046

Open: wants to merge 41 commits into base: main

Conversation

klukaszek

Description

Congrats on shipping SDL 3.2.0, and on officially releasing SDL3!

Now that SDL3 has been released, I have decided to open a PR for my work on the WebGPU backend, as suggested by @flibitijibibo.

Below is a checklist of the API methods, as well as a checklist of working examples (as of 2025-01-21).

Examples and more info can be found at: https://github.com/klukaszek/SDL3-WebGPU-Examples
(Based on https://github.com/TheSpydog/SDL_gpu_examples/)

A live demo can be found at: https://kylelukaszek.com/SDL3-WebGPU-Examples/.

My fork currently fails the Emscripten pipeline test, for reasons I haven't taken the time to investigate yet. That will probably have to be resolved before merging into main.

I'm probably gonna get to work on compute pipelines sometime soon if no one ends up working on that by the time I'm free again.

Shaders

The current implementation of the backend expects WGSL shaders, since I have only tested in browsers and browser implementations of WebGPU don't support the SPIRV SType. This can be tackled once native WGPU support becomes a priority.

API Checklist

General

  • DestroyDevice
  • SupportsPresentMode
  • ClaimWindow
  • ReleaseWindow

Swapchains

  • SetSwapchainParameters
  • SupportsTextureFormat
  • SupportsSampleCount
  • SupportsSwapchainComposition

Command Buffers and Fences

  • AcquireCommandBuffer
  • AcquireSwapchainTexture
  • GetSwapchainTextureFormat
  • Submit
  • SubmitAndAcquireFence (Should just call Submit)
  • Cancel (Should be no-op for WebGPU)
  • Wait (Should be no-op for WebGPU)
  • WaitForFences (Should be no-op for WebGPU)
  • QueryFence (Should be no-op for WebGPU)
  • ReleaseFence (Should be no-op for WebGPU)

Note: WebGPU has no exposed fence API.

Buffers

  • CreateBuffer
  • ReleaseBuffer
  • SetBufferName
  • CreateTransferBuffer
  • ReleaseTransferBuffer
  • MapTransferBuffer
  • UnmapTransferBuffer
  • UploadToBuffer
  • DownloadFromBuffer
  • CopyBufferToBuffer

Textures

Samplers

  • CreateSampler
  • ReleaseSampler

Debugging

  • InsertDebugLabel
  • PushDebugGroup
  • PopDebugGroup

Graphics Pipelines

  • CreateGraphicsPipeline
  • BindGraphicsPipeline
  • ReleaseGraphicsPipeline

Compute Pipelines

  • CreateComputePipeline
  • BindComputePipeline
  • ReleaseComputePipeline

Shaders

  • CreateShader
  • ReleaseShader

Rendering

  • BeginRenderPass
  • EndRenderPass
  • DrawPrimitivesIndirect
  • DrawPrimitives
  • DrawIndexedPrimitives
  • DrawIndexedPrimitivesIndirect

Copy Passes

  • BeginCopyPass
  • EndCopyPass

Compute Passes

  • BeginComputePass
  • EndComputePass
  • DispatchCompute
  • DispatchComputeIndirect
  • BindComputeSamplers
  • BindComputeStorageTextures
  • BindComputeStorageBuffers
  • PushComputeUniformData

Fragment Stage

  • BindFragmentSamplers
  • BindFragmentStorageTextures
  • BindFragmentStorageBuffers
  • PushFragmentUniformData
    • Needs to be rewritten.

Vertex Stage

  • BindVertexBuffers
  • BindIndexBuffer
  • BindVertexSamplers
  • BindVertexStorageTextures
  • BindVertexStorageBuffers
  • PushVertexUniformData
    • Needs to be rewritten.

Rendering States

  • SetViewport
  • SetScissor
  • SetBlendConstants
  • SetStencilReference

Composition

  • Blit
    • Mostly functional.
    • Bug: Example "Blit2DArray.c" has a sampler issue where the RHS is not downsampled.
    • Bug: Example "TriangleMSAA.c" does not cycle between different sample counts.

Example Checklist

  • ClearScreen.c
  • BasicTriangle.c
  • BasicVertexBuffer.c
  • CullMode.c
  • BasicStencil.c
  • InstancedIndexed.c
  • TexturedQuad.c
  • Texture2DArray.c
  • TexturedAnimatedQuad.c
    • Example loads with no warnings, but nothing draws.
    • Needs to be investigated.
  • Clear3DSlice.c
  • Blit2DArray.c
    • Sampler issue on right texture with no warnings, but draws.
    • Needs to be investigated.
  • BlitCube.c
  • BlitMirror.c
    • Example loads with no warnings, but nothing draws.
    • Needs to be investigated.
  • Cubemap.c
  • CopyAndReadback.c
  • CopyConsistency.c
  • BasicCompute.c
  • ComputeUniforms.c
  • ToneMapping.c
  • CustomSampling.c
  • DrawIndirect.c
  • ComputeSpriteBatch.c
  • TriangleMSAA.c
    • Draws properly, but no visible change occurs when changing the sample count.
    • Needs to be investigated.
  • WindowResize.c (Resizes browser canvas. Have not tested anything natively.)
  • GenerateMipmaps.c

Native WebGPU Support

I have not done any testing with native distributions of WebGPU (WGPU Native / Dawn), though I have implemented Elie Michel's surface selector logic (sdl3webgpu.c) for when someone wants to give it a test.

Warning:
The preprocessor macros in WebGPU_INTERNAL_CreateSurface() don't seem to work properly, so I hard-coded a workaround, since I'm only testing on the web for the time being.

Existing Issue(s)

#10768

…gh all of the commits were the one I just rebased... Fixed everything back up.
…PU objects aren't being released via the bindings. Might be an actual bug with Emscripten's bindings specifically, need more info.

Working on a solution for uniform functions in SDL3. WebGPU BindGroups make this specific approach tough to handle. Assume uniform struct is stored at group 0 binding 0, contents should be 1 buffer FOR NOW.
…ere is no reason for them to mimic the Vulkan implementation. Added GPU API checklist. Next will be vertex and fragment uniform buffers.

Updated checklist
… crashes, but nothing renders properly. Need to investigate further.
…a bunch of existing bugs with the backend. Still encountering a layerCount issue that I cannot verify. My debugger says the texture and texture view both have 4 layers, but the error says that the texture's array layer count is 1.
…allows views of 1 layer for color attachments...
… pipelines. Now we create internal SDL pipelines and everything is handled nicely. 3D texture example still works.
…gate why the sampler isn't working in the Blit2DArray example.
…no longer needed outside of the frame. Minimizes heap resizing
… more static allocations now. Static allocations only occur on named object creation, and when dealing with PipelineLayouts. Planning on refactoring PipelineLayouts later.
… the emscripten keyboard event handlers when no hint was set.
… configure the surface. Elie Michel's surface configuration logic was added but the macros don't seem to want to work for me. I've added a temporary workaround since I am only testing Emscripten anyways.
slouken added this to the 3.4.0 milestone on Jan 21, 2025
Collaborator

slouken commented Jan 21, 2025

Congrats on the awesome progress!

Comment on lines 1411 to 1418
while (SDL_GetAtomicInt(&buffer->mappingComplete) != 1) {
    if (SDL_GetTicks() - startTime > TIMEOUT) {
        SDL_LogError(SDL_LOG_CATEGORY_GPU, "Failed to map buffer: timeout");
        return NULL;
    }

    SDL_Delay(1);
}
Contributor

This spin-wait is a huge red flag. Generally speaking browser async operations should not be implemented this way. I would be very concerned that this will break on certain targets since generally async stuff on the web is specified to not be observable until the event loop turns; if this happens to work it could break in the future and nobody would know what was going on.

At a minimum you should have a comment here that specifies why it's safe/appropriate to do this instead of doing something else (I don't know what else you'd do offhand), e.g. 'here's the part of the WebGPU spec that says this is legal and the spin should complete quickly' or 'I tested this on … and on … and …'.

Thankfully this appears to only apply to readback which makes it have less of an impact on the overall API; it might be that what you need to do is specify an async readback API extension to SDL_GPU and make that the only legal way to do readback on the WebGPU target.

Blocking the browser's main thread (for up to 1000ms in this case) is very bad. It causes all sorts of downstream problems.

Author

I'll throw some comments in! I'll also have to add some preprocessor macros to ensure that SDL_Delay(1) calls are specific to Emscripten. This is done since browser backends for WebGPU don't give access to device ticking, so we have to yield back to the browser for a tiny amount of time for the backend to tick the device for us.

Here's a quote from Elie Michel:

"When our C++ code runs in a Web browser (after being compiled to WebAssembly through emscripten), there is no explicit way to tick/poll the WebGPU device. This is because the device is managed by the Web browser itself, which decides at what pace polling should happen. As a result:

The device never ticks in between two consecutive lines of our WebAssembly module, it can only tick when the execution flow leaves the module.

The device always ticks between two calls to our MainLoop() function, because if you remember the Emscripten section of the Opening a Window chapter, we leave the main loop management to the browser and only provide a callback to run at each frame.

Thanks to the second point, we do not need wgpuPollEvents to do anything when called at the beginning or end of our main loop (so we set yieldToWebBrowser to false).

However, if what we intend is really to wait until something happens (e.g., a callback gets invoked), the first point requires us to make sure we yield back the execution flow to the Web browser, so that it may tick its device from time to time. We do this thanks to emscripten_sleep function, at the cost of effectively sleeping during 100 ms (we’re in a case where we want to wait anyways).

Note that using emscripten_sleep requires the -sASYNCIFY link option to be passed to emscripten, like we added already."

Collaborator

thatcosmonaut commented Jan 21, 2025

> specify an async readback API extension to SDL_GPU

We have an async readback API, it's the Download and QueryFence/WaitForFence functions. If the committee can't define their specification for this extremely common use case in a normal way like every single industry-standard API going back to D3D11 that is firmly their problem. I would rather force the webGPU backend to implement a hack to make it work our way than poison our API with something as stupid as an async buffer map call.

Contributor

Okay, so because you're relying on asyncify being set (I missed this, sorry! my bad) the sleep is not a spinwait but is instead a yield-to-browser-event-loop. That's much better.

Collaborator

Just so I'm not only being grumpy in this thread, here's a quick sketch of how this could possibly work:

A "fence" in the webGPU backend could just be defined as a group of resources that are waiting on async map operations. Then implementing QueryFence would be as simple as checking buffer->mappingComplete for each of these resources. WaitForFence could be implemented with the spinwait. That might be enough for this to work.
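A minimal sketch of that idea, assuming a fence is nothing more than a list of buffers with pending map operations. The `SketchBuffer` and `SketchMapFence` types and their functions are hypothetical and not part of the PR; C11 atomics stand in for `SDL_GetAtomicInt` so the snippet compiles on its own.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the backend's buffer; only the flag matters here. */
typedef struct SketchBuffer {
    atomic_int mappingComplete; /* set to 1 by the async map callback */
} SketchBuffer;

#define SKETCH_MAX_FENCE_BUFFERS 16

/* Hypothetical fence: the set of buffers whose map operations are in flight. */
typedef struct SketchMapFence {
    SketchBuffer *buffers[SKETCH_MAX_FENCE_BUFFERS];
    size_t count;
} SketchMapFence;

/* Adds a buffer to the fence. Returns false if the fence is full. */
static bool SketchMapFence_Track(SketchMapFence *fence, SketchBuffer *buffer)
{
    if (fence->count >= SKETCH_MAX_FENCE_BUFFERS) {
        return false;
    }
    fence->buffers[fence->count++] = buffer;
    return true;
}

/* QueryFence analogue: signaled only when every tracked map has completed. */
static bool SketchMapFence_Query(const SketchMapFence *fence)
{
    for (size_t i = 0; i < fence->count; i++) {
        if (atomic_load(&fence->buffers[i]->mappingComplete) != 1) {
            return false;
        }
    }
    return true;
}
```

A WaitForFence analogue would call `SketchMapFence_Query` in a loop and yield to the browser between checks, ideally with a timeout.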

Comment on lines 4483 to 4485
while (!renderer->device) {
    SDL_Delay(1);
}
Contributor

Is there a forward progress guarantee here? Please specify what provides the guarantee of forward progress. A naive reading of this suggests that it might never stop spinning since there's no timeout. It would be nice to at least see a timeout here and have it error out when the timeout expires.

It would be even better to not have this spin-wait. It's a red flag and doesn't seem like it should be necessary if everything is working correctly, it suggests that someone - not necessarily you, it could be the browser vendor or the user mode graphics driver - got something wrong.

Worst-case this spin wait could actually prevent forward progress if something important is waiting in the event loop queue.

Author

Instead of checking the device pointer itself, I can add some bool that gets toggled by the RequestDeviceCallback.

If the status received by the callback is anything but successful, then we say that it failed which would then terminate the quoted infinite loop.
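A rough sketch of what that could look like, combined with the timeout requested above. All names here are hypothetical: `Sketch_OnRequestDevice` only mimics the shape of a request-device callback, and the yield hook stands in for `SDL_Delay(1)` under Asyncify. The timeout is counted in yields rather than milliseconds to keep the snippet self-contained.

```c
#include <stdbool.h>

/* Hypothetical tri-state written by the request-device callback. */
typedef enum SketchDeviceState {
    SKETCH_DEVICE_PENDING,
    SKETCH_DEVICE_READY,
    SKETCH_DEVICE_FAILED
} SketchDeviceState;

typedef struct SketchRenderer {
    SketchDeviceState deviceState;
} SketchRenderer;

/* Stand-in for the request-device callback: any non-success status marks failure. */
static void Sketch_OnRequestDevice(bool success, void *userdata)
{
    SketchRenderer *renderer = (SketchRenderer *)userdata;
    renderer->deviceState = success ? SKETCH_DEVICE_READY : SKETCH_DEVICE_FAILED;
}

/* Hypothetical hook that hands control back to the browser event loop. */
typedef void (*SketchYieldFn)(void *userdata);

/* Placeholder yield that does nothing; a real backend would sleep here. */
static void Sketch_YieldNoop(void *userdata)
{
    (void)userdata;
}

/* Waits until the callback has fired, giving up after maxYields yields.
 * Returns true only if the device became ready. */
static bool Sketch_WaitForDevice(SketchRenderer *renderer, SketchYieldFn yield,
                                 void *userdata, unsigned maxYields)
{
    for (unsigned i = 0; renderer->deviceState == SKETCH_DEVICE_PENDING; i++) {
        if (i >= maxYields) {
            return false; /* timed out: the callback never fired */
        }
        yield(userdata);
    }
    return renderer->deviceState == SKETCH_DEVICE_READY;
}
```

With a failure state and a bounded number of yields, the loop ends in all three cases: success, a failed request, or a callback that never arrives.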

Author

See: 11d8ef7

@thatcosmonaut
Collaborator

Looks like there was a bad rebase, because some of the enum entries in gpu.c have been randomly deleted, etc. The includes need to be cleaned up too.

@klukaszek
Author

> Looks like there was a bad rebase because some of the enum entries gpu.c have been randomly deleted, etc. The includes need to be cleaned up too.

I reckon it was in here: 850caed

Collaborator

thatcosmonaut left a comment

I've left comments on all the obvious stuff I noticed for now.

I'll also note here that cycling hasn't been implemented for any resources.

Comment on lines +530 to +532
#ifdef __EMSCRIPTEN__
SDL_SetHint(SDL_HINT_GPU_DRIVER, "webgpu");
#endif
Collaborator

This isn't right, we shouldn't be depending on emscripten since webgpu can also have native implementations.

bool is_webgpu = SDL_strcasecmp(backend, "webgpu") == 0;

// WebGPU uses ~0u for default layer_or_depth_plane, however this causes issues with other backends
if (color_target_infos[i].layer_or_depth_plane == ~0u && !is_webgpu) {
Collaborator

We should be translating from SDL to WGPU, not the other way around. If the client passes in ~0u for the layer then that violates our spec.
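One way to read this: the client always passes a real layer or depth plane, and the backend decides where that value goes on the WebGPU side. The sketch below assumes that 3D textures take the value as the attachment's depth slice, while array textures take it as the view's base array layer and leave the depth slice undefined. The struct, the function, and the local constant are hypothetical; the constant is assumed to mirror WGPU_DEPTH_SLICE_UNDEFINED from webgpu.h and should be checked against the header in use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match WGPU_DEPTH_SLICE_UNDEFINED in webgpu.h; verify against your header. */
#define SKETCH_DEPTH_SLICE_UNDEFINED 0xFFFFFFFFu

/* Hypothetical result of translating SDL's layer_or_depth_plane. */
typedef struct SketchAttachmentSlice {
    uint32_t baseArrayLayer; /* goes into the texture view descriptor */
    uint32_t depthSlice;     /* goes into the color attachment */
} SketchAttachmentSlice;

/* Translates an SDL-side layer/depth-plane index into WebGPU-side values.
 * The ~0u sentinel is produced here, never accepted from the client. */
static SketchAttachmentSlice Sketch_TranslateLayer(bool is3D, uint32_t layerOrDepthPlane)
{
    SketchAttachmentSlice out;
    if (is3D) {
        out.baseArrayLayer = 0;
        out.depthSlice = layerOrDepthPlane;
    } else {
        out.baseArrayLayer = layerOrDepthPlane;
        out.depthSlice = SKETCH_DEPTH_SLICE_UNDEFINED;
    }
    return out;
}
```

Keeping the sentinel inside the backend means gpu.c needs no WebGPU special case, and validation of the client's value stays the same for every backend.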

Author

Link: c1d8428

Comment on lines 1416 to 1418
// Get hint to check for "webgpu"
const char *backend = SDL_GetHint(SDL_HINT_GPU_DRIVER);
bool is_webgpu = SDL_strcasecmp(backend, "webgpu") == 0;
Collaborator

We don't have to query hints to get the backend from gpu.c

@@ -18,6 +18,7 @@
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
*/
#include "../SDL_internal.h"
Collaborator

Incorrect #include

@@ -20,6 +20,7 @@
*/
#include "SDL_internal.h"
#include "SDL_sysgpu.h"
#include <SDL3/SDL_gpu.h>
Collaborator

Incorrect #include

.label = "SDL_GPU Command Encoder",
};

commandBuffer->commandEncoder = wgpuDeviceCreateCommandEncoder(renderer->device, &commandEncoderDesc);
Collaborator

Better to pool the command buffer structures than creating a new command encoder every frame.
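A minimal sketch of such a pool, assuming a fixed capacity and a simple free list. The types and function names are hypothetical and do not reflect how the other SDL_GPU backends are laid out. It covers only the wrapper structs and makes no claim about whether a command encoder itself can be reused.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_POOL_CAPACITY 8

/* Hypothetical wrapper; real backend state (encoder handle, etc.) would live here. */
typedef struct SketchCommandBuffer {
    bool inUse;
} SketchCommandBuffer;

/* Fixed-size pool with a stack of available entries. */
typedef struct SketchCommandBufferPool {
    SketchCommandBuffer storage[SKETCH_POOL_CAPACITY];
    SketchCommandBuffer *available[SKETCH_POOL_CAPACITY];
    size_t availableCount;
} SketchCommandBufferPool;

/* Marks every entry as free and pushes it onto the available stack. */
static void SketchPool_Init(SketchCommandBufferPool *pool)
{
    pool->availableCount = 0;
    for (size_t i = 0; i < SKETCH_POOL_CAPACITY; i++) {
        pool->storage[i].inUse = false;
        pool->available[pool->availableCount++] = &pool->storage[i];
    }
}

/* Pops a free command buffer, or returns NULL when the pool is exhausted. */
static SketchCommandBuffer *SketchPool_Acquire(SketchCommandBufferPool *pool)
{
    if (pool->availableCount == 0) {
        return NULL;
    }
    SketchCommandBuffer *cb = pool->available[--pool->availableCount];
    cb->inUse = true;
    return cb;
}

/* Returns a command buffer to the pool; ignores entries that are not in use. */
static void SketchPool_Release(SketchCommandBufferPool *pool, SketchCommandBuffer *cb)
{
    if (cb == NULL || !cb->inUse) {
        return;
    }
    cb->inUse = false;
    pool->available[pool->availableCount++] = cb;
}
```

AcquireCommandBuffer would call `SketchPool_Acquire`, and the submit path would call `SketchPool_Release` once the work has been handed to the queue.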

int width, height;
SDL_GetWindowSize(renderer->claimedWindows[0]->window, &width, &height);
commandBuffer->currentViewport = (WebGPUViewport){ 0, 0, width, height, 0.0, 1.0 };
commandBuffer->currentScissor = (WebGPURect){ 0, 0, width, height };
Collaborator

Why is this function touching windows? This should be done in BeginRenderPass.

Author

Moved to BeginRenderPass. I'll link commit once it's up.

Author

Link: 8d601ec

{
    // Just call Submit for WebGPU
    WebGPU_Submit(commandBuffer);
    // There are no fences in WebGPU, so we don't need to do anything here
Collaborator

Not having any kind of fence abstraction is going to break tons of applications.

It seems like there's some kind of pseudo-fence callback structure:
https://developer.mozilla.org/en-US/docs/Web/API/GPUQueue/onSubmittedWorkDone

Author

Just adding stuff here as notes for myself when I return:

In the C API, the function is defined as: wgpuQueueOnSubmittedWorkDone(WGPUQueue queue, WGPUQueueWorkDoneCallback callback, void *userdata).

Collaborator

Alright, then this can probably be implemented by just having a Fence struct as the userdata and then marking it as finished in the callback.
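A small sketch of that approach, assuming the callback receives a status and the userdata pointer. The status enum and fence struct below are hypothetical stand-ins so the snippet compiles without webgpu.h; in the backend the callback would be registered through wgpuQueueOnSubmittedWorkDone with the fence as userdata.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the work-done status enum from webgpu.h. */
typedef enum SketchWorkDoneStatus {
    SKETCH_WORK_DONE_SUCCESS,
    SKETCH_WORK_DONE_ERROR
} SketchWorkDoneStatus;

/* Hypothetical fence: flags flipped by the work-done callback. */
typedef struct SketchSubmitFence {
    atomic_int signaled; /* 1 once the callback has run */
    atomic_int failed;   /* 1 if the callback reported a non-success status */
} SketchSubmitFence;

static void SketchSubmitFence_Init(SketchSubmitFence *fence)
{
    atomic_init(&fence->signaled, 0);
    atomic_init(&fence->failed, 0);
}

/* Shaped like a work-done callback: (status, userdata). */
static void Sketch_OnSubmittedWorkDone(SketchWorkDoneStatus status, void *userdata)
{
    SketchSubmitFence *fence = (SketchSubmitFence *)userdata;
    if (status != SKETCH_WORK_DONE_SUCCESS) {
        atomic_store(&fence->failed, 1);
    }
    /* Signal even on failure so waiters never block forever. */
    atomic_store(&fence->signaled, 1);
}

/* QueryFence analogue. */
static bool SketchSubmitFence_IsSignaled(SketchSubmitFence *fence)
{
    return atomic_load(&fence->signaled) == 1;
}
```

SubmitAndAcquireFence would then create one of these, pass it as userdata, and return it, while ReleaseFence would free it.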

// Slightly altered, though with permission by Elie Michel:
// @ https://github.com/eliemichel/sdl3webgpu/blob/main/sdl3webgpu.c
// https://github.com/libsdl-org/SDL/issues/10768#issuecomment-2499532299
#if defined(SDL_PLATFORM_MACOS)
Collaborator

We shouldn't be touching platform code in the implementation like this. We'll probably need some kind of platform abstraction in SDL itself that can get a WGPU surface.


bool cycleBindGroups;

WebGPUUniformBuffer vertexUniformBuffers[MAX_UNIFORM_BUFFERS_PER_STAGE];
Collaborator

The pipeline should not own these, uniform buffers should be pooled.

@@ -0,0 +1,4602 @@
// File: /webgpu/SDL_gpu_webgpu.c
Collaborator

Please include the standard text from https://github.com/libsdl-org/SDL/blob/main/include/SDL3/SDL_copying.h and add any copyright attribution you'd like here.

Author

Link: a385d47

@@ -2090,6 +2106,11 @@ void WebGPU_BeginRenderPass(SDL_GPUCommandBuffer *commandBuffer,
return;
}

int width, height;
SDL_GetWindowSize(wgpu_cmd_buf->renderer->claimedWindows[0]->window, &width, &height);
Collaborator

This is still not right, the viewport and scissor should be set to the smallest size of bound render targets. Please reference how the other backends implemented this.

Author

Ah I get it now! I'll return to this after some rest I think.

I read up on the Vulkan implementation and will follow that one tomorrow.

Author

klukaszek commented Jan 22, 2025

Link: e24094d

It's still not 1-to-1 with the Vulkan backend but the viewport and scissor now use the smallest available size of all bound render targets.

It also now sets the other default states for the render pass.
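For readers following along, the "smallest size of all bound render targets" rule can be sketched as a component-wise minimum. The `SketchExtent` type and function are hypothetical, and the extents are assumed to have already been adjusted for the mip level each target is bound at.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical width/height pair for one bound render target. */
typedef struct SketchExtent {
    uint32_t width;
    uint32_t height;
} SketchExtent;

/* Returns the component-wise minimum extent across all targets,
 * or {0, 0} when no targets are bound. */
static SketchExtent Sketch_SmallestExtent(const SketchExtent *targets, size_t count)
{
    SketchExtent result = { 0, 0 };
    if (count == 0) {
        return result;
    }
    result = targets[0];
    for (size_t i = 1; i < count; i++) {
        if (targets[i].width < result.width) {
            result.width = targets[i].width;
        }
        if (targets[i].height < result.height) {
            result.height = targets[i].height;
        }
    }
    return result;
}
```

The default viewport and scissor would then both span (0, 0) to the returned extent, with the depth/stencil target included in the list when one is bound.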

// Note: Compiling SDL GPU programs using emscripten will require -sUSE_WEBGPU=1 -sASYNCIFY=1

#include "../SDL_sysgpu.h"
#include "SDL_internal.h"
Collaborator

SDL_internal.h needs to be the first include in the file. I usually throw it right after the standard blurb at the top so I don't forget.

Author

Link: d2fbc02


4 participants