Move MeshUniform allocation from the CPU to the GPU. #23662
The goal of GPU-driven rendering is to cache the entire scene graph on the GPU in a form that's efficient for rendering and, for objects that didn't change since the previous frame, to have zero CPU-side overhead. If the scene didn't change, the only CPU overhead should be proportional to the number of multi-draw indirect calls. PR bevyengine#23481 eliminated the CPU loop over every mesh *instance* in rendering, which brought us closer to this ideal, but it didn't fully get us there, because there's still a CPU loop over every *mesh*. Although there are usually far fewer meshes than mesh instances in large scenes, this still represents a potential bottleneck on complex scenes and/or on lower-end hardware.

This CPU loop exists to allocate `MeshUniform`s, which are the data structures in which the GPU transform-and-cull stage stores the post-transform data. Unlike `MeshInputUniform`s, which are scattered throughout memory and allocated using a CPU-side free list, `MeshUniform`s are indexed by *instance ID*. Because of the way multi-draw indirect assigns instance IDs, all instances of a specific mesh must be adjacent to one another. This necessitates a global allocation pass that lays out `MeshUniform`s in memory so that all the instances of a specific mesh end up contiguous. This operation is currently performed on the CPU in the `MultidrawableBatchSetPreparer::prepare_multidrawable_binned_batch_set` method and has overhead proportional to the number of separate meshes (not mesh *instances*) in each batch set.

This PR addresses the problem by moving the sequential loop in that method to the GPU. A new GPU phase known as the *uniform allocation* step has been added. This shader essentially performs a [prefix sum] in order to allocate the `MeshUniform`s corresponding to the batches within a batch set. This isn't the first prefix sum operation that we have in Bevy: PR bevyengine#23036 added a prefix sum for light clustering. However, in order to scale better to tens of thousands of meshes in a single batch set (i.e. multi-draw command), the uniform allocation pass added in this PR uses the three-step *scan and fan* process rather than the two-step process that PR bevyengine#23036 uses. The scan and fan algorithm works as follows:

1. *Local allocation*: Perform a [Hillis-Steele scan] on chunks of size equal to the workgroup size (256, in this case), producing a prefix sum for each 256-element block. Write the final sum of each chunk to a *fan buffer*.

2. *Global allocation*: Perform a Hillis-Steele scan on the fan buffer and write the results. Now each chunk can determine the running total leading into that chunk.

3. *Fan*: For each chunk, add the running total leading into that chunk to every one of that chunk's elements.

Note that, if the number of meshes is lower than the workgroup size, we only need step (1) above and can skip steps (2) and (3). Because batch sets rarely contain over 256 meshes, this means that in real-world scenes we typically only need to run step (1).

This patch had to rework the `RenderMultidrawableBatchSet` structure added in PR bevyengine#23481 in order to perform additional bookkeeping necessary to keep the time complexity of adding a mesh instance O(1). The `proptest`-based test suite has been updated and extended significantly to deal with this additional complexity.

For static meshes without skins and morph targets, this PR eliminates the last remaining per-mesh overhead in the render schedules, with the exceptions of (a) the full ECS table scans required for change detection and (b) the overhead of reuploading the various GPU buffers. Change indexes (PR bevyengine#23519) address issue (a), and more use of `SparseBufferVec` (PR bevyengine#23242) will address issue (b).

[prefix sum]: https://en.wikipedia.org/wiki/Prefix_sum
[Hillis-Steele scan]: https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel
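The three scan and fan steps can be sketched on the CPU in Rust. This is an illustration only, not the WGSL shader from this PR: the function names `hillis_steele_scan` and `scan_and_fan` are hypothetical here, each chunk stands in for one workgroup, and the sketch assumes the input is a per-mesh instance count and that an inclusive prefix sum is the desired output (the first output index of a mesh would then be its inclusive sum minus its own count).

```rust
/// Chunk size standing in for the workgroup size (256 per the description above).
const WORKGROUP_SIZE: usize = 256;

/// Inclusive Hillis-Steele scan over one chunk.
/// On a GPU each pass runs in parallel across the workgroup with a barrier
/// between passes; the `previous` snapshot plays the role of that barrier.
fn hillis_steele_scan(chunk: &mut [u32]) {
    let mut stride = 1;
    while stride < chunk.len() {
        let previous = chunk.to_vec();
        for i in stride..chunk.len() {
            chunk[i] = previous[i] + previous[i - stride];
        }
        stride *= 2;
    }
}

/// Returns the inclusive prefix sum of `counts` using scan and fan.
fn scan_and_fan(counts: &[u32]) -> Vec<u32> {
    let mut output = counts.to_vec();

    // Step 1, local allocation: scan each chunk and record its total.
    let mut fan_buffer = Vec::new();
    for chunk in output.chunks_mut(WORKGROUP_SIZE) {
        hillis_steele_scan(chunk);
        fan_buffer.push(*chunk.last().unwrap());
    }

    // With at most one chunk, steps 2 and 3 can be skipped.
    if fan_buffer.len() <= 1 {
        return output;
    }

    // Step 2, global allocation: scan the fan buffer block by block,
    // adding the carried sum only after each block has been scanned.
    let mut sum = 0;
    for block in fan_buffer.chunks_mut(WORKGROUP_SIZE) {
        hillis_steele_scan(block);
        for value in block.iter_mut() {
            *value += sum;
        }
        sum = *block.last().unwrap();
    }

    // Step 3, fan: add the running total leading into each chunk.
    for (chunk_index, chunk) in output.chunks_mut(WORKGROUP_SIZE).enumerate().skip(1) {
        let running_total = fan_buffer[chunk_index - 1];
        for value in chunk.iter_mut() {
            *value += running_total;
        }
    }
    output
}
```

Calling `scan_and_fan` on a slice of 256 or fewer counts exercises only step 1, which matches the common case described above.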
Your PR caused a change in the graphical output of an example or rendering test. This might be intentional, but it could also mean that something broke! If it's expected, please add the M-Deliberate-Rendering-Change label. If this change seems unrelated to your PR, you can consider updating your PR to target the latest main branch, either by rebasing or merging main into it.
AlephCubed found a performance regression with this PR compared to main: #24448 (comment)
That performance regression isn’t unexpected because that PR used one batch set per drawcall. This isn’t how Bevy’s renderer is intended to be used; batch sets are always expensive no matter what. The bindless version of that PR is the “proper way” to do it and results in huge speedups, as noted there.
This PR should be ready for review again. I’d like to get it in relatively early for the 0.20 cycle. It won’t result in performance improvements on any of our current benchmarks, but it’s infrastructural work that will get harder the longer we wait. It brings us down to zero per-mesh CPU overhead as long as the meshes are in the same batch set, which is important for scaling.
        }

        // Save the sum coming out of this block for the next one.
        sum = output_offsets[WORKGROUP_SIZE - 1u];
does this need a barrier or is the comment above referring to not really needing it?
hillis_steele_scan() does a workgroup barrier after modifying output_offsets, so we don’t need another one here.
I had the same question when I was going through this last night, might be good to add a comment to make it more obvious
OK, I added a comment.
        main_entity: MainEntity,
        input_uniform_index: InputUniformIndex,
    ) {
        if (input_uniform_index.0 as usize) >= self.mesh_input_uniform_index_to_entity.len() {
Why index this by the global uniform index instead of keeping it dense? Seems like it'd balloon to the max index.
By dense do you mean sparse (like do you mean “why didn’t I use a hash table”?) If so, input uniform indices are allocated sequentially via a free list, so the amount of wasted space should be low in practice.
Sorry, "dense" was the wrong word -- by dense I meant indexed by the local instance position (0..instance_count) like the old render_binned_mesh_instances_cpu — not a hashtable. The free list does keep the global buffer packed, but this Vec is per-batch-set and indexed by the global index, so it's sized to the max index in the set, not its instance count. That's only ~instance_count if the set's members happen to be contiguous — which breaks with many batch sets, or when visibility churn reshuffles the free list across sets, pushing each set's Vec toward the global high water mark.
That’s a good point! I pushed a new commit that makes that change.
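The sizing concern raised in this thread can be illustrated with a small Rust sketch. The types `GlobalIndexedSet` and `DenseSet` are hypothetical stand-ins, not the structures in this PR, and entities are represented as plain `u64` values for simplicity.

```rust
/// Hypothetical per-batch-set table keyed by the *global* input uniform index.
/// Its length grows to the largest index ever inserted, plus one.
#[derive(Default)]
struct GlobalIndexedSet {
    entities: Vec<Option<u64>>,
}

impl GlobalIndexedSet {
    fn insert(&mut self, input_uniform_index: u32, entity: u64) {
        let index = input_uniform_index as usize;
        if index >= self.entities.len() {
            self.entities.resize(index + 1, None);
        }
        self.entities[index] = Some(entity);
    }
}

/// Hypothetical per-batch-set table keyed by the *local* instance position
/// (0..instance_count). Its length equals the number of instances in the set.
#[derive(Default)]
struct DenseSet {
    entities: Vec<u64>,
}

impl DenseSet {
    /// Appends the entity and returns its local position within the set.
    fn insert(&mut self, entity: u64) -> u32 {
        self.entities.push(entity);
        (self.entities.len() - 1) as u32
    }
}
```

With two instances whose global indices happen to be 5 and 100,000, the globally indexed table holds 100,001 slots while the dense table holds two.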
While I was reviewing this I asked an AI to do a correctness check in the background, and it found a reproducible crash. Adding this extra entity to the

    (
        Mesh3d(asset_value(Cuboid::new(1.0, 1.0, 1.0)))
        MeshMaterial3d::<StandardMaterial>(asset_value(Color::srgb_u8(255, 100, 100)))
        Transform::from_xyz(2.0, 0.5, 0.0)
        NoAutomaticBatching
    )

gives the following backtrace on my machine (Linux/Vulkan):

I spent some time trying to understand the reason, but it's a bit complicated. Here is the unedited reason the AI gave me, which seems reasonable to me but of course might be wrong. I hope at least the crash itself is useful.
before processing multidrawables for any view.
@kristoff3r Good catch. The problem was that, although we build unbatchables and batchables before multidrawables for a single view, those buffers are in fact shared among all views. I changed the logic so that we prepare unbatchables and batchables for all views before preparing multidrawables for any view, which fixes the problem.
        if (global_id < block_end) {
            output_offsets[local_id.x] = sum + fan_buffer[global_id];
        }
        workgroupBarrier();

        // Perform the scan.
        hillis_steele_scan(local_id.x);

        // Write the value back.
        // Note that we don't need a workgroup barrier here because
        // `hillis_steele_scan` already did one.
        if (global_id < block_end) {
            fan_buffer[global_id] = output_offsets[local_id.x];
        }

        // Save the sum coming out of this block for the next one.
        sum = output_offsets[WORKGROUP_SIZE - 1u];
This is wrong for more than 64k meshes, because it adds the sum before the scan. Before 64k the sum is always 0, which is why it works for "small" values.
I ran it with a small example locally which produced visible artifacts, and with these changes it seems identical with main.
I also think we technically need a barrier between the loop iterations?
    -   if (global_id < block_end) {
    -       output_offsets[local_id.x] = sum + fan_buffer[global_id];
    -   }
    -   workgroupBarrier();
    -   // Perform the scan.
    -   hillis_steele_scan(local_id.x);
    -   // Write the value back.
    -   // Note that we don't need a workgroup barrier here because
    -   // `hillis_steele_scan` already did one.
    -   if (global_id < block_end) {
    -       fan_buffer[global_id] = output_offsets[local_id.x];
    -   }
    -   // Save the sum coming out of this block for the next one.
    -   sum = output_offsets[WORKGROUP_SIZE - 1u];
    +   if (global_id < block_end) {
    +       output_offsets[local_id.x] = fan_buffer[global_id];
    +   }
    +   workgroupBarrier();
    +   // Perform the scan.
    +   hillis_steele_scan(local_id.x);
    +   // Write the value back.
    +   // Note that we don't need a workgroup barrier here because
    +   // `hillis_steele_scan` already did one.
    +   if (global_id < block_end) {
    +       fan_buffer[global_id] = output_offsets[local_id.x] + sum;
    +   }
    +   // Save the sum coming out of this block for the next one.
    +   sum += output_offsets[WORKGROUP_SIZE - 1u];
    +   workgroupBarrier();
The lack of barrier is fine, we have workgroupBarriers before and hillis_steele_scan should always end with one as well.
Ah yes, you’re right on both counts. The workgroup barrier is necessary to avoid the write-after-read hazard and the sum has to be added in after the Hillis-Steele scan, not before, to avoid propagating the sum value to all the entries.
New commit should fix the problem.
        }

        // Save the sum coming out of this block for the next one.
        sum = output_offsets[WORKGROUP_SIZE - 1u];
You overlooked a tiny part of the diff. Since sum is no longer part of output_offsets at this point in the loop, we need to do sum = sum + output_offsets[WORKGROUP_SIZE - 1u];, or equivalently just adding the last number to the current sum.
    -   sum = output_offsets[WORKGROUP_SIZE - 1u];
    +   sum += output_offsets[WORKGROUP_SIZE - 1u];
Oh, sorry, yes, you’re right. Fixed.
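The two fixes discussed in this thread can be reproduced with a small CPU-side Rust sketch. It is not the shader code: `inclusive_scan` is a sequential stand-in for `hillis_steele_scan`, the function names `scan_blocks_buggy` and `scan_blocks_fixed` are hypothetical, and `block_size` (which must be nonzero) stands in for the workgroup size.

```rust
/// Sequential stand-in for the per-block Hillis-Steele scan.
fn inclusive_scan(block: &mut [u32]) {
    for i in 1..block.len() {
        block[i] += block[i - 1];
    }
}

/// Mirrors the original loop: the carried sum is added to every element
/// *before* the scan, so the scan accumulates it once per element.
fn scan_blocks_buggy(values: &mut [u32], block_size: usize) {
    let mut sum = 0;
    for block in values.chunks_mut(block_size) {
        for value in block.iter_mut() {
            *value += sum;
        }
        inclusive_scan(block);
        sum = *block.last().unwrap();
    }
}

/// Mirrors the suggested fix: scan first, add the carried sum when writing
/// back, and accumulate the block's own total into the carried sum.
fn scan_blocks_fixed(values: &mut [u32], block_size: usize) {
    let mut sum = 0;
    for block in values.chunks_mut(block_size) {
        inclusive_scan(block);
        // Total of this block alone, before the carried sum is applied.
        let block_total = *block.last().unwrap();
        for value in block.iter_mut() {
            *value += sum;
        }
        sum += block_total;
    }
}
```

For eight ones with a block size of four, the fixed version produces 1 through 8, while the buggy version produces 5, 10, 15, 20 for the second block. With a single block both versions agree, because the carried sum is still zero, which is consistent with the observation that the bug only shows up past 64k meshes.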
Aceeri
left a comment
Looks good! Some questions/nitpicks for some clarity.
    // array<MeshInput>;
    (0, storage_buffer_read_only::<MeshInputUniform>(false)),
    // @group(0) @binding(1) var<storage> indirect_parameters_metadata:
    // array<IndirectParametersMetadata>;
I like the marking of group/bindings/type here. Need to start doing this in my own code.
    pub first_output_mesh_uniform_index: u32,

    /// Padding.
    pub pad: [u32; 60],
Why does this need to be 256 byte aligned? Can a uniform not be just 16 bytes? I'm probably missing something here.
It depends on the platform. I forget which one demands 256 bytes—it’s either macOS or WebGPU. But it’s one of them.
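For context, WebGPU's default `minUniformBufferOffsetAlignment` limit is 256 bytes, which may be the requirement being recalled here. The arithmetic behind `pad: [u32; 60]` can be checked with a short Rust sketch. The struct below is a hypothetical stand-in: only `first_output_mesh_uniform_index` and `pad` appear in the diff, and the three other fields are assumed so that the payload matches the 16 bytes mentioned in the question.

```rust
use std::mem::size_of;

/// Alignment assumed for uniform buffer offsets in this sketch.
const UNIFORM_OFFSET_ALIGNMENT: usize = 256;

/// Hypothetical stand-in for the padded uniform under review.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct PaddedAllocationUniform {
    // Assumed placeholder fields; the real field names are not shown in the diff.
    pub hypothetical_field_0: u32,
    pub hypothetical_field_1: u32,
    pub hypothetical_field_2: u32,
    pub first_output_mesh_uniform_index: u32,
    /// Padding: 60 * 4 = 240 bytes, bringing 16 bytes of payload up to 256.
    pub pad: [u32; 60],
}

// Compile-time check that one element fills exactly one aligned slot.
const _: () = assert!(size_of::<PaddedAllocationUniform>() == UNIFORM_OFFSET_ALIGNMENT);

/// Byte offset of the `index`-th element in a tightly packed array of uniforms.
pub fn uniform_byte_offset(index: usize) -> usize {
    index * size_of::<PaddedAllocationUniform>()
}
```

Because each element is exactly 256 bytes, every element of a packed array starts at an offset that satisfies a 256-byte alignment requirement.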
`MeshInputUniform` index as the key to index each entity in the multidrawable batch set buffers. As pointed out by @stuartparmenter (thanks!), using the `MeshInputUniform` is wasteful of memory and needlessly complicated.
stuartparmenter
left a comment
Thanks for addressing comments, looks good to me!