Move `MeshUniform` allocation from the CPU to the GPU. by pcwalton · Pull Request #23662 · bevyengine/bevy

pcwalton · 2026-04-04T20:50:55Z

The goal of GPU-driven rendering is to cache the entire scene graph on the GPU in a form that's efficient for rendering and, for objects that didn't change since the previous frame, to have zero CPU-side overhead. If the scene didn't change, the only CPU overhead should be proportional to the number of multi-draw indirect calls. PR #23481 eliminated the CPU loop over every mesh instance in rendering, which brought us closer to this ideal, but it didn't fully get us there, because there's still a CPU loop over every mesh. Although there are usually many fewer meshes than mesh instances in large scenes, this still represents a potential bottleneck on complex scenes and/or on lower-end hardware.

This CPU loop exists to allocate MeshUniforms, which are the data structures that the GPU transform-and-cull stage stores the post-transform data in. Unlike MeshInputUniforms, which are scattered throughout memory and allocated using a CPU-side free list, MeshUniforms are indexed by instance ID. Because of the way multi-draw indirect assigns instance IDs, all instances of a specific mesh must be adjacent to one another. This necessitates a global allocation pass that lays out MeshUniforms in memory such that all the instances of a specific mesh end up adjacent to one another. This operation is currently performed on the CPU in the MultidrawableBatchSetPreparer::prepare_multidrawable_binned_batch_set method and has overhead proportional to the number of separate meshes (not mesh instances) in each batch set.
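To make the sequential loop concrete, here is a minimal CPU-side sketch of the allocation it performs. This is not Bevy's code: `Batch` and `allocate_mesh_uniforms` are hypothetical names, and the only assumption is that each batch (one mesh) needs a contiguous run of `MeshUniform` slots, one per instance.

```rust
/// Hypothetical stand-in for one batch (one mesh) within a batch set.
struct Batch {
    /// Number of instances of this mesh that need a `MeshUniform` slot.
    instance_count: u32,
}

/// Returns the index of the first `MeshUniform` slot for each batch.
///
/// Because all instances of a mesh must be adjacent, each batch starts where
/// the previous one ended. That is an exclusive prefix sum over the instance
/// counts, and the loop runs once per mesh, not once per instance.
fn allocate_mesh_uniforms(batches: &[Batch]) -> Vec<u32> {
    let mut next_free = 0u32;
    batches
        .iter()
        .map(|batch| {
            let first = next_free;
            next_free += batch.instance_count;
            first
        })
        .collect()
}
```

Each iteration depends on the running total from the previous one, which is why the work is sequential on the CPU and maps naturally onto a parallel prefix sum on the GPU.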

This PR addresses the problem by moving the sequential loop in that method to the GPU. A new GPU phase known as the uniform allocation step has been added. This shader essentially performs a prefix sum in order to allocate the MeshUniforms corresponding to the batches within a batch set. This isn't the first prefix sum operation that we have in Bevy: PR #23036 added a prefix sum for light clustering. However, in order to scale better to tens of thousands of meshes in a single batch set (i.e. multi-draw command), the uniform allocation pass added in this PR uses the three-step scan and fan process rather than the two-step process that PR #23036 uses. The scan and fan algorithm works as follows:

1. Local allocation: Perform a Hillis-Steele scan on chunks of size equal to the workgroup size (256, in this case), producing a prefix sum for each 256-element block. Write the final sum of each chunk to a fan buffer.
2. Global allocation: Perform a Hillis-Steele scan on the fan buffer and write the results. Now each chunk can determine the running total leading into that chunk.
3. Fan: For each chunk, add the running total leading into that chunk to every one of that chunk's elements.

Note that, if the number of meshes is lower than the workgroup size, we only need step (1) above and can skip steps (2) and (3). Because batch sets rarely contain over 256 meshes, this means that in real-world scenes we typically only need to run step (1).
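The three steps above can be modeled on the CPU roughly as follows. This is a sketch for illustration, not the shader in this PR: `scan_and_fan` is a hypothetical name, each pass of the inner loop stands in for one barrier-separated step of a workgroup, and the function returns an inclusive prefix sum.

```rust
const WORKGROUP_SIZE: usize = 256;

/// Inclusive Hillis-Steele scan over one chunk of at most WORKGROUP_SIZE
/// elements. The snapshot taken each pass models "all invocations read, then
/// a barrier, then all invocations write".
fn hillis_steele_scan(chunk: &mut [u32]) {
    let mut stride = 1;
    while stride < chunk.len() {
        let prev = chunk.to_vec();
        for i in stride..chunk.len() {
            chunk[i] = prev[i] + prev[i - stride];
        }
        stride *= 2;
    }
}

/// Inclusive prefix sum of `counts` using the scan and fan structure.
fn scan_and_fan(counts: &[u32]) -> Vec<u32> {
    let mut sums = counts.to_vec();

    // Step 1 (local allocation): scan each chunk independently and record
    // each chunk's total in the fan buffer.
    let mut fan_buffer = Vec::new();
    for chunk in sums.chunks_mut(WORKGROUP_SIZE) {
        hillis_steele_scan(chunk);
        fan_buffer.push(*chunk.last().unwrap());
    }

    // With at most one chunk, steps 2 and 3 can be skipped.
    if fan_buffer.len() <= 1 {
        return sums;
    }

    // Step 2 (global allocation): scan the fan buffer block by block,
    // carrying the running total from one block into the next.
    let mut carry = 0u32;
    for block in fan_buffer.chunks_mut(WORKGROUP_SIZE) {
        hillis_steele_scan(block);
        let block_total = *block.last().unwrap();
        for value in block.iter_mut() {
            *value += carry;
        }
        carry += block_total;
    }

    // Step 3 (fan): add the total of all preceding chunks to every element
    // of each chunk after the first.
    for (index, chunk) in sums.chunks_mut(WORKGROUP_SIZE).enumerate().skip(1) {
        let running_total = fan_buffer[index - 1];
        for value in chunk.iter_mut() {
            *value += running_total;
        }
    }
    sums
}
```

For example, `scan_and_fan(&[3, 1, 4])` takes the single-chunk early return and yields `[3, 4, 8]`; inputs longer than 256 elements exercise all three steps.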

This patch had to rework the RenderMultidrawableBatchSet structure added in PR #23481 in order to perform additional bookkeeping necessary to keep the time complexity of adding a mesh instance O(1). The proptest-based test suite has been updated and extended significantly to deal with this additional complexity.

For static meshes without skins and morph targets, this PR eliminates the last remaining per-mesh overhead in the render schedules, with the exceptions of (a) the full ECS table scans required for change detection and (b) the overhead of reuploading the various GPU buffers. Change indexes (PR #23519) address issue (a), and more use of SparseBufferVec (PR #23242) will address issue (b).


github-actions · 2026-04-09T00:33:19Z

Your PR caused a change in the graphical output of an example or rendering test. This might be intentional, but it could also mean that something broke!
You can review it at https://pixel-eagle.com/project/B04F67C0-C054-4A6F-92EC-F599FEC2FD1D?filter=PR-23662

If it's expected, please add the M-Deliberate-Rendering-Change label.

If this change seems unrelated to your PR, you can consider updating your PR to target the latest main branch, either by rebasing or merging main into it.

laundmo · 2026-05-27T18:03:08Z

AlephCubed found a performance regression with this PR compared to main: #24448 (comment)

pcwalton · 2026-06-23T00:17:16Z

That performance regression isn’t unexpected because that PR used one batch set per drawcall. This isn’t how Bevy’s renderer is intended to be used; batch sets are always expensive no matter what. The bindless version of that PR is the “proper way” to do it and results in huge speedups, as noted there.

This PR should be ready for review again. I’d like to get it in relatively early for the 0.20 cycle. It won’t result in performance improvements on any of our current benchmarks, but it’s infrastructural work that will get harder the longer we wait. It brings us down to zero per-mesh CPU overhead as long as the meshes are in the same batch set, which is important for scaling.

Aceeri · 2026-06-23T20:01:53Z

+        }
+
+        // Save the sum coming out of this block for the next one.
+        sum = output_offsets[WORKGROUP_SIZE - 1u];


does this need a barrier or is the comment above referring to not really needing it?

hillis_steele_scan() does a workgroup barrier after modifying output_offsets, so we don’t need another one here.

I had the same question when I was going through this last night, might be good to add a comment to make it more obvious

OK, I added a comment.

stuartparmenter · 2026-06-24T14:32:06Z

        main_entity: MainEntity,
        input_uniform_index: InputUniformIndex,
    ) {
+        if (input_uniform_index.0 as usize) >= self.mesh_input_uniform_index_to_entity.len() {


Why index this by the global uniform index instead of keeping it dense? Seems like it'd balloon to the max index.

By dense do you mean sparse (like do you mean “why didn’t I use a hash table”?) If so, input uniform indices are allocated sequentially via a free list, so the amount of wasted space should be low in practice.

Sorry, "dense" was the wrong word -- by dense I meant indexed by the local instance position (0..instance_count) like the old render_binned_mesh_instances_cpu — not a hashtable. The free list does keep the global buffer packed, but this Vec is per-batch-set and indexed by the global index, so it's sized to the max index in the set, not its instance count. That's only ~instance_count if the set's members happen to be contiguous — which breaks with many batch sets, or when visibility churn reshuffles the free list across sets, pushing each set's Vec toward the global high water mark.

That’s a good point! I pushed a new commit that makes that change.
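The size difference described above can be illustrated with a small sketch. The function names are hypothetical and this is not the code from the PR; it only compares how long a per-batch-set `Vec` must be under each keying scheme.

```rust
/// Length a per-batch-set table needs when it is indexed directly by the
/// global input uniform index: one slot for every index up to the maximum.
fn len_keyed_by_global_index(global_indices: &[u32]) -> usize {
    global_indices
        .iter()
        .max()
        .map_or(0, |&max_index| max_index as usize + 1)
}

/// Length the same table needs when it is indexed by each instance's local
/// position within the batch set (0..instance_count).
fn len_keyed_by_local_position(global_indices: &[u32]) -> usize {
    global_indices.len()
}
```

A batch set whose three instances happen to hold global indices 5, 12 and 90,000 needs 90,001 slots under the first scheme and 3 under the second.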

kristoff3r · 2026-06-24T20:22:57Z

While I was reviewing this I asked an AI to do a correctness check in the background, and it found a reproducible crash.

Adding this extra entity to the scene() function in 3d_scene:

(
    Mesh3d(asset_value(Cuboid::new(1.0, 1.0, 1.0)))
    MeshMaterial3d::<StandardMaterial>(asset_value(Color::srgb_u8(255, 100, 100)))
    Transform::from_xyz(2.0, 0.5, 0.0)
    NoAutomaticBatching
)

Gives the following backtrace on my machine (Linux/Vulkan):

thread 'Compute Task Pool (4)' (1056485) panicked at crates/bevy_render/src/render_resource/buffer_vec.rs:968:9:
assertion `left == right` failed
  left: 1
 right: 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'Compute Task Pool (4)' (1056485) panicked at crates/bevy_ecs/src/error/handler.rs:141:1:
Encountered an error in system `bevy_render::batching::gpu_preprocessing::batch_and_prepare_binned_render_phase<bevy_pbr::render::light::Shadow, bevy_pbr::render::mesh::MeshPipeline>`: System panicked

Encountered a panic in system `bevy_render::batching::gpu_preprocessing::batch_and_prepare_binned_render_phase<bevy_pbr::render::light::Shadow, bevy_pbr::render::mesh::MeshPipeline>`!

thread '<unnamed>' (1056596) panicked at crates/bevy_ecs/src/error/handler.rs:141:1:
Encountered an error in system `bevy_render::run_render_schedule`: Exclusive system panicked

Encountered a panic in system `bevy_render::run_render_schedule`!

I spent some time trying to understand the reason, but it's a bit complicated. Here is the unedited reason the AI gave me, which seems reasonable to me but of course might be wrong. I hope at least the crash itself is useful.

Merging cpu_metadata (RawBufferVec) and gpu_metadata (UninitBufferVec) into a
single metadata: PartialBufferVec imposes an ordering constraint the two separate buffers never had:
every CPU-initialized push must precede any uninitialized reservation — push_multiple_init asserts
uninit_element_count == 0.

That buffer lives in PhaseIndirectParametersBuffers, which is per-phase, shared across all views, and
cleared once per frame. batch_and_prepare_binned_render_phase processes each view as unbatchables
(allocate → push_multiple_init) → batchables (same) → multidrawables (push_multiple_uninit). So once
view N reserves uninit slots for its multidrawables, view N+1's first allocate calls push_multiple_init
with uninit_element_count > 0 and panics.

pcwalton · 2026-06-24T22:51:09Z

@kristoff3r Good catch. The problem was that, although we build unbatchables and batchables before multidrawables for a single view, those buffers are in fact shared among all views. I changed the logic so that we prepare unbatchables and batchables for all views before preparing multidrawables for any view, which fixes the problem.
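The ordering constraint and the fix can be modeled with a simplified sketch. Everything here is hypothetical: `PartialBufferModel` only imitates the constraint described above (initialized pushes must precede uninitialized reservations), it returns an error where the real code asserts, and it assumes that an empty group performs no push at all.

```rust
/// Simplified model of a buffer that stores CPU-initialized elements first,
/// followed by uninitialized reservations that the GPU fills in.
#[derive(Default)]
struct PartialBufferModel {
    init_element_count: usize,
    uninit_element_count: usize,
}

impl PartialBufferModel {
    /// Models an initialized push. Fails once any uninitialized slots exist.
    fn push_init(&mut self, count: usize) -> Result<(), &'static str> {
        if count == 0 {
            return Ok(()); // Assumption: empty groups do not touch the buffer.
        }
        if self.uninit_element_count != 0 {
            return Err("initialized push after an uninitialized reservation");
        }
        self.init_element_count += count;
        Ok(())
    }

    /// Models an uninitialized reservation.
    fn push_uninit(&mut self, count: usize) {
        self.uninit_element_count += count;
    }
}

/// Hypothetical per-view workload sizes.
struct View {
    unbatchables: usize,
    batchables: usize,
    multidrawables: usize,
}

/// Original order: finish each view completely before starting the next.
fn prepare_per_view(views: &[View]) -> Result<PartialBufferModel, &'static str> {
    let mut buffer = PartialBufferModel::default();
    for view in views {
        buffer.push_init(view.unbatchables)?;
        buffer.push_init(view.batchables)?;
        buffer.push_uninit(view.multidrawables);
    }
    Ok(buffer)
}

/// Fixed order: all initialized pushes for all views, then all reservations.
fn prepare_in_two_passes(views: &[View]) -> Result<PartialBufferModel, &'static str> {
    let mut buffer = PartialBufferModel::default();
    for view in views {
        buffer.push_init(view.unbatchables)?;
        buffer.push_init(view.batchables)?;
    }
    for view in views {
        buffer.push_uninit(view.multidrawables);
    }
    Ok(buffer)
}
```

With two views that each contain one unbatchable mesh and some multidrawables, `prepare_per_view` fails on the second view while `prepare_in_two_passes` succeeds, which mirrors why the extra `NoAutomaticBatching` entity was needed to trigger the crash.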

kristoff3r · 2026-06-24T23:05:35Z

+        if (global_id < block_end) {
+            output_offsets[local_id.x] = sum + fan_buffer[global_id];
+        }
+        workgroupBarrier();
+
+        // Perform the scan.
+        hillis_steele_scan(local_id.x);
+
+        // Write the value back.
+        // Note that we don't need a workgroup barrier here because
+        // `hillis_steele_scan` already did one.
+        if (global_id < block_end) {
+            fan_buffer[global_id] = output_offsets[local_id.x];
+        }
+
+        // Save the sum coming out of this block for the next one.
+        sum = output_offsets[WORKGROUP_SIZE - 1u];


This is wrong for more than 64k meshes, because it adds the sum before the scan. Before 64k the sum is always 0, which is why it works for "small" values.

I ran it with a small example locally which produced visible artifacts, and with these changes it seems identical with main.

I also think we technically need a barrier between the loop iterations?

Suggested change

-        if (global_id < block_end) {
-            output_offsets[local_id.x] = sum + fan_buffer[global_id];
-        }
-        workgroupBarrier();
-
-        // Perform the scan.
-        hillis_steele_scan(local_id.x);
-
-        // Write the value back.
-        // Note that we don't need a workgroup barrier here because
-        // `hillis_steele_scan` already did one.
-        if (global_id < block_end) {
-            fan_buffer[global_id] = output_offsets[local_id.x];
-        }
-
-        // Save the sum coming out of this block for the next one.
-        sum = output_offsets[WORKGROUP_SIZE - 1u];
+        if (global_id < block_end) {
+            output_offsets[local_id.x] = fan_buffer[global_id];
+        }
+        workgroupBarrier();
+
+        // Perform the scan.
+        hillis_steele_scan(local_id.x);
+
+        // Write the value back.
+        // Note that we don't need a workgroup barrier here because
+        // `hillis_steele_scan` already did one.
+        if (global_id < block_end) {
+            fan_buffer[global_id] = output_offsets[local_id.x] + sum;
+        }
+
+        // Save the sum coming out of this block for the next one.
+        sum += output_offsets[WORKGROUP_SIZE - 1u];
+        workgroupBarrier();

The lack of barrier is fine, we have workgroupBarriers before and hillis_steele_scan should always end with one as well.

Ah yes, you’re right on both counts. The workgroup barrier is necessary to avoid the write-after-read hazard and the sum has to be added in after the Hillis-Steele scan, not before, to avoid propagating the sum value to all the entries.
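The difference between the two orderings can be reproduced with a small CPU model. This is a sketch rather than the shader: a sequential `inclusive_scan` stands in for `hillis_steele_scan` (it produces the same result), the function names are hypothetical, and barrier hazards cannot be modeled this way, so only the arithmetic is shown.

```rust
const WORKGROUP_SIZE: usize = 256;

/// Sequential stand-in for the per-block Hillis-Steele scan.
fn inclusive_scan(block: &mut [u32]) {
    for i in 1..block.len() {
        block[i] += block[i - 1];
    }
}

/// Original logic: the carried sum is added to every element *before* the
/// scan, so the scan then accumulates it once per element.
fn global_scan_sum_before(fan_buffer: &mut [u32]) {
    let mut sum = 0;
    for block in fan_buffer.chunks_mut(WORKGROUP_SIZE) {
        for value in block.iter_mut() {
            *value += sum;
        }
        inclusive_scan(block);
        sum = *block.last().unwrap();
    }
}

/// Corrected logic: scan the raw block, then add the carried sum, and
/// accumulate the block's own total into the carry.
fn global_scan_sum_after(fan_buffer: &mut [u32]) {
    let mut sum = 0;
    for block in fan_buffer.chunks_mut(WORKGROUP_SIZE) {
        inclusive_scan(block);
        let block_total = *block.last().unwrap();
        for value in block.iter_mut() {
            *value += sum;
        }
        sum += block_total;
    }
}
```

On a fan buffer of 512 ones, both versions agree on the first block because the carry is still zero. In the second block the original version yields 514 at index 257, whereas the corrected version yields the expected 258.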

pcwalton · 2026-06-25T00:41:07Z

New commit should fix the problem.

kristoff3r · 2026-06-25T11:15:36Z

+        }
+
+        // Save the sum coming out of this block for the next one.
+        sum = output_offsets[WORKGROUP_SIZE - 1u];


You overlooked a tiny part of the diff. Since sum is no longer part of output_offsets at this point in the loop, we need to do sum = sum + output_offsets[WORKGROUP_SIZE - 1u];, or equivalently just adding the last number to the current sum.

Suggested change

-        sum = output_offsets[WORKGROUP_SIZE - 1u];
+        sum += output_offsets[WORKGROUP_SIZE - 1u];

Oh, sorry, yes, you’re right. Fixed.

Aceeri

Looks good! Some questions/nitpicks for some clarity.

Aceeri · 2026-06-29T12:53:03Z

+            // array<MeshInput>;
            (0, storage_buffer_read_only::<MeshInputUniform>(false)),
+            // @group(0) @binding(1) var<storage> indirect_parameters_metadata:
+            // array<IndirectParametersMetadata>;


I like the marking of group/bindings/type here. Need to start doing this in my own code.

Aceeri · 2026-06-29T13:05:28Z

+    pub first_output_mesh_uniform_index: u32,
+
+    /// Padding.
+    pub pad: [u32; 60],


Why does this need to be 256 byte aligned? Can a uniform not be just 16 bytes? I'm probably missing something here.

It depends on the platform. I forget which one demands 256 bytes—it’s either macOS or WebGPU. But it’s one of them.
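For context, WebGPU's default limit for `minUniformBufferOffsetAlignment` is 256 bytes, so elements of a uniform array that are bound at per-element offsets need a 256-byte stride. The sketch below shows how 60 `u32`s of padding produce that stride. Only `first_output_mesh_uniform_index` and `pad` come from the diff; the other three field names are placeholders, assumed only to add up to 16 bytes of real data as the question suggests.

```rust
/// Assumed alignment requirement for uniform buffer binding offsets.
const ASSUMED_OFFSET_ALIGNMENT: usize = 256;

/// Hypothetical layout: 16 bytes of data followed by 240 bytes of padding.
#[repr(C)]
#[derive(Clone, Copy)]
struct PaddedUniform {
    placeholder_a: u32, // Placeholder field, not from the PR.
    placeholder_b: u32, // Placeholder field, not from the PR.
    placeholder_c: u32, // Placeholder field, not from the PR.
    first_output_mesh_uniform_index: u32,
    /// Padding: 60 * 4 = 240 bytes, bringing the total to 256.
    pad: [u32; 60],
}

/// Byte offset of the element at `index` in a tightly packed array.
fn binding_offset(index: usize) -> usize {
    index * std::mem::size_of::<PaddedUniform>()
}

/// True if every element of an array of `count` items starts on the assumed
/// alignment boundary.
fn all_offsets_aligned(count: usize) -> bool {
    (0..count).all(|index| binding_offset(index) % ASSUMED_OFFSET_ALIGNMENT == 0)
}
```

Without the padding the stride would be 16 bytes, and every element after the first would start at an offset that is not a multiple of 256.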

stuartparmenter

Thanks for addressing comments, looks good to me!

pcwalton requested review from IceSentry, atlv24 and tychedelia April 4, 2026 20:51

pcwalton added the A-Rendering Drawing game state to the screen label Apr 4, 2026

github-project-automation Bot moved this to Needs SME Triage in Rendering Apr 4, 2026

github-project-automation Bot added this to Rendering Apr 4, 2026

pcwalton added S-Needs-Review Needs reviewer attention (from anyone!) to move forward C-Performance A change motivated by improving speed, memory usage or compile times D-Complex Quite challenging from either a design or technical perspective. Ask for help! labels Apr 4, 2026

pcwalton added 2 commits April 8, 2026 17:18

Merge remote-tracking branch 'origin/main' into batch-slabs

6822e6d

Doc check police

fb9b207

Doc check police again

fa25563

cart force-pushed the main branch from af894e5 to 017ffc5 Compare May 4, 2026 23:35

cart closed this May 5, 2026

github-project-automation Bot moved this from Needs SME Triage to Done in Rendering May 5, 2026

cart reopened this May 5, 2026

github-project-automation Bot moved this from Done to Needs SME Triage in Rendering May 5, 2026

laundmo mentioned this pull request May 27, 2026

0.19 Performance Regression #24448

Closed

Merge remote-tracking branch 'origin/main' into batch-slabs

d0b8f2b

Merge remote-tracking branch 'origin/main' into batch-slabs

bf4302f

Aceeri reviewed Jun 23, 2026

stuartparmenter reviewed Jun 24, 2026

pcwalton requested a review from Aceeri June 24, 2026 18:49

Add comment noting that a workgroup barrier isn't needed

b332040

pcwalton requested a review from stuartparmenter June 24, 2026 19:39

Merge remote-tracking branch 'origin/main' into batch-slabs

527a8c6

Avoid a crash by processing unbatchables and batchables for *all* views before processing multidrawables for any view.

6ab6ee1

kristoff3r reviewed Jun 24, 2026

Fix incorrect logic in allocate_global_scan.

43bfe79

pcwalton requested a review from kristoff3r June 25, 2026 01:18

kristoff3r reviewed Jun 25, 2026

Fix incorrect sum line

f6cf62e

pcwalton requested a review from kristoff3r June 25, 2026 16:30

Aceeri approved these changes Jun 29, 2026

pcwalton added 4 commits June 29, 2026 20:04

Merge remote-tracking branch 'origin/main' into batch-slabs

f557ea9

Address review comment

5bd16b9

Use the RenderBinnedMeshInstanceIndex instead of the `MeshInputUniform` index as the key to index each entity in the multidrawable batch set buffers. As pointed out by @stuartparmenter (thanks!), using the `MeshInputUniform` is wasteful of memory and needlessly complicated.

8ab0a3a

Fix obsolete comment

f0bb715

stuartparmenter approved these changes Jun 30, 2026

pcwalton added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jun 30, 2026
