Move MeshUniform allocation from the CPU to the GPU. #23662

Open
pcwalton wants to merge 15 commits into bevyengine:main from pcwalton:batch-slabs

@pcwalton pcwalton commented Apr 4, 2026

Contributor

The goal of GPU-driven rendering is to cache the entire scene graph on the GPU in a form that's efficient for rendering and, for objects that didn't change since the previous frame, to have zero CPU-side overhead. If the scene didn't change, the only CPU overhead should be proportional to the number of multi-draw indirect calls. PR #23481 eliminated the CPU loop over every mesh instance in rendering, which brought us closer to this ideal, but it didn't fully get us there, because there's still a CPU loop over every mesh. Although there are usually many fewer meshes than mesh instances in large scenes, this still represents a potential bottleneck on complex scenes and/or on lower-end hardware.

This CPU loop exists to allocate MeshUniforms, which are the data structures that the GPU transform-and-cull stage stores the post-transform data in. Unlike MeshInputUniforms, which are scattered throughout memory and allocated using a CPU-side free list, MeshUniforms are indexed by instance ID. Because of the way multi-draw indirect assigns instance IDs, all instances of a specific mesh must be adjacent to one another. This necessitates a global allocation pass that lays out MeshUniforms in memory such that all the instances of a specific mesh end up adjacent to one another. This operation is currently performed on the CPU in the MultidrawableBatchSetPreparer::prepare_multidrawable_binned_batch_set method and has overhead proportional to the number of separate meshes (not mesh instances) in each batch set.
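
To make the layout constraint concrete, here is a minimal sketch of the kind of sequential allocation described above, assuming each mesh in a batch set only needs a contiguous run of slots sized by its instance count. The function name `first_uniform_indices` is hypothetical and is not taken from Bevy; the real method does additional bookkeeping.

```rust
/// Hypothetical helper, not Bevy's API: given the instance count of each
/// mesh in a batch set, return the index of the first output slot for each
/// mesh so that all instances of a mesh end up adjacent to one another.
fn first_uniform_indices(instance_counts: &[u32]) -> Vec<u32> {
    // Running total of slots handed out so far (an exclusive prefix sum).
    let mut next_free = 0u32;
    instance_counts
        .iter()
        .map(|&count| {
            let first = next_free;
            next_free += count;
            first
        })
        .collect()
}
```

Each iteration depends on the previous one, so the cost grows with the number of meshes rather than the number of instances, which is the overhead this PR targets.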

This PR addresses the problem by moving the sequential loop in that method to the GPU. A new GPU phase known as the uniform allocation step has been added. This shader essentially performs a prefix sum in order to allocate the MeshUniforms corresponding to the batches within a batch set. This isn't the first prefix sum operation that we have in Bevy: PR #23036 added a prefix sum for light clustering. However, in order to scale better to tens of thousands of meshes in a single batch set (i.e. multi-draw command), the uniform allocation pass added in this PR uses the three-step scan and fan process rather than the two-step process that PR #23036 uses. The scan and fan algorithm works as follows:

  1. Local allocation: Perform a Hillis-Steele scan on chunks of size equal to the workgroup size (256, in this case), producing a prefix sum for each 256-element block. Write the final sum of each chunk to a fan buffer.

  2. Global allocation: Perform a Hillis-Steele scan on the fan buffer and write the results. Now each chunk can determine the running total leading into that chunk.

  3. Fan: For each chunk, add the running total leading into that chunk to every one of that chunk's elements.

Note that, if the number of meshes is lower than the workgroup size, we only need step (1) above and can skip steps (2) and (3). Because batch sets rarely contain over 256 meshes, this means that in real-world scenes we typically only need to run step (1).
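
The three steps above can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the shader from this PR: `scan_and_fan` and its `chunk_size` parameter are hypothetical names, the result is an inclusive prefix sum, the fan buffer is scanned in a single pass for brevity, and a snapshot copy stands in for the workgroup barrier between rounds.

```rust
/// In-place inclusive Hillis-Steele scan over one chunk.
fn hillis_steele_scan(chunk: &mut [u32]) {
    let mut stride = 1;
    while stride < chunk.len() {
        // On a GPU every lane does this round in parallel with a barrier
        // between rounds; the snapshot plays the role of that barrier.
        let prev = chunk.to_vec();
        for i in stride..chunk.len() {
            chunk[i] = prev[i] + prev[i - stride];
        }
        stride *= 2;
    }
}

/// Hypothetical CPU model of scan and fan. `chunk_size` must be nonzero;
/// it stands in for the workgroup size (256 in the PR).
fn scan_and_fan(counts: &[u32], chunk_size: usize) -> Vec<u32> {
    let mut out = counts.to_vec();

    // Step 1, local allocation: scan each chunk and record its final sum.
    let mut fan: Vec<u32> = Vec::new();
    for chunk in out.chunks_mut(chunk_size) {
        hillis_steele_scan(chunk);
        fan.push(*chunk.last().unwrap());
    }

    // Step 2, global allocation: scan the fan buffer so that entry `k`
    // holds the running total through the end of chunk `k`.
    hillis_steele_scan(&mut fan);

    // Step 3, fan: add the running total leading into each chunk to every
    // element of that chunk. The first chunk has nothing leading into it.
    for (chunk_index, chunk) in out.chunks_mut(chunk_size).enumerate().skip(1) {
        let carry = fan[chunk_index - 1];
        for value in chunk {
            *value += carry;
        }
    }
    out
}
```

When the input fits in one chunk, steps 2 and 3 do no useful work, which matches the note above about skipping them.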

This patch had to rework the RenderMultidrawableBatchSet structure added in PR #23481 in order to perform additional bookkeeping necessary to keep the time complexity of adding a mesh instance O(1). The proptest-based test suite has been updated and extended significantly to deal with this additional complexity.

For static meshes without skins and morph targets, this PR eliminates the last remaining per-mesh overhead in the render schedules, with the exceptions of (a) the full ECS table scans required for change detection and (b) the overhead of reuploading the various GPU buffers. Change indexes (PR #23519) address issue (a), and more use of SparseBufferVec (PR #23242) will address issue (b).

[prefix sum]: https://en.wikipedia.org/wiki/Prefix_sum

[Hillis-Steele scan]: https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel
@pcwalton pcwalton added the A-Rendering Drawing game state to the screen label Apr 4, 2026
@github-project-automation github-project-automation Bot moved this to Needs SME Triage in Rendering Apr 4, 2026
@pcwalton pcwalton added S-Needs-Review Needs reviewer attention (from anyone!) to move forward C-Performance A change motivated by improving speed, memory usage or compile times D-Complex Quite challenging from either a design or technical perspective. Ask for help! labels Apr 4, 2026
github-actions Bot commented Apr 9, 2026

Contributor

Your PR caused a change in the graphical output of an example or rendering test. This might be intentional, but it could also mean that something broke!
You can review it at https://pixel-eagle.com/project/B04F67C0-C054-4A6F-92EC-F599FEC2FD1D?filter=PR-23662

If it's expected, please add the M-Deliberate-Rendering-Change label.

If this change seems unrelated to your PR, you can consider updating your PR to target the latest main branch, either by rebasing or merging main into it.

@cart cart closed this May 5, 2026
@github-project-automation github-project-automation Bot moved this from Needs SME Triage to Done in Rendering May 5, 2026
@cart cart reopened this May 5, 2026
@github-project-automation github-project-automation Bot moved this from Done to Needs SME Triage in Rendering May 5, 2026
laundmo commented May 27, 2026

Member

AlephCubed found a performance regression with this PR compared to main: #24448 (comment)

@pcwalton

Contributor Author

That performance regression isn’t unexpected because that PR used one batch set per draw call. This isn’t how Bevy’s renderer is intended to be used; batch sets are always expensive no matter what. The bindless version of that PR is the “proper way” to do it and results in huge speedups, as noted there.

This PR should be ready for review again. I’d like to get it in relatively early for the 0.20 cycle. It won’t result in performance improvements on any of our current benchmarks, but it’s infrastructural work that will get harder the longer we wait. It brings us down to zero per-mesh CPU overhead as long as the meshes are in the same batch set, which is important for scaling.

}

// Save the sum coming out of this block for the next one.
sum = output_offsets[WORKGROUP_SIZE - 1u];

Member

does this need a barrier or is the comment above referring to not really needing it?

@pcwalton pcwalton Jun 24, 2026

Contributor Author

hillis_steele_scan() does a workgroup barrier after modifying output_offsets, so we don’t need another one here.

Contributor

I had the same question when I was going through this last night, might be good to add a comment to make it more obvious

Contributor Author

OK, I added a comment.

main_entity: MainEntity,
input_uniform_index: InputUniformIndex,
) {
if (input_uniform_index.0 as usize) >= self.mesh_input_uniform_index_to_entity.len() {

Contributor

Why index this by the global uniform index instead of keeping it dense? Seems like it'd balloon to the max index.

Contributor Author

By dense, do you mean sparse (that is, are you asking “why didn’t I use a hash table?”)? If so, input uniform indices are allocated sequentially via a free list, so the amount of wasted space should be low in practice.

@stuartparmenter stuartparmenter Jun 25, 2026

Contributor

Sorry, "dense" was the wrong word -- by dense I meant indexed by the local instance position (0..instance_count) like the old render_binned_mesh_instances_cpu — not a hashtable. The free list does keep the global buffer packed, but this Vec is per-batch-set and indexed by the global index, so it's sized to the max index in the set, not its instance count. That's only ~instance_count if the set's members happen to be contiguous — which breaks with many batch sets, or when visibility churn reshuffles the free list across sets, pushing each set's Vec toward the global high water mark.
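
A small sketch of the sizing difference being described, assuming a per-batch-set `Vec` that is either indexed directly by the global index or by the local position within the set. Both function names are hypothetical and only model the length of the container, not Bevy's data structures.

```rust
/// Length a per-set Vec needs when it is indexed directly by global index:
/// it must reach the largest index present in the set.
fn len_if_indexed_by_global_index(global_indices: &[u32]) -> usize {
    global_indices
        .iter()
        .max()
        .map_or(0, |&max_index| max_index as usize + 1)
}

/// Length the same Vec needs when it is indexed by local position
/// (0..instance_count) within the set.
fn len_if_indexed_by_local_position(global_indices: &[u32]) -> usize {
    global_indices.len()
}
```

For a set whose three members happen to hold global indices 5, 1000, and 70000, the first scheme needs 70001 entries and the second needs 3.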

Contributor Author

That’s a good point! I pushed a new commit that makes that change.

Comment thread crates/bevy_render/src/batching/gpu_preprocessing.rs
@pcwalton pcwalton requested a review from Aceeri June 24, 2026 18:49
@pcwalton pcwalton requested a review from stuartparmenter June 24, 2026 19:39
@kristoff3r

Contributor

While I was reviewing this I asked an AI to do a correctness check in the background, and it found a reproducible crash.

Adding this extra entity to the scene() function in 3d_scene:

(
    Mesh3d(asset_value(Cuboid::new(1.0, 1.0, 1.0)))
    MeshMaterial3d::<StandardMaterial>(asset_value(Color::srgb_u8(255, 100, 100)))
    Transform::from_xyz(2.0, 0.5, 0.0)
    NoAutomaticBatching
)

Gives the following backtrace on my machine (Linux/Vulkan):

thread 'Compute Task Pool (4)' (1056485) panicked at crates/bevy_render/src/render_resource/buffer_vec.rs:968:9:
assertion `left == right` failed
  left: 1
 right: 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'Compute Task Pool (4)' (1056485) panicked at crates/bevy_ecs/src/error/handler.rs:141:1:
Encountered an error in system `bevy_render::batching::gpu_preprocessing::batch_and_prepare_binned_render_phase<bevy_pbr::render::light::Shadow, bevy_pbr::render::mesh::MeshPipeline>`: System panicked

Encountered a panic in system `bevy_render::batching::gpu_preprocessing::batch_and_prepare_binned_render_phase<bevy_pbr::render::light::Shadow, bevy_pbr::render::mesh::MeshPipeline>`!

thread '<unnamed>' (1056596) panicked at crates/bevy_ecs/src/error/handler.rs:141:1:
Encountered an error in system `bevy_render::run_render_schedule`: Exclusive system panicked

Encountered a panic in system `bevy_render::run_render_schedule`!

I spent some time trying to understand the reason, but it's a bit complicated. Here is the unedited reason the AI gave me, which seems reasonable to me but of course might be wrong. I hope at least the crash itself is useful.

Merging cpu_metadata (RawBufferVec) and gpu_metadata (UninitBufferVec) into a
single metadata: PartialBufferVec imposes an ordering constraint the two separate buffers never had:
every CPU-initialized push must precede any uninitialized reservation — push_multiple_init asserts
uninit_element_count == 0.

That buffer lives in PhaseIndirectParametersBuffers, which is per-phase, shared across all views, and
cleared once per frame. batch_and_prepare_binned_render_phase processes each view as unbatchables
(allocate → push_multiple_init) → batchables (same) → multidrawables (push_multiple_uninit). So once
view N reserves uninit slots for its multidrawables, view N+1's first allocate calls push_multiple_init
with uninit_element_count > 0 and panics.

before processing multidrawables for any view.
@pcwalton

Contributor Author

@kristoff3r Good catch. The problem was that, although we build unbatchables and batchables before multidrawables for a single view, those buffers are in fact shared among all views. I changed the logic so that we prepare unbatchables and batchables for all views before preparing multidrawables for any view, which fixes the problem.
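
The ordering constraint and the fix can be modeled with a toy buffer. This is a sketch under assumptions, not Bevy's `PartialBufferVec`: `ToyPartialBuffer`, `push_init`, `push_uninit`, and the two `prepare_*` functions are hypothetical, and the toy returns an error where the real code asserts.

```rust
/// Toy stand-in for a buffer whose initialized elements must all be pushed
/// before any uninitialized reservation is made.
#[derive(Default)]
struct ToyPartialBuffer {
    init_element_count: usize,
    uninit_element_count: usize,
}

impl ToyPartialBuffer {
    /// Push CPU-initialized elements; fails once anything has been reserved.
    fn push_init(&mut self, count: usize) -> Result<usize, &'static str> {
        if self.uninit_element_count != 0 {
            return Err("initialized push after an uninitialized reservation");
        }
        let first = self.init_element_count;
        self.init_element_count += count;
        Ok(first)
    }

    /// Reserve uninitialized elements after everything pushed so far.
    fn push_uninit(&mut self, count: usize) -> usize {
        let first = self.init_element_count + self.uninit_element_count;
        self.uninit_element_count += count;
        first
    }
}

/// Per-view interleaving: fails as soon as a second view pushes
/// initialized elements into the shared buffer.
fn prepare_interleaved(view_count: usize) -> Result<(), &'static str> {
    let mut buffer = ToyPartialBuffer::default();
    for _ in 0..view_count {
        buffer.push_init(1)?; // unbatchables and batchables
        buffer.push_uninit(1); // multidrawables
    }
    Ok(())
}

/// Two passes over the views: all initialized pushes, then all reservations.
fn prepare_two_pass(view_count: usize) -> Result<(), &'static str> {
    let mut buffer = ToyPartialBuffer::default();
    for _ in 0..view_count {
        buffer.push_init(1)?;
    }
    for _ in 0..view_count {
        buffer.push_uninit(1);
    }
    Ok(())
}
```

With a single view both orders succeed, which is consistent with the crash only appearing once more than one view shares the buffer.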

Comment on lines +163 to +179
if (global_id < block_end) {
    output_offsets[local_id.x] = sum + fan_buffer[global_id];
}
workgroupBarrier();

// Perform the scan.
hillis_steele_scan(local_id.x);

// Write the value back.
// Note that we don't need a workgroup barrier here because
// `hillis_steele_scan` already did one.
if (global_id < block_end) {
    fan_buffer[global_id] = output_offsets[local_id.x];
}

// Save the sum coming out of this block for the next one.
sum = output_offsets[WORKGROUP_SIZE - 1u];

Contributor

This is wrong for more than 64k meshes, because it adds the sum before the scan. Before 64k the sum is always 0, which is why it works for "small" values.

I ran it with a small example locally which produced visible artifacts, and with these changes it seems identical with main.

I also think we technically need a barrier between the loop iterations?

Suggested change
- if (global_id < block_end) {
-     output_offsets[local_id.x] = sum + fan_buffer[global_id];
- }
- workgroupBarrier();
- // Perform the scan.
- hillis_steele_scan(local_id.x);
- // Write the value back.
- // Note that we don't need a workgroup barrier here because
- // `hillis_steele_scan` already did one.
- if (global_id < block_end) {
-     fan_buffer[global_id] = output_offsets[local_id.x];
- }
- // Save the sum coming out of this block for the next one.
- sum = output_offsets[WORKGROUP_SIZE - 1u];
+ if (global_id < block_end) {
+     output_offsets[local_id.x] = fan_buffer[global_id];
+ }
+ workgroupBarrier();
+ // Perform the scan.
+ hillis_steele_scan(local_id.x);
+ // Write the value back.
+ // Note that we don't need a workgroup barrier here because
+ // `hillis_steele_scan` already did one.
+ if (global_id < block_end) {
+     fan_buffer[global_id] = output_offsets[local_id.x] + sum;
+ }
+ // Save the sum coming out of this block for the next one.
+ sum += output_offsets[WORKGROUP_SIZE - 1u];
+ workgroupBarrier();

Member

The lack of barrier is fine, we have workgroupBarriers before and hillis_steele_scan should always end with one as well.

Contributor Author

Ah yes, you’re right on both counts. The workgroup barrier is necessary to avoid the write-after-read hazard and the sum has to be added in after the Hillis-Steele scan, not before, to avoid propagating the sum value to all the entries.
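
The difference between adding the carried sum before and after the scan can be reproduced on the CPU. This is a simplified sketch, not the shader: the function names are hypothetical, a sequential inclusive scan stands in for `hillis_steele_scan`, `block_size` must be nonzero, and the block's last element stands in for `output_offsets[WORKGROUP_SIZE - 1u]`.

```rust
/// Sequential inclusive scan standing in for the per-block GPU scan.
fn inclusive_scan(block: &mut [u32]) {
    for i in 1..block.len() {
        block[i] += block[i - 1];
    }
}

/// Original ordering: the carried sum is added to every element before the
/// scan, so the scan accumulates it once per element.
fn scan_blocks_carry_before(values: &[u32], block_size: usize) -> Vec<u32> {
    let mut out = values.to_vec();
    let mut sum = 0;
    for block in out.chunks_mut(block_size) {
        for value in block.iter_mut() {
            *value += sum;
        }
        inclusive_scan(block);
        sum = *block.last().unwrap();
    }
    out
}

/// Suggested ordering: scan first, add the carried sum afterwards, and
/// accumulate the block total into the carry.
fn scan_blocks_carry_after(values: &[u32], block_size: usize) -> Vec<u32> {
    let mut out = values.to_vec();
    let mut sum = 0;
    for block in out.chunks_mut(block_size) {
        inclusive_scan(block);
        let block_total = *block.last().unwrap();
        for value in block.iter_mut() {
            *value += sum;
        }
        sum += block_total;
    }
    out
}
```

With four ones and a block size of two, the first version yields 1, 2, 3, 6 and the second yields 1, 2, 3, 4. With a single block the carry is always zero and both agree, which matches the observation that the bug only shows up past the first block.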

@pcwalton

Contributor Author

New commit should fix the problem.

@pcwalton pcwalton requested a review from kristoff3r June 25, 2026 01:18
}

// Save the sum coming out of this block for the next one.
sum = output_offsets[WORKGROUP_SIZE - 1u];

Contributor

You overlooked a tiny part of the diff. Since sum is no longer part of output_offsets at this point in the loop, we need to do sum = sum + output_offsets[WORKGROUP_SIZE - 1u];, or equivalently just adding the last number to the current sum.

Suggested change
- sum = output_offsets[WORKGROUP_SIZE - 1u];
+ sum += output_offsets[WORKGROUP_SIZE - 1u];

Contributor Author

Oh, sorry, yes, you’re right. Fixed.

@pcwalton pcwalton requested a review from kristoff3r June 25, 2026 16:30

@Aceeri Aceeri left a comment

Member

Looks good! Some questions/nitpicks for some clarity.

Comment thread crates/bevy_pbr/src/render/gpu_preprocess.rs Outdated
// array<MeshInput>;
(0, storage_buffer_read_only::<MeshInputUniform>(false)),
// @group(0) @binding(1) var<storage> indirect_parameters_metadata:
// array<IndirectParametersMetadata>;

Member

I like the marking of group/bindings/type here. Need to start doing this in my own code.

pub first_output_mesh_uniform_index: u32,

/// Padding.
pub pad: [u32; 60],

Member

Why does this need to be 256 byte aligned? Can a uniform not be just 16 bytes? I'm probably missing something here.

Contributor Author

It depends on the platform. I forget which one demands 256 bytes—it’s either macOS or WebGPU. But it’s one of them.
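
For reference, WebGPU's default `minUniformBufferOffsetAlignment` limit is 256 bytes, which may be the requirement being recalled here. The sketch below only checks the arithmetic: `PaddedUniformSketch`, its three `hypothetical_*` fields, and `byte_offset_of_element` are assumptions for illustration, with `first_output_mesh_uniform_index` and `pad` taken from the snippet under review.

```rust
use std::mem::size_of;

/// Illustrative layout only. The three `hypothetical_*` fields are
/// placeholders chosen so that the struct totals 64 words; they are not
/// the fields of the real uniform.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct PaddedUniformSketch {
    pub hypothetical_a: u32,
    pub hypothetical_b: u32,
    pub hypothetical_c: u32,
    pub first_output_mesh_uniform_index: u32,
    /// Padding: 60 words bring the struct up to 256 bytes.
    pub pad: [u32; 60],
}

/// Assumed offset alignment in bytes (the WebGPU default limit).
pub const ASSUMED_OFFSET_ALIGNMENT: usize = 256;

/// Byte offset of the element at `index` in a tightly packed array.
pub fn byte_offset_of_element(index: usize) -> usize {
    index * size_of::<PaddedUniformSketch>()
}
```

Because the struct is exactly 256 bytes, every element of a packed array starts at an offset that satisfies a 256-byte offset alignment, so each one could be bound individually.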

pcwalton added 4 commits June 29, 2026 20:04
`MeshInputUniform` index as the key to index each entity in the
multidrawable batch set buffers.

As pointed out by @stuartparmenter (thanks!), using the
`MeshInputUniform` is wasteful of memory and needlessly complicated.

@stuartparmenter stuartparmenter left a comment

Contributor

Thanks for addressing comments, looks good to me!

@pcwalton pcwalton added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jun 30, 2026
Labels

A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times D-Complex Quite challenging from either a design or technical perspective. Ask for help! S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it

Projects

Status: Needs SME Triage

6 participants