Split ShardRam EC tree into a dedicated circuit #1369
Open
hero78119 wants to merge 3 commits into
…cuit # Conflicts: # ceno_zkvm/src/scheme/cpu/mod.rs
Problem
ShardRam currently mixes the leaf RAM/Poseidon work and the EC accumulation tree in one circuit. That keeps the large Poseidon-heavy leaf witness on a 2n domain even though the EC tree is the part that naturally needs the binary-tree 2n layout.
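The domain-size cost described above can be sketched with a small cell-count model. This is only an illustration: the column counts, the `link_cols` parameter, and the power-of-two padding rule are assumptions made for the example, not values taken from this PR.

```rust
// Illustrative only: column counts below are hypothetical, not taken from the PR.

/// Rows for a binary accumulation tree over `n` leaves:
/// `n` leaves plus `n - 1` internal nodes, padded up to a power of two.
fn tree_domain_rows(n: usize) -> usize {
    assert!(n > 0, "need at least one leaf");
    (2 * n - 1).next_power_of_two()
}

/// Witness cells when leaf and tree columns share the tree-sized domain.
fn combined_cells(n: usize, leaf_cols: usize, tree_cols: usize) -> usize {
    tree_domain_rows(n) * (leaf_cols + tree_cols)
}

/// Witness cells when the leaf stays on `n` rows and only the tree uses the
/// larger domain. `link_cols` models the record columns that both chips carry.
fn split_cells(n: usize, leaf_cols: usize, tree_cols: usize, link_cols: usize) -> usize {
    n * (leaf_cols + link_cols) + tree_domain_rows(n) * (tree_cols + link_cols)
}

fn main() {
    // Hypothetical shape: 1024 leaves, a wide leaf, a narrow tree.
    let (n, leaf_cols, tree_cols, link_cols) = (1 << 10, 300, 40, 14);
    let combined = combined_cells(n, leaf_cols, tree_cols);
    let split = split_cells(n, leaf_cols, tree_cols, link_cols);
    println!("combined: {combined} cells, split: {split} cells");
}
```

In this model the saving grows with the width of the leaf, which is why a Poseidon-heavy leaf is the natural part to keep off the 2n domain.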
This PR splits the EC tree into a separate `ShardRamEcTreeCircuit` and connects the leaf and EC-tree chips through a compact custom RAM record.
Design Rationale
Golden rules for chip splitting:
The split keeps the ShardRam leaf on an n-sized domain and moves only the EC tree into a dedicated 2n chip. The chip boundary is connected with `RAMType::Custom` records carrying the EC point `(x[0..7], y[0..7])`.
Soundness-sensitive points:
- `ShardRamEcTreeCircuit` consumes the same EC point records and proves the EC accumulation tree separately.
Trade-off: this reduces the cached raw ShardRam witness footprint, but introduces an extra chip and a custom read/write product argument. In the current Reth benchmark shape, the saved resident witness is not the dominant e2e bottleneck.
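The read/write link between the two chips can be pictured with a toy grand-product check. This is a sketch under stated assumptions, not the PR's implementation: the `EcPointRecord` struct, the 7-limb layout (reading `x[0..7]` as a half-open range), the BabyBear modulus, and the Horner fingerprint are all hypothetical choices for the example. In a real protocol `alpha` and `beta` would be verifier challenges rather than fixed constants.

```rust
// Toy model only: field choice, fingerprint, and challenge handling are assumptions.
const P: u64 = 2_013_265_921; // BabyBear prime, 2^31 - 2^27 + 1

/// Hypothetical record crossing the chip boundary: 7 limbs per coordinate.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct EcPointRecord {
    x: [u64; 7],
    y: [u64; 7],
}

/// Compress the 14 limbs into one field element with Horner's rule in `beta`.
fn fingerprint(rec: &EcPointRecord, beta: u64) -> u64 {
    rec.x
        .iter()
        .chain(rec.y.iter())
        .fold(0u64, |acc, &limb| (acc * (beta % P) + limb % P) % P)
}

/// Product of `(alpha - fingerprint(record))` over all records, mod P.
fn grand_product(records: &[EcPointRecord], alpha: u64, beta: u64) -> u64 {
    records.iter().fold(1u64, |acc, rec| {
        let term = (alpha % P + P - fingerprint(rec, beta)) % P;
        acc * term % P
    })
}

fn main() {
    let a = EcPointRecord { x: [0; 7], y: [0; 7] };
    let mut b = a;
    b.y[6] = 1;
    let (alpha, beta) = (5, 7);
    // The leaf chip "writes" the records; the EC-tree chip "reads" them in any order.
    let written = grand_product(&[a, b], alpha, beta);
    let read = grand_product(&[b, a], alpha, beta);
    println!("written = {written}, read = {read}");
}
```

The two products agree exactly when the multisets of records agree (up to the usual small soundness error for random challenges), which is the property the chip boundary relies on.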
Change Highlights
- `ceno_zkvm/src/tables/shard_ram.rs`
  - `ShardRamEcTreeCircuit` from `ShardRamCircuit`.
  - `ShardRamEcPoint + x + y`.
  - `config.perm_config.p3_cols[0].id` instead of the old hardcoded offset.
- `ceno_zkvm/src/instructions/gpu/chips/shard_ram.rs`
- `ceno-gpu/cpp/common/witgen/shard_ram_per_row.cuh`
Benchmark / Performance Impact
Operation
Block: `23587691`, full shards, `CENO_GPU_WITGEN=0`, `CENO_GPU_CACHE_LEVEL=1`, GPU enabled.
`-1.02x`, `-1.06x`, `1.01x`, `1.02x`
Structured metric note: the JSON `create_proof_of_shard_time_ms` sample was `2568ms` on master and `2542ms` on this PR, but the span log is the clearer full-shard comparison because it reports both shard spans.
Layer
- `-1.64x` scheduler reservation
- `1.83x` resident raw witness reduction
- `2.00x`
Detailed memtrack from this PR, shard 0:
Interpretation:
- `378 MiB -> 206.5 MiB`, roughly `171.5 MiB` saved.
- `72.52 MiB`) remains in `ShardRamEcTreeCircuit`.
- `cache=1`, the scheduler `resident=` estimate does not include retained raw witness device backing, so it should not be used alone to judge the saved Poseidon-column footprint.
Benchmark command(s):
Environment:
- `rustc 1.93.0-nightly (07bdbaedc 2025-11-19)`
- `feat/shardram_circuit`
- `8162e6f45a53226a93bbf05bd03fd9edb163d53d`
- `678910c71624ab69ea776a82f3ec99971cc3e6d9`
Raw data:
- `metrics_23587691_full_upstream_master_witgen0_cache1_h23_maxcell6_localcenogpu_gkrpath.json`
- `sanity_23587691_full_upstream_master_witgen0_cache1_h23_maxcell6_localcenogpu_gkrpath.log`
- `metrics_23587691_full_witgen0_cache1_shardram_split_current_gkrpatch_20260623.json`
- `sanity_23587691_full_witgen0_cache1_shardram_split_current_gkrpatch_20260623.log`
- `sanity_23587691_shard0_memtrack_shardram_split_current_20260623.log`
CI Benchmark Comparison: Reth Block 23817600
Comparison source:
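Both runs passed. No artifacts were published, so the metrics below were extracted from the GitHub Actions log archive `1_benchmark.txt`.

| Metric | Baseline | Feature | Change |
| --- | --- | --- | --- |
| | `feat/opt#1e2ff5b` | `feat/shardram_circuit#de5cc96c` | |
| | `feat/opt_sc_first_round#c61c925` | `feat/shardram_circuit#134576c4` | |
| | `23817600` | `23817600` | |
| | `13` | `13` | |
| `reth-block` | `72.7s` | `69.8s` | `1.04x` faster |
| `create_proof_of_shard` | `52.86s` | `50.03s` | `1.06x` faster |
| `app.verify` | `1.97s` | `1.96s` | `1.01x` faster |
| `generate_witness` | `40.03s` | `39.22s` | `1.02x` faster |
| | `24,899,592 bytes`, `23.75 MiB` | `24,976,657 bytes`, `23.82 MiB` | `-1.00x` larger |
| | `710` | `723` | `-1.02x`, exactly +1 per shard |

Per-shard `create_proof_of_shard` spans:

| Baseline | Feature | Change |
| --- | --- | --- |
| `4.64s` | `4.49s` | `1.03x` faster |
| `4.08s` | `4.06s` | `1.00x` faster |
| `4.30s` | `4.18s` | `1.03x` faster |
| `4.60s` | `4.39s` | `1.05x` faster |
| `4.57s` | `4.28s` | `1.07x` faster |
| `4.34s` | `4.23s` | `1.03x` faster |
| `4.20s` | `3.89s` | `1.08x` faster |
| `4.88s` | `4.68s` | `1.04x` faster |
| `4.21s` | `3.78s` | `1.11x` faster |
| `3.99s` | `3.64s` | `1.10x` faster |
| `4.13s` | `3.77s` | `1.10x` faster |
| `3.76s` | `3.42s` | `1.10x` faster |
| `1.16s` | `1.22s` | `-1.05x` slower |

Interpretation: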
- `reth-block` improves from `72.7s` to `69.8s`, or `1.04x` faster.
- `create_proof_of_shard` spans are `1.06x` faster, almost matching the `reth-block` gain.
- `ShardRamEcTreeCircuit`.
- `ShardRamCircuit estimated=170.52MB`; feature has `ShardRamCircuit estimated=104.39MB` plus `ShardRamEcTreeCircuit estimated=168.52MB`. The ShardRam leaf got smaller, while the new EC-tree chip adds work.
Testing
Outcomes:
- `keccak_syscall` e2e: passed.
- `keccak_syscall` e2e: passed.
- `exit code 0. Success.`
Risks and Rollout
- `cache=1, witgen=0` shape. The extra chip and custom product argument can offset the raw witness saving.
Follow-ups (optional)
Copilot Reviewer Directive (keep this section)
When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.