Split ShardRam EC tree into a dedicated circuit by hero78119 · Pull Request #1369 · scroll-tech/ceno

hero78119 · 2026-06-23T12:43:53Z

Problem

ShardRam currently mixes the leaf RAM/Poseidon work and the EC accumulation tree in one circuit. As a result, the large, Poseidon-heavy leaf witness is laid out on a 2n-row domain even though only the EC tree needs the binary-tree 2n layout.

This PR splits the EC tree into a separate ShardRamEcTreeCircuit and connects the leaf and EC-tree chips through a compact custom RAM record.

Design Rationale

Golden rules for chip splitting:

- Best case: trade a smaller lookup domain, especially across a power-of-two boundary, for an extra product read/write argument. This is the highest-value split because it can reduce the dominant lookup proving work.
- Second-best case: trade smaller resident memory for an extra product read/write argument. This is still useful when cached witness/device residency is the bottleneck, but it may not improve e2e time if the added chip/product work dominates.
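The effect of a power-of-two boundary can be illustrated with a small sketch. It assumes every chip is padded to the next power of two and that the EC tree is a full binary tree over n leaves; `padded_rows` and `tree_nodes` are hypothetical helper names, not functions from the repository.

```rust
/// Rows a chip occupies after padding its active rows to a power of two.
/// (Assumption for this sketch: every chip is padded this way.)
fn padded_rows(active_rows: usize) -> usize {
    active_rows.next_power_of_two()
}

/// Node count of a full binary tree over `leaves` leaves (leaves + internal nodes).
fn tree_nodes(leaves: usize) -> usize {
    2 * leaves - 1
}

fn main() {
    let n = 131_072; // leaf row count reported for the split ShardRamCircuit

    // Single chip: the leaf columns ride along on the tree's 2n-row domain.
    let combined = padded_rows(tree_nodes(n));

    // Split: the leaf chip only needs n rows, the tree chip keeps 2n.
    let leaf = padded_rows(n);
    let tree = padded_rows(tree_nodes(n));

    println!("combined={combined} leaf={leaf} tree={tree}");
}
```

Under these assumptions the leaf chip drops from 262144 to 131072 rows, while the tree chip stays at 262144 rows but carries far fewer columns.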

The split keeps the ShardRam leaf on an n-sized domain and moves only the EC tree into a dedicated 2n chip. The chip boundary is connected with RAMType::Custom records carrying the EC point (x[0..7], y[0..7]).

Soundness-sensitive points:

- The ShardRam leaf still computes/binds the Poseidon-derived x-coordinate and y-sign/range constraints.
- ShardRamEcTreeCircuit consumes the same EC point records and proves the EC accumulation tree separately.
- The custom bridge rows are checked in tests so that active leaf reads/writes match EC-tree writes/reads.
- Padding/custom rows use neutral values where required; the new RAM custom read/write padding value is one, the multiplicative identity, so padded rows leave the product argument unchanged.
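The bridge check and the role of the padding value can be sketched with a toy product argument. This is not the repository's implementation: the modulus `P`, the `EcPointRecord` struct, the `point` helper, and the fixed `alpha`/`gamma` values are all assumptions made for illustration. A real argument would use the proof system's field and verifier-sampled challenges.

```rust
/// Toy prime modulus (2^31 - 1). Assumption: stands in for the real proof field.
const P: u64 = 2_147_483_647;

/// Hypothetical bridge record: an EC point as 8 limbs per coordinate,
/// mirroring the x[0..7], y[0..7] layout described above.
#[derive(Clone, Copy)]
struct EcPointRecord {
    x: [u32; 8],
    y: [u32; 8],
}

/// Hypothetical helper: a record with only the lowest limb of x and y set.
fn point(x0: u32, y0: u32) -> EcPointRecord {
    let mut rec = EcPointRecord { x: [0; 8], y: [0; 8] };
    rec.x[0] = x0;
    rec.y[0] = y0;
    rec
}

/// Fold a record into one field element: alpha + sum_i limb_i * gamma^i (mod P).
fn fingerprint(rec: &EcPointRecord, alpha: u64, gamma: u64) -> u64 {
    let mut acc = alpha % P;
    let mut pow = 1u64;
    for limb in rec.x.iter().chain(rec.y.iter()) {
        acc = (acc + (*limb as u64 % P) * pow) % P;
        pow = pow * (gamma % P) % P;
    }
    acc
}

/// Product over all rows; `None` marks a padding row, which contributes 1.
fn row_product(rows: &[Option<EcPointRecord>], alpha: u64, gamma: u64) -> u64 {
    rows.iter().fold(1u64, |prod, row| {
        let factor = match row {
            Some(rec) => fingerprint(rec, alpha, gamma),
            None => 1, // neutral padding value
        };
        prod * factor % P
    })
}

fn main() {
    let (alpha, gamma) = (5, 7); // fixed here; random challenges in a real argument
    let (a, b) = (point(1, 0), point(2, 0));

    // Leaf side writes two records plus padding; tree side reads them in another order.
    let leaf_writes = [Some(a), Some(b), None, None];
    let tree_reads = [Some(b), Some(a)];

    let lhs = row_product(&leaf_writes, alpha, gamma);
    let rhs = row_product(&tree_reads, alpha, gamma);
    println!("leaf product = {lhs}, tree product = {rhs}, match = {}", lhs == rhs);
}
```

Because padding rows multiply by one, the two sides can have different padded lengths and still agree whenever the active records form the same multiset.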

Trade-off: this reduces cached raw ShardRam witness footprint, but introduces an extra chip and custom read/write product argument. In the current Reth benchmark shape, the saved resident witness is not the dominant e2e bottleneck.

Change Highlights

ceno_zkvm/src/tables/shard_ram.rs
- Split ShardRamEcTreeCircuit from ShardRamCircuit.
- Compact the custom bridge record to ShardRamEcPoint + x + y.
- Remove duplicated RAM/Poseidon fields from EC-tree.
- Fix CPU Poseidon witness assignment to use config.perm_config.p3_cols[0].id instead of the old hardcoded offset.
- Add focused selector/padding/custom-record tests.
ceno_zkvm/src/instructions/gpu/chips/shard_ram.rs
- Update GPU column maps for the split leaf and EC-tree layouts.
ceno-gpu/cpp/common/witgen/shard_ram_per_row.cuh
- Update EC-tree witness generation to write only x/y plus structural selector data.

Benchmark / Performance Impact

Operation

Block: 23587691, full shards, CENO_GPU_WITGEN=0, CENO_GPU_CACHE_LEVEL=1, GPU enabled.

Operation	master (s)	this PR (s)	Ratio (master -> this PR; negative = this PR slower)
reth-block	14.153	14.440	`-1.02x`
create_proof_of_shard, shard 0 span	4.240	4.480	`-1.06x`
create_proof_of_shard, shard 1 span	2.570	2.540	`1.01x`
app.verify	0.266	0.261	`1.02x`

Structured metric note: the JSON create_proof_of_shard_time_ms sample was 2568 ms on master and 2542 ms on this PR, which matches the shard 1 span only. The span log is the clearer full-shard comparison because it reports both shard spans.
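The signed ratios in the table can be reproduced with a short sketch. The convention is inferred from the reported values rather than stated by the benchmark tooling: a positive ratio is master / PR when the PR is faster, and a negative ratio is PR / master when the PR is slower. `signed_ratio` and `format_ratio` are hypothetical names.

```rust
/// Signed ratio in the style the tables appear to use:
/// positive = this PR is faster/smaller, negative = slower/larger.
fn signed_ratio(master: f64, pr: f64) -> f64 {
    if pr <= master {
        master / pr
    } else {
        -(pr / master)
    }
}

/// Render the ratio with two decimals and an `x` suffix, as in the tables.
fn format_ratio(master: f64, pr: f64) -> String {
    format!("{:.2}x", signed_ratio(master, pr))
}

fn main() {
    // (operation, master seconds, this-PR seconds) from the Operation table.
    let rows = [
        ("reth-block", 14.153, 14.440),
        ("create_proof_of_shard, shard 0 span", 4.240, 4.480),
        ("create_proof_of_shard, shard 1 span", 2.570, 2.540),
        ("app.verify", 0.266, 0.261),
    ];
    for (name, master, pr) in rows {
        println!("{name}: {}", format_ratio(master, pr));
    }
}
```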

Layer

Layer / Memory item	master	this PR	Ratio (master -> this PR; negative = this PR larger)
ShardRam scheduled proof reservation	92.00 MiB	61.50 MiB leaf + 89.52 MiB EC-tree	`-1.64x` scheduler reservation
ShardRam raw cached witness estimate	~378 MiB	~206.5 MiB	`1.83x` resident raw witness reduction
ShardRam leaf rows	262144	131072	`2.00x`
ShardRam leaf witness columns	~378	371	leaf no longer includes EC slope/tree columns
ShardRamEcTree rows	included in baseline ShardRam 2n layout	262144	moved to separate chip

Detailed memtrack from this PR, shard 0:

Circuit	rows	witness columns	structural columns	resident	main witness	tower prove	ecc	total scheduler estimate
ShardRamCircuit	131072	371	3	1.50 MiB	16.00 MiB	45.89 MiB	0.00 MiB	61.50 MiB
ShardRamEcTreeCircuit	262144	21	7	7.00 MiB	8.00 MiB	19.89 MiB	72.52 MiB	89.52 MiB

Interpretation:

- The intended resident raw-witness reduction is present: about 378 MiB -> 206.5 MiB, roughly 171.5 MiB saved.
- The current e2e time does not improve because scheduler proof reservation is now split into two chips (61.50 MiB + 89.52 MiB = 151.02 MiB, versus 92.00 MiB on master), and the EC quark allocation (72.52 MiB) remains in ShardRamEcTreeCircuit.
- With cache=1, the scheduler resident= estimate does not include retained raw witness device backing, so it should not be used alone to judge the saved Poseidon-column footprint.
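The raw-witness estimates are consistent with a simple rows x columns x cell-size calculation. The sketch below assumes 4 bytes per witness cell, which is not stated in the PR but reproduces the reported figures; `raw_witness_mib` is a hypothetical helper, and the master column count (~378) is approximate in the source.

```rust
const MIB: f64 = 1024.0 * 1024.0;
// Assumption: each witness cell is a 4-byte field element.
const BYTES_PER_CELL: u64 = 4;

/// Raw witness size in MiB for a chip with `rows` rows and `cols` witness columns.
fn raw_witness_mib(rows: u64, cols: u64) -> f64 {
    (rows * cols * BYTES_PER_CELL) as f64 / MIB
}

fn main() {
    let master = raw_witness_mib(262_144, 378); // single ShardRam chip, ~378 columns
    let leaf = raw_witness_mib(131_072, 371); // ShardRamCircuit after the split
    let tree = raw_witness_mib(262_144, 21); // ShardRamEcTreeCircuit
    let split = leaf + tree;
    println!("master = {master} MiB");
    println!("split  = {leaf} + {tree} = {split} MiB");
    println!("saved  = {} MiB", master - split);
}
```

With these inputs almost all of the saving comes from halving the row count of the wide leaf chip; the EC-tree chip adds back only 21 columns at 2n rows.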

Benchmark command(s):

# master baseline and this PR used the same Reth shape:
CENO_GPU_WITGEN=0 \
CENO_CONCURRENT_CHIP_PROVING=1 \
CENO_GPU_CACHE_LEVEL=1 \
CENO_GPU_JAGGED_RESHAPE_LOG_HEIGHT=23 \
CENO_MAX_CELL_PER_SHARD=805306368 \
CENO_GPU_MEM_TRACKING=0 \
CENO_GPU_LARGE_TASK_BOOKING_MARGIN_MB=0 \
OUTPUT_PATH=<metrics-json> \
RUST_LOG=info \
cargo run --features 'jemalloc,gpu' --bin ceno-reth-benchmark-bin --release \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_emul.path="../ceno/ceno_emul"' \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_host.path="../ceno/ceno_host"' \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_zkvm.path="../ceno/ceno_zkvm"' \
  --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".ceno_gpu.path="../ceno-gpu/cuda_hal"' \
  -- \
  --block-number 23587691 \
  --chain-id 1 \
  --cache-dir block_data \
  --mode prove-app \
  --app-proofs ../ceno/app_proof.bitcode

# Extra diagnostic run for this PR, shard 0 only:
CENO_GPU_MEM_TRACKING=1 ... --shard-id 0

Environment:

CPU: AMD Ryzen 9 5900XT 16-Core Processor, 32 logical CPUs
GPU: NVIDIA GeForce RTX 5070 Ti, 16303 MiB, driver 570.172.08
Rust: rustc 1.93.0-nightly (07bdbaedc 2025-11-19)
Branch: feat/shardram_circuit
This PR commit tested locally: 8162e6f45a53226a93bbf05bd03fd9edb163d53d
Baseline master commit tested locally: 678910c71624ab69ea776a82f3ec99971cc3e6d9

Raw data:

master:
- metrics_23587691_full_upstream_master_witgen0_cache1_h23_maxcell6_localcenogpu_gkrpath.json
- sanity_23587691_full_upstream_master_witgen0_cache1_h23_maxcell6_localcenogpu_gkrpath.log
this PR:
- metrics_23587691_full_witgen0_cache1_shardram_split_current_gkrpatch_20260623.json
- sanity_23587691_full_witgen0_cache1_shardram_split_current_gkrpatch_20260623.log
- memtrack diagnostic: sanity_23587691_shard0_memtrack_shardram_split_current_20260623.log

CI Benchmark Comparison: Reth Block 23817600

Comparison source:

Feature run: https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/28036220554
Baseline run: https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/27939140577

Both runs passed. No artifacts were published, so the metrics below were extracted from the GitHub Actions log archive 1_benchmark.txt.

Metric	Baseline	Feature	Ratio
Ceno ref	`feat/opt#1e2ff5b`	`feat/shardram_circuit#de5cc96c`	changed
Ceno-GPU ref	`feat/opt_sc_first_round#c61c925`	`feat/shardram_circuit#134576c4`	changed
Block	`23817600`	`23817600`	same
Shards	`13`	`13`	same
`reth-block`	`72.7s`	`69.8s`	`1.04x` faster
sum `create_proof_of_shard`	`52.86s`	`50.03s`	`1.06x` faster
`app.verify`	`1.97s`	`1.96s`	`1.01x` faster
sum `generate_witness`	`40.03s`	`39.22s`	`1.02x` faster
proof size	`24,899,592 bytes`, `23.75 MiB`	`24,976,657 bytes`, `23.82 MiB`	`-1.00x` larger
total verifier chip groups	`710`	`723`	`-1.02x`, exactly +1 per shard

Per-shard create_proof_of_shard spans:

Shard	Baseline	Feature	Ratio
0	`4.64s`	`4.49s`	`1.03x` faster
1	`4.08s`	`4.06s`	`1.00x` faster
2	`4.30s`	`4.18s`	`1.03x` faster
3	`4.60s`	`4.39s`	`1.05x` faster
4	`4.57s`	`4.28s`	`1.07x` faster
5	`4.34s`	`4.23s`	`1.03x` faster
6	`4.20s`	`3.89s`	`1.08x` faster
7	`4.88s`	`4.68s`	`1.04x` faster
8	`4.21s`	`3.78s`	`1.11x` faster
9	`3.99s`	`3.64s`	`1.10x` faster
10	`4.13s`	`3.77s`	`1.10x` faster
11	`3.76s`	`3.42s`	`1.10x` faster
12	`1.16s`	`1.22s`	`-1.05x` slower

Interpretation:

- The CI e2e improvement is real for this feature-branch comparison: reth-block improves from 72.7s to 69.8s, or 1.04x faster.
- The improvement comes mainly from shard proving. The summed create_proof_of_shard spans drop by 2.83s (52.86s -> 50.03s, 1.06x faster), which accounts for almost all of the 2.9s reth-block gain.
- Verification is flat and proof size is slightly larger, so the win is not from smaller proof output or fewer verifier chip groups.
- The feature adds one chip group per shard, consistent with the new ShardRamEcTreeCircuit.
- The logs show the expected split effect: baseline has ShardRamCircuit estimated=170.52MB; feature has ShardRamCircuit estimated=104.39MB plus ShardRamEcTreeCircuit estimated=168.52MB. The ShardRam leaf got smaller, while the new EC-tree chip adds work.
- Caveat: this is not a pure ShardRam split A/B. The feature run also switches Ceno, Ceno-GPU, and GKR refs. The defensible conclusion is that the feature branch improves e2e because shard proving is faster across most shards despite a slightly larger proof and one extra chip per shard. This CI alone does not isolate how much of the improvement comes from the split itself versus the newer dependency set.
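The summed spans and the "most shards" claim can be checked directly from the per-shard table. The sketch below stores the spans as integer centiseconds so the sums are exact; `SPANS_CS`, `totals`, and `faster_shards` are hypothetical names introduced for this check.

```rust
/// Per-shard create_proof_of_shard spans in centiseconds (baseline, feature),
/// copied from the per-shard table; integers keep the sums exact.
const SPANS_CS: [(u32, u32); 13] = [
    (464, 449), (408, 406), (430, 418), (460, 439), (457, 428),
    (434, 423), (420, 389), (488, 468), (421, 378), (399, 364),
    (413, 377), (376, 342), (116, 122),
];

/// Sum the baseline and feature spans separately.
fn totals(spans: &[(u32, u32)]) -> (u32, u32) {
    spans.iter().fold((0, 0), |(b, f), &(sb, sf)| (b + sb, f + sf))
}

/// Count shards where the feature run is strictly faster than baseline.
fn faster_shards(spans: &[(u32, u32)]) -> usize {
    spans.iter().filter(|&&(b, f)| f < b).count()
}

fn main() {
    let (baseline, feature) = totals(&SPANS_CS);
    println!(
        "baseline = {:.2}s, feature = {:.2}s, saved = {:.2}s",
        baseline as f64 / 100.0,
        feature as f64 / 100.0,
        (baseline - feature) as f64 / 100.0
    );
    println!("faster shards: {} of {}", faster_shards(&SPANS_CS), SPANS_CS.len());
}
```

Only shard 12, the short final shard, is slower in the feature run.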

Testing

cargo test --config net.git-fetch-with-cli=true --package ceno_zkvm --lib \
  tables::shard_ram::tests::test_shard_ram_split_selectors_and_tower_padding -- --nocapture

cargo test --config net.git-fetch-with-cli=true --package ceno_zkvm --lib \
  tables::shard_ram::tests::test_shard_ram_circuit -- --nocapture

cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- \
  --platform=ceno \
  --max-cycle-per-shard=1600 \
  examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

cargo run --config net.git-fetch-with-cli=true \
  --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".ceno_gpu.path="../ceno-gpu/cuda_hal"' \
  --release --package ceno_zkvm --features gpu --bin e2e -- \
  --platform=ceno \
  --max-cycle-per-shard=1600 \
  examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

# Reth full-shard GPU validation, cache=1/witgen=0/full shards, command shape shown above.

Outcomes:

- ShardRam selector/padding/custom-record test: passed.
- ShardRam circuit test: passed.
- CPU keccak_syscall e2e: passed.
- GPU keccak_syscall e2e: passed.
- Reth full-shard GPU validation: passed with final exit code 0.

Risks and Rollout

- Soundness risk is concentrated in the new custom bridge between leaf and EC-tree. This is covered by active-row custom read/write matching tests and CPU/GPU e2e validation.
- Performance risk: this split reduces cached raw witness VRAM but does not improve Reth e2e proof time in the local cache=1, witgen=0 benchmark (reth-block 14.153s on master vs 14.440s on this PR). The extra chip and custom product argument can offset the raw witness saving.
- Rollback is local to the ShardRam split: revert the EC-tree chip split and restore the single-chip ShardRam layout.

Follow-ups (optional)

- Add first-class metrics for retained raw witness device backing so cache=1 resident savings are visible directly in benchmark output.
- Investigate whether ShardRam leaf tower prove can be reduced by shrinking the custom bridge product argument or avoiding materialized main-witness outputs that are not needed across the full tower prove stage.
- Investigate scheduling policy so the extra EC-tree chip does not erase the resident-witness benefit in end-to-end proof time.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

hero78119 added 2 commits June 23, 2026 18:41

Split ShardRam EC tree circuit

bc27512

Merge remote-tracking branch 'upstream/master' into feat/shardram_circuit (conflicts: ceno_zkvm/src/scheme/cpu/mod.rs)

8162e6f

hero78119 marked this pull request as draft June 23, 2026 12:43

hero78119 changed the title ~~Split ShardRam EC tree circuit~~ Split ShardRam EC tree into a dedicated circuit Jun 23, 2026

correct opt: half size on poseidon and keep ecc sum as binary tree

de5cc96

hero78119 marked this pull request as ready for review June 23, 2026 23:44

hero78119 wants to merge 3 commits into master from feat/shardram_circuit
