
Split ShardRam EC tree into a dedicated circuit#1369

Open
hero78119 wants to merge 3 commits into master from feat/shardram_circuit

@hero78119 hero78119 commented Jun 23, 2026

Collaborator

Problem

ShardRam currently mixes the leaf RAM/Poseidon work and the EC accumulation tree in one circuit. That keeps the large Poseidon-heavy leaf witness on a 2n domain even though only the EC tree naturally needs the binary-tree 2n layout.

This PR splits the EC tree into a separate ShardRamEcTreeCircuit and connects the leaf and EC-tree chips through a compact custom RAM record.

Design Rationale

Golden rules for chip splitting:

  1. Best case: trade a smaller lookup domain, especially across a power-of-two boundary, for an extra product read/write argument. This is the highest-value split because it can reduce the dominant lookup proving work.
  2. Second-best case: trade smaller resident memory for an extra product read/write argument. This is still useful when cached witness/device residency is the bottleneck, but it may not improve e2e time if the added chip/product work dominates.

The split keeps the ShardRam leaf on an n-sized domain and moves only the EC tree into a dedicated 2n chip. The chip boundary is connected with RAMType::Custom records carrying the EC point (x[0..7], y[0..7]).
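
To make the two domain sizes concrete, here is a minimal Rust sketch. It assumes leaf records are padded to a power of two and that the accumulation tree is a full binary tree, which has 2m - 1 nodes for m leaves. The function names are hypothetical and are not taken from the Ceno codebase.

```rust
/// Rows needed for `n` leaf records once padded to a power of two.
fn leaf_rows(n: usize) -> usize {
    n.next_power_of_two()
}

/// Rows needed for a binary accumulation tree over the padded leaves.
/// A full binary tree with `m` leaves has `2m - 1` nodes, which pads to `2m`.
fn ec_tree_rows(n: usize) -> usize {
    let m = leaf_rows(n);
    (2 * m - 1).next_power_of_two()
}

fn main() {
    let n = 131_072; // leaf row count reported for shard 0 in this PR
    println!("leaf domain:    {}", leaf_rows(n));
    println!("EC-tree domain: {}", ec_tree_rows(n));
}
```

Under these assumptions only the tree chip pays for the 2n domain, while the wide Poseidon-heavy leaf columns stay on n rows.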

Soundness-sensitive points:

  • The ShardRam leaf still computes/binds the Poseidon-derived x-coordinate and y-sign/range constraints.
  • ShardRamEcTreeCircuit consumes the same EC point records and proves the EC accumulation tree separately.
  • The custom bridge rows are checked in tests so active leaf reads/writes match EC-tree writes/reads.
  • Padding/custom rows use neutral values where required; the new RAM custom read/write padding is one, the multiplicative identity, so padded rows leave the product argument unchanged.
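
The role of the neutral padding value can be illustrated with a toy grand-product check. This is a minimal sketch, not the Ceno implementation: the modulus, the `EcPointRecord` type, the `point`, `fingerprint` and `grand_product` helpers, and the fixed challenges `alpha` and `gamma` are all assumptions made for illustration. A real protocol samples the challenges randomly after the witness is committed.

```rust
// Example modulus only (the BabyBear prime); any prime below 2^32 works here.
const P: u64 = 2_013_265_921;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Hypothetical bridge record: an EC point as 8 + 8 limbs (x[0..7], y[0..7]).
#[derive(Clone, Copy)]
struct EcPointRecord { x: [u64; 8], y: [u64; 8] }

/// Helper for the demo: a record whose only non-zero limbs are x[0] and y[0].
fn point(x0: u64, y0: u64) -> EcPointRecord {
    let mut x = [0u64; 8];
    let mut y = [0u64; 8];
    x[0] = x0;
    y[0] = y0;
    EcPointRecord { x, y }
}

/// Fingerprint of one record: gamma + sum_i alpha^i * limb_i (mod P).
fn fingerprint(r: &EcPointRecord, alpha: u64, gamma: u64) -> u64 {
    let alpha = alpha % P;
    let mut acc = gamma % P;
    let mut pow = 1u64;
    for limb in r.x.iter().chain(r.y.iter()) {
        acc = add(acc, mul(pow, *limb % P));
        pow = mul(pow, alpha);
    }
    acc
}

/// Product of fingerprints over all rows; padding rows (None) contribute one.
fn grand_product(rows: &[Option<EcPointRecord>], alpha: u64, gamma: u64) -> u64 {
    rows.iter().fold(1u64, |acc, row| match row {
        Some(r) => mul(acc, fingerprint(r, alpha, gamma)),
        None => mul(acc, 1), // neutral padding value
    })
}

fn main() {
    let (a, b) = (point(1, 2), point(3, 4));
    // Same multiset of records, different order and different padding counts.
    let leaf_writes = [Some(a), Some(b), None, None];
    let tree_reads = [None, Some(b), Some(a)];
    assert_eq!(grand_product(&leaf_writes, 7, 11), grand_product(&tree_reads, 7, 11));
    // Changing one limb of one record breaks the equality.
    let tampered = [Some(a), Some(point(3, 5))];
    assert_ne!(grand_product(&leaf_writes, 7, 11), grand_product(&tampered, 7, 11));
    println!("bridge products match");
}
```

Because padding multiplies by one, the two chips may have different row counts and still agree on the product, which is what lets an n-row leaf chip and a 2n-row tree chip share the bridge.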

Trade-off: this reduces cached raw ShardRam witness footprint, but introduces an extra chip and custom read/write product argument. In the current Reth benchmark shape, the saved resident witness is not the dominant e2e bottleneck.

Change Highlights

  • ceno_zkvm/src/tables/shard_ram.rs
    • Split ShardRamEcTreeCircuit from ShardRamCircuit.
    • Compact the custom bridge record to ShardRamEcPoint + x + y.
    • Remove duplicated RAM/Poseidon fields from EC-tree.
    • Fix CPU Poseidon witness assignment to use config.perm_config.p3_cols[0].id instead of the old hardcoded offset.
    • Add focused selector/padding/custom-record tests.
  • ceno_zkvm/src/instructions/gpu/chips/shard_ram.rs
    • Update GPU column maps for the split leaf and EC-tree layouts.
  • ceno-gpu/cpp/common/witgen/shard_ram_per_row.cuh
    • Update EC-tree witness generation to write only x/y plus structural selector data.

Benchmark / Performance Impact

Operation

Block: 23587691, full shards, CENO_GPU_WITGEN=0, CENO_GPU_CACHE_LEVEL=1, GPU enabled.

| Operation | master (s) | this PR (s) | Ratio (master -> this PR) |
|---|---|---|---|
| reth-block | 14.153 | 14.440 | -1.02x |
| create_proof_of_shard, shard 0 span | 4.240 | 4.480 | -1.06x |
| create_proof_of_shard, shard 1 span | 2.570 | 2.540 | 1.01x |
| app.verify | 0.266 | 0.261 | 1.02x |

Structured metric note: the JSON create_proof_of_shard_time_ms sample was 2568ms on master and 2542ms on this PR, but the span log is the clearer full-shard comparison because it reports both shard spans.
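
The ratio column appears to use a signed convention: a positive value is a speedup (master / this PR) and a negative value is a slowdown (this PR / master, negated). The sketch below reproduces the table's ratios under that assumption. `signed_ratio` is a hypothetical helper, not part of the benchmark tooling.

```rust
/// Signed ratio as the tables appear to use it: positive when the PR is
/// faster (master / pr), negative when it is slower (-(pr / master)).
fn signed_ratio(master: f64, pr: f64) -> f64 {
    if pr <= master {
        master / pr
    } else {
        -(pr / master)
    }
}

fn main() {
    // (operation, master seconds, this-PR seconds) copied from the table above.
    let rows = [
        ("reth-block", 14.153, 14.440),
        ("create_proof_of_shard, shard 0 span", 4.240, 4.480),
        ("create_proof_of_shard, shard 1 span", 2.570, 2.540),
        ("app.verify", 0.266, 0.261),
    ];
    for (name, master, pr) in rows {
        println!("{name}: {:.2}x", signed_ratio(master, pr));
    }
}
```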

Layer

| Layer / Memory item | master | this PR | Ratio (master -> this PR) |
|---|---|---|---|
| ShardRam scheduled proof reservation | 92.00 MiB | 61.50 MiB leaf + 89.52 MiB EC-tree | -1.64x scheduler reservation |
| ShardRam raw cached witness estimate | ~378 MiB | ~206.5 MiB | 1.83x resident raw witness reduction |
| ShardRam leaf rows | 262144 | 131072 | 2.00x |
| ShardRam leaf witness columns | ~378 | 371 | leaf no longer includes EC slope/tree columns |
| ShardRamEcTree rows | included in baseline ShardRam 2n layout | 262144 | moved to separate chip |

Detailed memtrack from this PR, shard 0:

| Circuit | rows | witness columns | structural columns | resident | main witness | tower prove | ecc | total scheduler estimate |
|---|---|---|---|---|---|---|---|---|
| ShardRamCircuit | 131072 | 371 | 3 | 1.50 MiB | 16.00 MiB | 45.89 MiB | 0.00 MiB | 61.50 MiB |
| ShardRamEcTreeCircuit | 262144 | 21 | 7 | 7.00 MiB | 8.00 MiB | 19.89 MiB | 72.52 MiB | 89.52 MiB |

Interpretation:

  • The intended resident raw-witness reduction is present: about 378 MiB -> 206.5 MiB, roughly 171.5 MiB saved.
  • The current e2e time does not improve because scheduler proof reservation is now split into two chips, and the EC quark allocation (72.52 MiB) remains in ShardRamEcTreeCircuit.
  • With cache=1, the scheduler resident= estimate does not include retained raw witness device backing, so it should not be used alone to judge the saved Poseidon-column footprint.
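
The raw-witness figures above can be reproduced with a rows x columns x element-size calculation. This is a minimal sketch that assumes 4-byte field elements and takes the approximate master column count (~378) as exactly 378. Both are assumptions chosen because they match the reported numbers, and `witness_mib` is a hypothetical helper.

```rust
const BYTES_PER_ELEM: u64 = 4; // assumption: 4-byte base-field elements
const MIB: f64 = 1024.0 * 1024.0;

/// Raw witness size in MiB for a chip with `rows` rows and `cols` columns.
fn witness_mib(rows: u64, cols: u64) -> f64 {
    (rows * cols * BYTES_PER_ELEM) as f64 / MIB
}

fn main() {
    // master: single ShardRam chip on the 2n domain (column count approximate).
    let master = witness_mib(262_144, 378);
    // this PR: leaf chip on n rows plus the narrow EC-tree chip on 2n rows.
    let leaf = witness_mib(131_072, 371);
    let tree = witness_mib(262_144, 21);
    let split = leaf + tree;

    println!("master: {master:.1} MiB");
    println!("split:  {split:.1} MiB ({leaf:.1} leaf + {tree:.1} EC-tree)");
    println!("saved:  {:.1} MiB ({:.2}x)", master - split, master / split);
}
```

Halving the rows of the 371-column leaf accounts for almost all of the saving; the 21-column EC-tree chip adds back only a small amount even on the 2n domain.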

Benchmark command(s):

# master baseline and this PR used the same Reth shape:
CENO_GPU_WITGEN=0 \
CENO_CONCURRENT_CHIP_PROVING=1 \
CENO_GPU_CACHE_LEVEL=1 \
CENO_GPU_JAGGED_RESHAPE_LOG_HEIGHT=23 \
CENO_MAX_CELL_PER_SHARD=805306368 \
CENO_GPU_MEM_TRACKING=0 \
CENO_GPU_LARGE_TASK_BOOKING_MARGIN_MB=0 \
OUTPUT_PATH=<metrics-json> \
RUST_LOG=info \
cargo run --features 'jemalloc,gpu' --bin ceno-reth-benchmark-bin --release \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_emul.path="../ceno/ceno_emul"' \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_host.path="../ceno/ceno_host"' \
  --config 'patch."https://github.com/scroll-tech/ceno.git".ceno_zkvm.path="../ceno/ceno_zkvm"' \
  --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".ceno_gpu.path="../ceno-gpu/cuda_hal"' \
  -- \
  --block-number 23587691 \
  --chain-id 1 \
  --cache-dir block_data \
  --mode prove-app \
  --app-proofs ../ceno/app_proof.bitcode

# Extra diagnostic run for this PR, shard 0 only:
CENO_GPU_MEM_TRACKING=1 ... --shard-id 0

Environment:

  • CPU: AMD Ryzen 9 5900XT 16-Core Processor, 32 logical CPUs
  • GPU: NVIDIA GeForce RTX 5070 Ti, 16303 MiB, driver 570.172.08
  • Rust: rustc 1.93.0-nightly (07bdbaedc 2025-11-19)
  • Branch: feat/shardram_circuit
  • This PR commit tested locally: 8162e6f45a53226a93bbf05bd03fd9edb163d53d
  • Baseline master commit tested locally: 678910c71624ab69ea776a82f3ec99971cc3e6d9

Raw data:

  • master:
    • metrics_23587691_full_upstream_master_witgen0_cache1_h23_maxcell6_localcenogpu_gkrpath.json
    • sanity_23587691_full_upstream_master_witgen0_cache1_h23_maxcell6_localcenogpu_gkrpath.log
  • this PR:
    • metrics_23587691_full_witgen0_cache1_shardram_split_current_gkrpatch_20260623.json
    • sanity_23587691_full_witgen0_cache1_shardram_split_current_gkrpatch_20260623.log
    • memtrack diagnostic: sanity_23587691_shard0_memtrack_shardram_split_current_20260623.log

CI Benchmark Comparison: Reth Block 23817600

Comparison source:

Both runs passed. No artifacts were published, so the metrics below were extracted from the GitHub Actions log archive 1_benchmark.txt.

| Metric | Baseline | Feature | Ratio |
|---|---|---|---|
| Ceno ref | feat/opt#1e2ff5b | feat/shardram_circuit#de5cc96c | changed |
| Ceno-GPU ref | feat/opt_sc_first_round#c61c925 | feat/shardram_circuit#134576c4 | changed |
| Block | 23817600 | 23817600 | same |
| Shards | 13 | 13 | same |
| reth-block | 72.7s | 69.8s | 1.04x faster |
| sum create_proof_of_shard | 52.86s | 50.03s | 1.06x faster |
| app.verify | 1.97s | 1.96s | 1.01x faster |
| sum generate_witness | 40.03s | 39.22s | 1.02x faster |
| proof size | 24,899,592 bytes, 23.75 MiB | 24,976,657 bytes, 23.82 MiB | -1.00x larger |
| total verifier chip groups | 710 | 723 | -1.02x, exactly +1 per shard |

Per-shard create_proof_of_shard spans:

| Shard | Baseline | Feature | Ratio |
|---|---|---|---|
| 0 | 4.64s | 4.49s | 1.03x faster |
| 1 | 4.08s | 4.06s | 1.00x faster |
| 2 | 4.30s | 4.18s | 1.03x faster |
| 3 | 4.60s | 4.39s | 1.05x faster |
| 4 | 4.57s | 4.28s | 1.07x faster |
| 5 | 4.34s | 4.23s | 1.03x faster |
| 6 | 4.20s | 3.89s | 1.08x faster |
| 7 | 4.88s | 4.68s | 1.04x faster |
| 8 | 4.21s | 3.78s | 1.11x faster |
| 9 | 3.99s | 3.64s | 1.10x faster |
| 10 | 4.13s | 3.77s | 1.10x faster |
| 11 | 3.76s | 3.42s | 1.10x faster |
| 12 | 1.16s | 1.22s | -1.05x slower |

Interpretation:

  • The CI e2e improvement is real for this feature-branch comparison: reth-block improves from 72.7s to 69.8s, or 1.04x faster.
  • The improvement comes mainly from shard proving. The summed create_proof_of_shard spans drop from 52.86s to 50.03s (1.06x faster), a 2.83s saving that accounts for almost all of the 2.9s reth-block gain.
  • Verification is flat and proof size is slightly larger, so the win is not from smaller proof output or fewer verifier chip groups.
  • The feature adds one chip group per shard, consistent with the new ShardRamEcTreeCircuit.
  • The logs show the expected split effect: baseline has ShardRamCircuit estimated=170.52MB; feature has ShardRamCircuit estimated=104.39MB plus ShardRamEcTreeCircuit estimated=168.52MB. The ShardRam leaf got smaller, while the new EC-tree chip adds work.
  • Caveat: this is not a pure ShardRam split A/B. The feature run also switches Ceno, Ceno-GPU, and GKR refs. The defensible conclusion is that the feature branch improves e2e because shard proving is faster across most shards despite a slightly larger proof and one extra chip per shard. This CI alone does not isolate how much of the improvement comes from the split itself versus the newer dependency set.
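
The summed create_proof_of_shard figures can be cross-checked against the per-shard table. The sketch below stores the spans in hundredths of a second to avoid floating-point drift. The constant and function names are hypothetical, and the values are copied from the table above.

```rust
// create_proof_of_shard spans in hundredths of a second, shards 0..=12.
const BASELINE_CS: [u32; 13] = [464, 408, 430, 460, 457, 434, 420, 488, 421, 399, 413, 376, 116];
const FEATURE_CS: [u32; 13] = [449, 406, 418, 439, 428, 423, 389, 468, 378, 364, 377, 342, 122];

/// Total span time in hundredths of a second.
fn total_cs(spans: &[u32]) -> u32 {
    spans.iter().sum()
}

/// Indices of shards where the feature run is slower than the baseline.
fn slower_shards(baseline: &[u32], feature: &[u32]) -> Vec<usize> {
    baseline
        .iter()
        .zip(feature)
        .enumerate()
        .filter(|(_, (b, f))| f > b)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let base = total_cs(&BASELINE_CS);
    let feat = total_cs(&FEATURE_CS);
    println!("baseline: {:.2}s", f64::from(base) / 100.0);
    println!("feature:  {:.2}s", f64::from(feat) / 100.0);
    println!("saved:    {:.2}s", f64::from(base - feat) / 100.0);
    println!("slower shards: {:?}", slower_shards(&BASELINE_CS, &FEATURE_CS));
}
```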

Testing

cargo test --config net.git-fetch-with-cli=true --package ceno_zkvm --lib \
  tables::shard_ram::tests::test_shard_ram_split_selectors_and_tower_padding -- --nocapture

cargo test --config net.git-fetch-with-cli=true --package ceno_zkvm --lib \
  tables::shard_ram::tests::test_shard_ram_circuit -- --nocapture

cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --bin e2e -- \
  --platform=ceno \
  --max-cycle-per-shard=1600 \
  examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

cargo run --config net.git-fetch-with-cli=true \
  --config 'patch."https://github.com/scroll-tech/ceno-gpu-mock.git".ceno_gpu.path="../ceno-gpu/cuda_hal"' \
  --release --package ceno_zkvm --features gpu --bin e2e -- \
  --platform=ceno \
  --max-cycle-per-shard=1600 \
  examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

# Reth full-shard GPU validation, cache=1/witgen=0/full shards, command shape shown above.

Outcomes:

  • ShardRam selector/padding/custom-record test: passed.
  • ShardRam circuit test: passed.
  • CPU keccak_syscall e2e: passed.
  • GPU keccak_syscall e2e: passed.
  • Reth full-shard GPU validation: passed with final exit code 0.

Risks and Rollout

  • Soundness risk is concentrated in the new custom bridge between leaf and EC-tree. This is covered by active-row custom read/write matching tests and CPU/GPU e2e validation.
  • Performance risk: this split reduces cached raw witness VRAM but does not currently improve Reth e2e proof time under the tested cache=1, witgen=0 shape. The extra chip and custom product argument can offset the raw witness saving.
  • Rollback is local to the ShardRam split: revert the EC-tree chip split and restore the single-chip ShardRam layout.

Follow-ups (optional)

  • Add first-class metrics for retained raw witness device backing so cache=1 resident savings are visible directly in benchmark output.
  • Investigate whether ShardRam leaf tower prove can be reduced by shrinking the custom bridge product argument or avoiding materialized main-witness outputs that are not needed across the full tower prove stage.
  • Investigate scheduling policy so the extra EC-tree chip does not erase the resident-witness benefit in end-to-end proof time.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

@hero78119 hero78119 marked this pull request as draft June 23, 2026 12:43
@hero78119 hero78119 changed the title Split ShardRam EC tree circuit Split ShardRam EC tree into a dedicated circuit Jun 23, 2026
@hero78119 hero78119 marked this pull request as ready for review June 23, 2026 23:44