
Verifier profiling#3237

Draft
Antonio95 wants to merge 52 commits into staging from profile_verifier_multirecord

Conversation

@Antonio95

Contributor

Verifier profiling

Overrides PR #2964. Closes #2871. Thanks go to @vicsn and @davencyw for useful brainstorming.

This PR introduces:

  • A few optimisations to check_transaction, both within the Varuna verifier (batch_verify) and outside of it. One of them affects Varuna proof generation and verification and is therefore guarded behind a new VarunaVersion::V3, corresponding in turn to a new ConsensusVersion::V16. The other three are purely implementational and don’t affect the output of the functions they concern. Section Performed optimisations below explains the optimisations in question. Section Discarded approaches discusses other optimisation directions which were explored and discarded for various reasons.

  • Machinery to benchmark and profile the following:

    • Varuna’s batch_verify on test circuits of the given sizes and batch configurations using criterion (new benchmark case added to existing file)
    • Varuna’s batch_verify as in the previous point but without criterion
    • The method check_transaction on a program of interest: a large (set of two) program(s) with many calls and output records, due to @vicsn.
    • The method prepare_advance_to_next_quorum_block on a block with 8 transactions of the type in the previous point. This benchmark can also be run on a block with 8 transfer_public transactions. Note: The number 8 comes from the bound MAXIMUM_CONFIRMED_TRANSACTIONS, which is set to 8 in test runs. This can be increased during experimentation.

    The commands to run the benchmarks are detailed in section What to run below.

    Note that, except for the first pair, the benchmark cases above are incremental (prepare_advance_to_next_quorum_block calls check_transaction, which in turn calls batch_verify). Crucially, the last three cases above follow the same two-step flow: 1) generate artifacts, then 2) obtain metrics. The two are done in separate commands as explained below. The main reasons for the split are:

    • To obtain metrics without regenerating proofs every time (this can be done as long as the changes being tested are not functional - at that point, regeneration is needed).
    • More importantly: to be able to profile with flamegraph. Since verification is orders of magnitude faster, attempting to flamegraph validator work in a run which also produces the proofs/transactions being validated results in very poor metrics. Instead, we dedicate a separate execution to artifact generation so that the main one can focus on validator work.
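
The two-step flow can be sketched as follows. This is a minimal illustration only, not the code of the benches in this PR: the names Mode, parse_mode and time_once are hypothetical, and only the flags --generate and --clean are taken from the commands listed in this description. Unknown arguments are ignored, since cargo bench forwards its own --bench argument to the binary.

```rust
use std::time::{Duration, Instant};

/// What a single invocation of the bench binary should do (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Mode {
    /// Produce proofs/transactions and store them on disk (slow, prover-side work).
    Generate,
    /// Delete any stored artifacts.
    Clean,
    /// Load stored artifacts and time only the verifier/validator work.
    Measure,
}

/// Picks the mode from the arguments that follow `--` on the command line.
/// The first recognised flag wins; anything else (e.g. `--bench`) is ignored.
fn parse_mode<S: AsRef<str>>(args: &[S]) -> Mode {
    for arg in args {
        match arg.as_ref() {
            "--generate" => return Mode::Generate,
            "--clean" => return Mode::Clean,
            _ => {}
        }
    }
    Mode::Measure
}

/// Runs `work` once and returns its result together with the elapsed wall-clock time.
/// In `Mode::Measure` only the verification call would be wrapped like this,
/// so that proof generation never shows up in the measurement or the flamegraph.
fn time_once<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = work();
    (value, start.elapsed())
}
```

Keeping generation and measurement in separate processes is what lets a profiler attach to a run that contains validator work only.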

These benchmarks can also be used when testing validator/verifier (perhaps also prover) optimisations in the future.

Minor comments

  • This PR also introduces the (trivial) auxiliary function varuna_version_from_consensus. This conversion was manually performed in several independent places (now updated to point to the function), giving rise to potential inconsistencies. As mentioned above, there is now a new VarunaVersion::V3 for ConsensusVersion::V16 onwards.
  • This PR also modifies the constants MAX_DEPLOYMENT_VARIABLES and MAX_DEPLOYMENT_CONSTRAINTS from 1 << 21 to 1 << 23, which was necessary to accommodate the multi-record program of interest. Work on better handling of large-deployment bounds is underway in e.g. Increase deploy limits by 8x #2955.
  • It seems rayon was not a dependency in e.g. snarkvm-console. I have added it (and a corresponding --serial feature wired appropriately) in order to parallelise the setup of the BHP hashers (which amounts to a number of EC-point additions), but if there are reasons against this, simply let me know.
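
For illustration, a centralised version mapping might look roughly like the sketch below. The enums are cut-down stand-ins for the real ones, and the assumption that V14 and V15 map to VarunaVersion::V2 is mine; only the V16 to V3 correspondence is stated in this description.

```rust
/// Cut-down stand-in for the real consensus enum (only three versions modelled).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum ConsensusVersion {
    V14,
    V15,
    V16,
}

/// Cut-down stand-in for the real Varuna version enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum VarunaVersion {
    V2,
    V3,
}

/// Single source of truth for the consensus -> Varuna conversion.
/// Assumption: everything before V16 in this toy model uses V2.
fn varuna_version_from_consensus(consensus: ConsensusVersion) -> VarunaVersion {
    // The derived `Ord` follows declaration order, so `>=` means "V16 onwards".
    if consensus >= ConsensusVersion::V16 { VarunaVersion::V3 } else { VarunaVersion::V2 }
}
```

Callers on both the proving and verifying side would then go through this one function instead of repeating the comparison.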

Results

The following table contains the time measurements on my laptop (Apple M4 Pro chip, 48 GB RAM) for reference:

Screenshot 2026-04-30 at 17 25 39

For pure verification (first two horizontal blocks in the image, corresponding to the first two benchmark cases specified earlier), gains occur in all circuit/batch configurations. They are particularly dramatic (>94% speedup in parallel, 90% serial) in the circuit configuration mimicking the multi-record program of interest.

For check_transaction on an execution of the multi-record program (third block in the image, third benchmark case above), the gains are also very substantial at ~77%. These come from the speedups in batch_verify (related to the previous comments) as well as validator optimisations outside of that.

Finally, in prepare_advance_to_next_quorum_block, the gains in question are somewhat diluted by the rest of the validator work, which has not been optimised, resulting in

  • a 60% speedup in parallel and 67% speedup serial for a block with transactions of the multi-record-program type only.
  • still, a 30% speedup in parallel and 22% speedup serial for a block with transfer_public transactions only.

It should be noted that, in non-test scenarios, blocks can contain 50 transactions instead of the 8 used in the fourth benchmark, so check_transaction accounts for a larger proportion of prepare_advance_to_next_quorum_block work than in our measurements. The gains there would therefore likely be larger.

What to run

All commands are indicated as executed from the root directory snarkVM. The -p <crate_name> part of the commands can be left out if in the relevant directory.

  • Criterion benchmark for custom test circuits:

    (algorithms/benches/snark/varuna.rs, fn snark_batch_verify_scaling)

    cargo bench -p snarkvm-algorithms --bench varuna --features test -- "snark_batch_verify_scaling"

  • Profiling batch_verify on test circuits:

    (algorithms/benches/snark/varuna_verifier.rs)

    • To generate the proof and associated data (after setting the desired batch sizes in the file):

      cargo bench -p snarkvm-algorithms --bench varuna_verifier --features test -- --generate

    • To get a verification-time measurement (after generating artefacts as above):

      cargo bench -p snarkvm-algorithms --bench varuna_verifier --features test

    • To flamegraph batch_verify using previously generated artefacts:

      cargo flamegraph -p snarkvm-algorithms --bench varuna_verifier --features "test, serial" -- --generate

      The serial flag can be removed to activate parallelisation.

    • To clean the artefacts (which are never added to git in any case):

      cargo bench -p snarkvm-algorithms --bench varuna_verifier --features test -- --clean

  • Profiling check_transaction on the multi-record program:

    (synthesizer/benches/check_transaction_multirecord/main.rs)

    • To generate the deployment and execution transactions:

      cargo bench -p snarkvm-synthesizer --bench check_transaction_multirecord -- --generate

    • To get a time measurement for check_transaction (after generating artefacts as above):

      cargo bench -p snarkvm-synthesizer --bench check_transaction_multirecord

    • To flamegraph check_transaction using previously generated artefacts:

      cargo flamegraph -p snarkvm-synthesizer --bench check_transaction_multirecord --features serial

      The serial flag can be removed to activate parallelisation.

    • To clean the artifacts (which are never added to git in any case):

      cargo bench -p snarkvm-synthesizer --bench check_transaction_multirecord -- --clean

  • Profiling prepare_advance_to_next_quorum_block on the multi-record program:

    (ledger/benches/prepare_advance_multirecord/main.rs)

    • To generate the deployment and execution transactions:

      cargo bench -p snarkvm-ledger --bench prepare_advance_multirecord --features="test-helpers, rocks" -- --generate

      The variable n_transactions controls how many transactions are included in the block. If > 8, the test value of ConsensusState::MAXIMUM_CONFIRMED_TRANSACTIONS has to be increased accordingly.

    • To get a time measurement for prepare_advance_to_next_quorum_block (after generating artifacts as above):

      cargo bench -p snarkvm-ledger --bench prepare_advance_multirecord --features="test-helpers, rocks"

    • To clean the artifacts (which are never added to git in any case):

      cargo bench -p snarkvm-ledger --bench prepare_advance_multirecord --features="test-helpers, rocks" -- --clean

    • To run any of the above on transfer_public transactions instead of the multi-record program, include the command-line argument --transfer_public (note: --clean removes all artefacts for both types of programs; artefact generation is still separate).

    • To remove parallelisation from any of the above, include the command-line argument --serial. Note that deployment of wrapper.aleo (which occurs even without the --generate flag) can take several minutes in this mode.

    Performed optimisations

    • Faster hash for sponge initialisation (functional change): When initialising the Poseidon sponge for Varuna proving/verification, we absorb some public parameters (basic circuit data) and the public inputs (cf. https://eprint.iacr.org/2023/691). This can involve large amounts of field elements (as opposed to most absorptions corresponding to the interactive part of the AHP, which are substantially reduced thanks to batching) on which Poseidon takes a sizeable amount of time. We have modified this operation so that the (same) public parameters and inputs are digested with a fast hash (SHA256), resulting in a single field element which is then absorbed by way of sponge initialisation. This is in principle compatible with potential recursive proving in the future, since the digestion can be done by the proving and verifying parties out-circuit and the digest passed to the circuit in the form of a singleton public input (as opposed to the sponge operations corresponding to AHP interaction, which need to happen in-circuit in recursive contexts).

      This affects proof generation and verification and is therefore guarded behind a new VarunaVersion::V3.

    • Improved Poseidon S-boxes (no functional changes): An important part of the Poseidon permutation involves raising each element of the sponge’s internal state to a fixed constant alpha (this transformation is known as the S-box). In the case of the Fiat-Shamir configuration used by Varuna, the constant has the value alpha = 17. The other two values present in the codebase are 3 and 5. In the original implementation, this S-box called the generic pow method from Field. Replacing that by a custom-coded method consisting of a suitable number of calls to square_in_place and mul_assign resulted in a substantial performance gain. While exploring this topic, the pow method itself was slightly optimised in a trivial way (switching from BitIteratorBE::new to BitIteratorBE::new_without_leading_zeros). Specifically, running the second benchmark case above before and after the change (commit 3ecf17ea3) results in a batch_verify time of 144ms vs. 112ms.
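
The S-box specialisation can be sketched on a toy prime field. Everything below is an illustration under stated assumptions: the modulus P, the helper mul and the function names are hypothetical stand-ins for the real field type and its square_in_place / mul_assign methods, and pow_generic is a plain square-and-multiply, not the actual Field::pow.

```rust
/// Toy prime modulus (2^61 - 1); a hypothetical stand-in for the real field.
const P: u64 = (1u64 << 61) - 1;

/// Modular multiplication via a 128-bit intermediate product.
fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Generic square-and-multiply over the bits of `exp`, most significant first,
/// skipping the leading zero bits of the exponent.
fn pow_generic(base: u64, exp: u64) -> u64 {
    let mut acc = 1u64;
    let bits = 64 - exp.leading_zeros();
    for i in (0..bits).rev() {
        acc = mul(acc, acc);
        if (exp >> i) & 1 == 1 {
            acc = mul(acc, base);
        }
    }
    acc
}

/// x^3 with one squaring and one multiplication.
fn sbox_3(x: u64) -> u64 {
    mul(mul(x, x), x)
}

/// x^5 with two squarings and one multiplication.
fn sbox_5(x: u64) -> u64 {
    let x2 = mul(x, x);
    mul(mul(x2, x2), x)
}

/// x^17 with four squarings and one multiplication, with no loop or bit tests.
fn sbox_17(x: u64) -> u64 {
    let mut t = mul(x, x); // x^2
    t = mul(t, t); // x^4
    t = mul(t, t); // x^8
    t = mul(t, t); // x^16
    mul(t, x) // x^17
}
```

The hard-coded chains compute the same values as the generic routine; the saving comes from removing the per-bit loop and branching for exponents that are known in advance.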

    • Improved BHP performance through base preprocessing (no functional changes): We use the BHP hash in order to compress record ciphertexts into a checksum (computed in-circuit and matched against a natively computed value). For transitions outputting many records, such as the multi-record program, this takes a substantial amount of validator work. At its core, the BHP hash has (for each of our several configurations) a predefined basis consisting of a number of arrays, each containing 8 elliptic-curve points. The hash function splits the input into three-bit chunks, interprets each chunk as an index 0…7 and looks up the corresponding point in the associated 8-point basis vector. All looked up points are then added together, resulting in the hash value.

      An easy way to speed up this computation is to preprocess N chunks together. For instance, for N = 2, one can consider all 8 * 8 = 64 possible sums of elements of the two 8-point vectors corresponding to two consecutive chunks. The size of this combined preprocessed table is roughly 8 times that of the original basis, and its computation incurs more setup work (which is lazily done once for each BHP configuration when needed; the plain BHP bases are not stored to or read from disk). However, the extended preprocessed table reduces the hashing work (which simply consists of addition of curve points) by 50%, since now the hash can effectively look up (and add the looked-up values of) half as many 6-bit chunks as it did 3-bit chunks originally. This optimisation has been added, and the value N has been included as a customisable constant BHP_NUM_COMBINED_CHUNKS (note: this value does not have to evenly divide the number of chunks of the given BHP configuration; trailing chunks are handled appropriately). Notably, the implementation does not change the output value of the hash and is fully backwards compatible.

      As part of the exploration to find a good value for BHP_NUM_COMBINED_CHUNKS, code has been added (behind the dev-print feature) to display the hasher’s setup time and the resulting hasher (key) size (the code contains a comment with the exact formula guiding key-size growth). The results are as follows (columns = different values of BHP_NUM_COMBINED_CHUNKS; rows = our four different BHP hashers):

      Screenshot 2026-04-30 at 18 17 44

      In light of this, the chosen value is BHP_NUM_COMBINED_CHUNKS = 5. This results in a total key size of 0.8GB for the four (combined) hashers (in exchange for ~5x faster BHP hashing, although this depends on the exact input length), whereas switching to the next value of 6 would result in a total ~5.8 GB of hasher keys. Setup time is very low (taking into account it is only done once), at 0.256s in total for the four hashers. Note this time corresponds to BHP::setup done partly in parallel, which is something this PR also adds (without parallelisation, the 0.256s increases to ~2s).
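
The combined-chunk idea for N = 2 can be sketched with a toy group. This is not the hasher code from this PR: integers modulo a hypothetical P under addition stand in for elliptic-curve points, windows are ignored, and toy_bases, combine_pairs, hash_plain and hash_combined are illustrative names. The sketch assumes the input has exactly three bits per basis.

```rust
/// Toy modulus; addition mod P stands in for elliptic-curve point addition.
const P: u64 = 1_000_000_007;

/// One 8-entry basis per 3-bit chunk position.
type Basis = [u64; 8];

/// Deterministic pseudo-random toy bases (simple LCG), one per chunk position.
fn toy_bases(num_chunks: usize) -> Vec<Basis> {
    let mut state = 12345u64;
    (0..num_chunks)
        .map(|_| {
            let mut basis = [0u64; 8];
            for slot in basis.iter_mut() {
                state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
                *slot = (state >> 33) % P;
            }
            basis
        })
        .collect()
}

/// Interprets three bits as an index in 0..8 (little-endian).
fn chunk_index(bits: &[bool]) -> usize {
    (bits[0] as usize) | ((bits[1] as usize) << 1) | ((bits[2] as usize) << 2)
}

/// Plain hash: one lookup and one addition per 3-bit chunk.
fn hash_plain(bases: &[Basis], bits: &[bool]) -> u64 {
    bits.chunks_exact(3).zip(bases).fold(0u64, |acc, (chunk, basis)| (acc + basis[chunk_index(chunk)]) % P)
}

/// Setup: merge consecutive pairs of bases into 64-entry tables of all pairwise sums.
/// A trailing unpaired basis (odd number of chunks) is kept as-is.
fn combine_pairs(bases: &[Basis]) -> (Vec<[u64; 64]>, Option<Basis>) {
    let mut tables = Vec::new();
    for pair in bases.chunks_exact(2) {
        let mut table = [0u64; 64];
        for i in 0..8 {
            for j in 0..8 {
                table[i | (j << 3)] = (pair[0][i] + pair[1][j]) % P;
            }
        }
        tables.push(table);
    }
    (tables, bases.chunks_exact(2).remainder().first().copied())
}

/// Combined hash: one lookup and one addition per 6-bit chunk, plus the trailing chunk.
fn hash_combined(tables: &[[u64; 64]], trailing: Option<Basis>, bits: &[bool]) -> u64 {
    let mut acc = 0u64;
    for (six_bits, table) in bits.chunks_exact(6).zip(tables) {
        let idx = chunk_index(&six_bits[..3]) | (chunk_index(&six_bits[3..]) << 3);
        acc = (acc + table[idx]) % P;
    }
    if let Some(basis) = trailing {
        acc = (acc + basis[chunk_index(&bits[tables.len() * 6..])]) % P;
    }
    acc
}
```

Both functions return the same value for any input of matching length, which is the property that makes the optimisation backwards compatible; only the number of additions at hashing time changes.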

    • Parallelisation of output verifications (suggested by @vicsn): When checking a transition, validators call output.verify(...) on each of its outputs. This was done sequentially before, and switching to parallel verification speeds up check_transaction (if idle cores are available) in flows where this step is expensive. For instance, verification of output records involves hashing the published ciphertext and matching it against the corresponding public circuit input. In the multi-record program of interest, where there are ~480 output records, this is a substantial amount of hashing work.
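
The switch from sequential to parallel output checks can be sketched with the standard library alone. The code in this PR may well use rayon's parallel iterators instead; std::thread::scope is used here only to keep the example dependency-free. The checksum and the (ciphertext, expected checksum) pairs are hypothetical stand-ins for output.verify(...) and real outputs.

```rust
use std::thread;

/// Toy checksum (FNV-1a style); a stand-in for hashing a record ciphertext.
fn toy_checksum(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0xcbf29ce484222325u64, |h, b| (h ^ *b as u64).wrapping_mul(0x100000001b3))
}

/// Stand-in for `output.verify(...)`: recompute the checksum and compare.
fn verify_output(ciphertext: &[u8], expected: u64) -> bool {
    toy_checksum(ciphertext) == expected
}

/// Previous behaviour: check the outputs one after another.
fn verify_all_sequential(outputs: &[(Vec<u8>, u64)]) -> bool {
    outputs.iter().all(|(ciphertext, expected)| verify_output(ciphertext, *expected))
}

/// Parallel variant: split the outputs into roughly equal chunks, one per worker thread.
fn verify_all_parallel(outputs: &[(Vec<u8>, u64)], num_threads: usize) -> bool {
    let workers = num_threads.max(1);
    // Ceiling division, clamped to 1 because `chunks(0)` would panic.
    let chunk_len = ((outputs.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn every worker first, then join; all threads are joined when the scope ends.
        let handles: Vec<_> = outputs
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().all(|(c, e)| verify_output(c, *e))))
            .collect();
        handles.into_iter().all(|handle| handle.join().expect("worker thread panicked"))
    })
}
```

The result is the same boolean in both variants; the parallel one only helps when idle cores are available and each individual check is expensive enough to outweigh the scheduling overhead.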

    Discarded approaches

    • Concatenating the Poseidon state: A Poseidon sponge is constructed from a permutation which is applied to the sponge’s internal state. A portion of that state corresponds to the sponge’s rate (the part which is read when squeezing and to which elements are added when absorbing). The remainder of the state corresponds to the sponge’s capacity (its “internal randomness”, which is neither read when squeezing nor modified when absorbing). We currently keep the two parts as separate arrays and it was conceivable that concatenating them into a single array might improve memory access/allocation (for instance, part of the Poseidon permutation involves the product of the full state vector by a matrix). This was tested in 7e898fdc4 and did not bring any performance improvements - however, it made the const generics a fair bit more complex (one cannot define an array with size RATE + CAPACITY without using the nightly feature generic_const_exprs, so RATE_PLUS_CAPACITY had to be introduced…).
    • Increased Poseidon rate: Since Poseidon performs one permutation per RATE field elements absorbed and some of the test cases involve many absorptions (this is the case, for instance, when many circuit instances are batched), it was conceivable that increasing the rate (from the current 2 to the maximum available 8) would lead to fewer permutations (for the same number of field elements absorbed), and therefore better performance. This was tested and had, in fact, a ~20% negative impact on performance (parts of the permutation itself, such as the state-by-matrix product, grow with the rate too).
    • Parallelise the expensive absorb_sums step within Varuna: The idea was roughly to digest separate groups of elements corresponding to each instance together in parallel, then absorb the digests into the sponge. This was discarded mainly because the cryptography of it was dubious and it would likely involve a non-Poseidon hash for the digest (or a fresh Poseidon sponge), which had numerous disadvantages.
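
The const-generics complication mentioned for the concatenated state can be illustrated as follows. This is a sketch on stable Rust, not the code from commit 7e898fdc4: the struct names and the u64 element type are placeholders, and the capacity of 1 used in the usage note is an assumption (only the rate of 2 is stated above).

```rust
/// Current layout (schematically): rate and capacity kept as separate arrays.
#[allow(dead_code)]
struct SplitState<const RATE: usize, const CAPACITY: usize> {
    rate: [u64; RATE],
    capacity: [u64; CAPACITY],
}

/// Concatenated layout. Stable Rust rejects `[u64; RATE + CAPACITY]` as a field type,
/// so the sum has to be passed in as a third const parameter.
struct JoinedState<const RATE: usize, const CAPACITY: usize, const RATE_PLUS_CAPACITY: usize> {
    state: [u64; RATE_PLUS_CAPACITY],
}

impl<const RATE: usize, const CAPACITY: usize, const RATE_PLUS_CAPACITY: usize>
    JoinedState<RATE, CAPACITY, RATE_PLUS_CAPACITY>
{
    /// Compile-time guard: the third parameter must really be the sum of the first two.
    const CHECK: () = assert!(RATE + CAPACITY == RATE_PLUS_CAPACITY, "RATE_PLUS_CAPACITY mismatch");

    fn new() -> Self {
        // Referencing the constant forces the check for this instantiation.
        let () = Self::CHECK;
        Self { state: [0; RATE_PLUS_CAPACITY] }
    }

    /// The part of the state that is read when squeezing and written when absorbing.
    fn rate(&self) -> &[u64] {
        &self.state[..RATE]
    }

    /// The part of the state that is neither read nor written from outside.
    fn capacity(&self) -> &[u64] {
        &self.state[RATE..]
    }
}
```

A caller would write JoinedState::<2, 1, 3>::new(), and an inconsistent instantiation such as JoinedState::<2, 1, 4>::new() fails to compile. The redundant third parameter is exactly the extra complexity referred to above.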

Antonio95 and others added 30 commits April 16, 2026 16:51
Co-authored-by: vicsn

Copilot AI left a comment


Pull request overview

This PR introduces verifier/profiling infrastructure and performance optimizations across Varuna verification and transaction/block validation, including a new Varuna/consensus version mapping to gate a functional verifier/prover change.

Changes:

  • Add ConsensusVersion::V16 + VarunaVersion::V3 and centralize consensus→Varuna version selection via varuna_version_from_consensus.
  • Optimize verifier/validator hot paths (Varuna sponge init via SHA-256 digest in V3, Poseidon S-box specialization, BHP base preprocessing, parallel output verification).
  • Add new benchmarks and artifact workflows for profiling batch_verify, check_transaction, and prepare_advance_to_next_quorum_block.

Reviewed changes

Copilot reviewed 35 out of 37 changed files in this pull request and generated 19 comments.

| File | Description |
| --- | --- |
| vm/package/mod.rs | Removes now-unneeded VarunaVersion import after centralizing version mapping. |
| vm/package/execute.rs | Switches Varuna version selection to varuna_version_from_consensus. |
| synthesizer/src/vm/verify.rs | Uses centralized consensus→Varuna version mapping for verification paths. |
| synthesizer/src/vm/tests/test_v14/snark_verify.rs | Adds TODO notes and imports around Varuna versioning for V14 tests. |
| synthesizer/src/vm/mod.rs | Removes direct VarunaVersion import after refactor. |
| synthesizer/src/vm/execute.rs | Uses centralized consensus→Varuna version mapping for execution/proving paths. |
| synthesizer/process/src/verify_execution/mod.rs | Parallelizes transition output verification and reuses tcm. |
| synthesizer/benches/verifier/verifier.rs | Adds an (unregistered) verifier bench harness that generates/loads artifacts. |
| synthesizer/benches/check_transaction_multirecord/main.rs | Adds artifact-based bench for profiling check_transaction on a multi-record workload. |
| synthesizer/Cargo.toml | Wires in the check_transaction_multirecord bench and expands dev-print feature propagation. |
| ledger/src/test_helpers/chain_builder.rs | Adds deterministic timestamps option and exposes helpers to split prepare/apply for quorum blocks. |
| ledger/benches/prepare_advance_multirecord/main.rs | Adds artifact-based bench for prepare_advance_to_next_quorum_block (multi-record / transfer_public). |
| ledger/Cargo.toml | Registers the prepare_advance_multirecord bench. |
| fields/src/traits/field.rs | Optimizes Field::pow by skipping leading-zero bits. |
| console/program/Cargo.toml | Expands dev-print feature propagation into dependent console crates. |
| console/network/src/consensus_heights.rs | Adds ConsensusVersion::V16, updates heights, and introduces varuna_version_from_consensus. |
| console/network/Cargo.toml | Adds dev-print feature wiring. |
| console/algorithms/src/bhp/hasher/mod.rs | Adds combined-chunk preprocessing + optional parallel setup; includes dev-print sizing metrics. |
| console/algorithms/src/bhp/hasher/hash_uncompressed.rs | Updates hash path to use combined preprocessed lookup tables + trailing handling. |
| console/algorithms/Cargo.toml | Adds rayon dependency and serial feature gate. |
| console/Cargo.toml | Extends serial and dev-print feature wiring to include console algorithms. |
| algorithms/src/snark/varuna/varuna.rs | Adds V3 sponge init path using SHA-256 digest; threads varuna_version through init. |
| algorithms/src/snark/varuna/tests.rs | Updates “wrong version” logic to account for V3. |
| algorithms/src/snark/varuna/mode.rs | Adds VarunaVersion::V3 serialization support. |
| algorithms/src/snark/varuna/data_structures/proof.rs | Treats V3 proof sizing like V2. |
| algorithms/src/snark/varuna/ahp/verifier/verifier.rs | Treats V3 verifier challenge derivation like V2. |
| algorithms/src/snark/varuna/ahp/verifier/messages.rs | Treats V3 third-round challenge selection like V2. |
| algorithms/src/snark/varuna/ahp/prover/round_functions/third.rs | Treats V3 third-round prover behavior like V2. |
| algorithms/src/crypto_hash/poseidon.rs | Optimizes Poseidon S-box for alpha ∈ {3,5,17} via specialized exponentiation. |
| algorithms/benches/snark/varuna_verifier.rs | Adds artifact-based bench for profiling batch_verify without Criterion overhead. |
| algorithms/benches/snark/varuna.rs | Adds Criterion scaling benchmark for verify_batch on selected circuit/batch configs. |
| algorithms/Cargo.toml | Registers the new varuna_verifier bench. |
| Cargo.toml | Includes snarkvm-console/dev-print in workspace dev-print feature. |
| Cargo.lock | Adds rayon to dependency graph where needed. |
| .gitignore | Ignores varuna_verifier_artifacts/ and flamegraph.svg. |
| .circleci/config.yml | Adds a branch name to merge-workflow conditions. |

Comment thread synthesizer/src/vm/tests/test_v14/snark_verify.rs Outdated
Comment thread synthesizer/benches/verifier/verifier.rs Outdated
Comment thread synthesizer/benches/verifier/verifier.rs Outdated
Comment thread console/algorithms/src/bhp/hasher/mod.rs Outdated
Comment thread console/algorithms/src/bhp/hasher/mod.rs Outdated

Ok(sum)
}
}

Copilot AI Apr 30, 2026


The hasher lookup path was substantially refactored (combined-chunk indexing + trailing handling) but existing tests in this module only validate input sizing, not that the hash output is stable. Please add at least one deterministic test vector (fixed domain + fixed input bits -> expected x-coordinate / field hash) to catch accidental output changes in future optimizations.

Suggested change
        }
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        type CurrentEnvironment = Console;

        fn reference_hash_uncompressed<const NUM_WINDOWS: u8, const WINDOW_SIZE: u8>(
            hasher: &BHPHasher<CurrentEnvironment, NUM_WINDOWS, WINDOW_SIZE>,
            input: &[bool],
        ) -> Group<CurrentEnvironment> {
            let input = if input.len() % BHP_CHUNK_SIZE != 0 {
                let padding = BHP_CHUNK_SIZE - (input.len() % BHP_CHUNK_SIZE);
                let mut padded_input = vec![false; input.len() + padding];
                padded_input[..input.len()].copy_from_slice(input);
                Cow::Owned(padded_input)
            } else {
                Cow::Borrowed(input)
            };
            input
                .chunks(WINDOW_SIZE as usize * BHP_CHUNK_SIZE)
                .zip(hasher.bases_lookup.iter())
                .flat_map(|(window_bits, bases)| {
                    window_bits.chunks_exact(BHP_CHUNK_SIZE).enumerate().map(move |(triplet_index, chunk_bits)| {
                        let idx =
                            (chunk_bits[0] as usize) | (chunk_bits[1] as usize) << 1 | (chunk_bits[2] as usize) << 2;
                        bases[triplet_index][idx]
                    })
                })
                .sum()
        }

        #[test]
        fn test_hash_uncompressed_regression_vector() -> Result<()> {
            let hasher = BHPHasher::<CurrentEnvironment, 2, 8>::setup("codeql.bhp.hash_uncompressed.regression")?;
            let input = vec![
                true, false, true, true, false, false, true, false, true, false, true, true, false, false, true,
                true, false, true, false, false, true, true, true, false, true,
            ];
            let output = hasher.hash_uncompressed(&input)?;
            let expected = reference_hash_uncompressed(&hasher, &input);
            assert_eq!(output, expected);
            Ok(())
        }
    }


Contributor Author


This was implicitly being done for some of the hashers by running the bench on transactions which had been generated before the hash optimisation. In any case, a full, relatively comprehensive test file with vectors for many input lengths has been added now.

Contributor Author


console/algorithms/src/bhp/hasher/tests.rs

Comment thread synthesizer/benches/check_transaction_multirecord/main.rs
Comment thread synthesizer/Cargo.toml
Comment thread algorithms/src/snark/varuna/tests.rs Outdated
Comment thread synthesizer/benches/check_transaction_multirecord/main.rs Outdated
@vicsn vicsn requested review from davencyw and mohammadfawaz May 1, 2026 06:28
@Antonio95 Antonio95 marked this pull request as draft June 22, 2026 13:50
@Antonio95

Contributor Author

For when we reopen this: cf. Łukasz's comment: #3249 (comment)


Development

Successfully merging this pull request may close these issues.

[Perf] Speed up batch instance proof verification

7 participants