Verifier profiling #3237
Co-authored-by: vicsn
This reverts commit 23b3bc8.
This reverts commit f071431.
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Pull request overview
This PR introduces verifier/profiling infrastructure and performance optimizations across Varuna verification and transaction/block validation, including a new Varuna/consensus version mapping to gate a functional verifier/prover change.
Changes:
- Add `ConsensusVersion::V16` + `VarunaVersion::V3` and centralize consensus→Varuna version selection via `varuna_version_from_consensus`.
- Optimize verifier/validator hot paths (Varuna sponge init via SHA-256 digest in V3, Poseidon S-box specialization, BHP base preprocessing, parallel output verification).
- Add new benchmarks and artifact workflows for profiling `batch_verify`, `check_transaction`, and `prepare_advance_to_next_quorum_block`.
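To illustrate what centralizing the consensus→Varuna mapping buys, here is a minimal sketch. Only the `V16` → `V3` pairing comes from this PR; the simplified `ConsensusVersion` newtype and the `HYPOTHETICAL_FIRST_V2` threshold are placeholders invented for the example and do not reflect the actual snarkVM definitions.

```rust
/// Simplified stand-in for the real `ConsensusVersion` enum (assumption).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ConsensusVersion(u16);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum VarunaVersion {
    V1,
    V2,
    V3,
}

/// Placeholder threshold, NOT taken from the PR or from snarkVM.
const HYPOTHETICAL_FIRST_V2: ConsensusVersion = ConsensusVersion(4);
/// From the PR description: `ConsensusVersion::V16` onwards uses `VarunaVersion::V3`.
const FIRST_V3: ConsensusVersion = ConsensusVersion(16);

/// Single place where the mapping lives, so callers cannot drift apart.
fn varuna_version_from_consensus(version: ConsensusVersion) -> VarunaVersion {
    if version >= FIRST_V3 {
        VarunaVersion::V3
    } else if version >= HYPOTHETICAL_FIRST_V2 {
        VarunaVersion::V2
    } else {
        VarunaVersion::V1
    }
}

fn main() {
    for v in [1u16, 15, 16, 17] {
        println!("consensus V{} -> {:?}", v, varuna_version_from_consensus(ConsensusVersion(v)));
    }
}
```

With one function owning the thresholds, prover and verifier call sites cannot pick different versions for the same block height.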
Reviewed changes
Copilot reviewed 35 out of 37 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| vm/package/mod.rs | Removes now-unneeded VarunaVersion import after centralizing version mapping. |
| vm/package/execute.rs | Switches Varuna version selection to varuna_version_from_consensus. |
| synthesizer/src/vm/verify.rs | Uses centralized consensus→Varuna version mapping for verification paths. |
| synthesizer/src/vm/tests/test_v14/snark_verify.rs | Adds TODO notes and imports around Varuna versioning for V14 tests. |
| synthesizer/src/vm/mod.rs | Removes direct VarunaVersion import after refactor. |
| synthesizer/src/vm/execute.rs | Uses centralized consensus→Varuna version mapping for execution/proving paths. |
| synthesizer/process/src/verify_execution/mod.rs | Parallelizes transition output verification and reuses tcm. |
| synthesizer/benches/verifier/verifier.rs | Adds an (unregistered) verifier bench harness that generates/loads artifacts. |
| synthesizer/benches/check_transaction_multirecord/main.rs | Adds artifact-based bench for profiling check_transaction on a multi-record workload. |
| synthesizer/Cargo.toml | Wires in the check_transaction_multirecord bench and expands dev-print feature propagation. |
| ledger/src/test_helpers/chain_builder.rs | Adds deterministic timestamps option and exposes helpers to split prepare/apply for quorum blocks. |
| ledger/benches/prepare_advance_multirecord/main.rs | Adds artifact-based bench for prepare_advance_to_next_quorum_block (multi-record / transfer_public). |
| ledger/Cargo.toml | Registers the prepare_advance_multirecord bench. |
| fields/src/traits/field.rs | Optimizes Field::pow by skipping leading-zero bits. |
| console/program/Cargo.toml | Expands dev-print feature propagation into dependent console crates. |
| console/network/src/consensus_heights.rs | Adds ConsensusVersion::V16, updates heights, and introduces varuna_version_from_consensus. |
| console/network/Cargo.toml | Adds dev-print feature wiring. |
| console/algorithms/src/bhp/hasher/mod.rs | Adds combined-chunk preprocessing + optional parallel setup; includes dev-print sizing metrics. |
| console/algorithms/src/bhp/hasher/hash_uncompressed.rs | Updates hash path to use combined preprocessed lookup tables + trailing handling. |
| console/algorithms/Cargo.toml | Adds rayon dependency and serial feature gate. |
| console/Cargo.toml | Extends serial and dev-print feature wiring to include console algorithms. |
| algorithms/src/snark/varuna/varuna.rs | Adds V3 sponge init path using SHA-256 digest; threads varuna_version through init. |
| algorithms/src/snark/varuna/tests.rs | Updates “wrong version” logic to account for V3. |
| algorithms/src/snark/varuna/mode.rs | Adds VarunaVersion::V3 serialization support. |
| algorithms/src/snark/varuna/data_structures/proof.rs | Treats V3 proof sizing like V2. |
| algorithms/src/snark/varuna/ahp/verifier/verifier.rs | Treats V3 verifier challenge derivation like V2. |
| algorithms/src/snark/varuna/ahp/verifier/messages.rs | Treats V3 third-round challenge selection like V2. |
| algorithms/src/snark/varuna/ahp/prover/round_functions/third.rs | Treats V3 third-round prover behavior like V2. |
| algorithms/src/crypto_hash/poseidon.rs | Optimizes Poseidon S-box for alpha ∈ {3,5,17} via specialized exponentiation. |
| algorithms/benches/snark/varuna_verifier.rs | Adds artifact-based bench for profiling batch_verify without Criterion overhead. |
| algorithms/benches/snark/varuna.rs | Adds Criterion scaling benchmark for verify_batch on selected circuit/batch configs. |
| algorithms/Cargo.toml | Registers the new varuna_verifier bench. |
| Cargo.toml | Includes snarkvm-console/dev-print in workspace dev-print feature. |
| Cargo.lock | Adds rayon to dependency graph where needed. |
| .gitignore | Ignores varuna_verifier_artifacts/ and flamegraph.svg. |
| .circleci/config.yml | Adds a branch name to merge-workflow conditions. |
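
The Poseidon and `Field::pow` rows above can be illustrated with a small sketch. It uses a toy prime field over `u64` (an assumption made for the example; the real code works on snarkVM field types with `square_in_place` and `mul_assign`) and compares a generic square-and-multiply, which starts at the highest set bit of the exponent, with fixed multiplication chains for alpha ∈ {3, 5, 17}.

```rust
/// Toy prime modulus (2^61 - 1); a stand-in for the real field modulus.
const P: u64 = (1 << 61) - 1;

fn mul(a: u64, b: u64) -> u64 {
    (((a as u128) * (b as u128)) % (P as u128)) as u64
}

fn square(a: u64) -> u64 {
    mul(a, a)
}

/// Generic MSB-first square-and-multiply that skips the leading zero bits of `e`.
fn pow_generic(x: u64, e: u64) -> u64 {
    let significant_bits = 64 - e.leading_zeros();
    let mut acc = 1u64;
    for i in (0..significant_bits).rev() {
        acc = square(acc);
        if (e >> i) & 1 == 1 {
            acc = mul(acc, x);
        }
    }
    acc
}

/// S-box with hard-coded chains for the three exponents named in the PR.
fn sbox(x: u64, alpha: u64) -> u64 {
    match alpha {
        3 => mul(square(x), x),                          // 1 squaring + 1 multiplication
        5 => mul(square(square(x)), x),                  // 2 squarings + 1 multiplication
        17 => mul(square(square(square(square(x)))), x), // 4 squarings + 1 multiplication
        _ => pow_generic(x, alpha),                      // fallback for other exponents
    }
}

fn main() {
    for x in [0u64, 1, 2, 123_456_789, P - 1] {
        for alpha in [3u64, 5, 17] {
            assert_eq!(sbox(x, alpha), pow_generic(x, alpha));
        }
    }
    println!("2^17 mod P = {}", sbox(2, 17));
}
```

The fixed chains avoid the per-bit loop and branching of the generic routine, which is presumably where the gain reported in the PR description comes from.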
The hasher lookup path was substantially refactored (combined-chunk indexing + trailing handling) but existing tests in this module only validate input sizing, not that the hash output is stable. Please add at least one deterministic test vector (fixed domain + fixed input bits -> expected x-coordinate / field hash) to catch accidental output changes in future optimizations.

        }
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        type CurrentEnvironment = Console;

        fn reference_hash_uncompressed<const NUM_WINDOWS: u8, const WINDOW_SIZE: u8>(
            hasher: &BHPHasher<CurrentEnvironment, NUM_WINDOWS, WINDOW_SIZE>,
            input: &[bool],
        ) -> Group<CurrentEnvironment> {
            let input = if input.len() % BHP_CHUNK_SIZE != 0 {
                let padding = BHP_CHUNK_SIZE - (input.len() % BHP_CHUNK_SIZE);
                let mut padded_input = vec![false; input.len() + padding];
                padded_input[..input.len()].copy_from_slice(input);
                Cow::Owned(padded_input)
            } else {
                Cow::Borrowed(input)
            };
            input
                .chunks(WINDOW_SIZE as usize * BHP_CHUNK_SIZE)
                .zip(hasher.bases_lookup.iter())
                .flat_map(|(window_bits, bases)| {
                    window_bits.chunks_exact(BHP_CHUNK_SIZE).enumerate().map(move |(triplet_index, chunk_bits)| {
                        let idx =
                            (chunk_bits[0] as usize) | (chunk_bits[1] as usize) << 1 | (chunk_bits[2] as usize) << 2;
                        bases[triplet_index][idx]
                    })
                })
                .sum()
        }

        #[test]
        fn test_hash_uncompressed_regression_vector() -> Result<()> {
            let hasher = BHPHasher::<CurrentEnvironment, 2, 8>::setup("codeql.bhp.hash_uncompressed.regression")?;
            let input = vec![
                true, false, true, true, false, false, true, false, true, false, true, true, false, false, true,
                true, false, true, false, false, true, true, true, false, true,
            ];
            let output = hasher.hash_uncompressed(&input)?;
            let expected = reference_hash_uncompressed(&hasher, &input);
            assert_eq!(output, expected);
            Ok(())
        }
    }

This was implicitly being done for some of the hashers by running the bench on transactions which had been generated before the hash optimisation. In any case, a full, relatively comprehensive test file with vectors for many input lengths has been added now.
console/algorithms/src/bhp/hasher/tests.rs
For when we reopen this: cf. Łukasz's comment: #3249 (comment)
Verifier profiling
Overrides PR #2964. Closes #2871. Thanks go to @vicsn and @davencyw for useful brainstorming.
This PR introduces:
- A few optimisations to `check_transaction`, both within the Varuna verifier (`batch_verify`) and outside of it. One of them affects Varuna proof generation and verification and is therefore guarded behind a new `VarunaVersion::V3`, corresponding in turn to a new `ConsensusVersion::V16`. The other three are purely implementational and don’t affect the output of the functions they concern. Section Performed optimisations below explains the optimisations in question. Section Discarded approaches discusses other optimisation directions which were explored and discarded for various reasons.
- Machinery to benchmark and profile the following:
  - `batch_verify` on test circuits of the given sizes and batch configurations using criterion (new benchmark case added to an existing file).
  - `batch_verify` as in the previous point, but without criterion.
  - `check_transaction` on a program of interest: a large (set of two) program(s) with many calls and output records, due to @vicsn.
  - `prepare_advance_to_next_quorum_block` on a block with 8 transactions of the type in the previous point. This benchmark can also be run on a block with 8 `transfer_public` transactions. Note: the number 8 comes from the bound `MAXIMUM_CONFIRMED_TRANSACTIONS`, which is set to 8 in `test` runs. This can be increased during experimentation.

The commands to run the benchmarks are detailed in section What to run below.
Note that, except for the first pair, the benchmark cases above are incremental (`prepare_advance_to_next_quorum_block` calls `check_transaction`, which in turn calls `batch_verify`). Crucially, the last three cases above follow the same two-step flow: 1) generate artifacts, then 2) obtain metrics. The two are done in separate commands as explained below. The main reasons for the split are:
These benchmarks can also be used when testing validator/verifier (perhaps also prover) optimisations in the future.
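
The two-step flow (generate artifacts, then obtain metrics) can be sketched as a tiny harness. This is an illustration only: the artifact file name, the directory and the byte-sum "workload" are all hypothetical stand-ins, and the actual benches in this PR serialise proofs and transactions rather than a byte vector.

```rust
use std::path::Path;
use std::time::{Duration, Instant};
use std::{env, fs, io};

#[derive(Debug, PartialEq)]
enum Mode {
    Generate,
    Measure,
    Clean,
}

/// Picks the mode from the command line; unknown arguments are ignored.
fn parse_mode<I: IntoIterator<Item = String>>(args: I) -> Mode {
    let mut mode = Mode::Measure;
    for arg in args {
        match arg.as_str() {
            "--generate" => mode = Mode::Generate,
            "--clean" => mode = Mode::Clean,
            _ => {}
        }
    }
    mode
}

/// Step 1: build the expensive inputs once and store them on disk.
fn generate(dir: &Path) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let artifact: Vec<u8> = (0..=255u8).collect(); // stand-in for proofs/transactions
    fs::write(dir.join("artifact.bin"), artifact)
}

/// Step 2: load the stored inputs and time only the operation of interest.
fn measure(dir: &Path) -> io::Result<(u64, Duration)> {
    let artifact = fs::read(dir.join("artifact.bin"))?;
    let start = Instant::now();
    let checksum: u64 = artifact.iter().map(|&b| b as u64).sum(); // stand-in workload
    Ok((checksum, start.elapsed()))
}

fn clean(dir: &Path) -> io::Result<()> {
    if dir.exists() { fs::remove_dir_all(dir) } else { Ok(()) }
}

fn main() -> io::Result<()> {
    let dir = env::temp_dir().join("two_step_bench_sketch");
    match parse_mode(env::args().skip(1)) {
        Mode::Generate => generate(&dir)?,
        Mode::Clean => clean(&dir)?,
        Mode::Measure => match measure(&dir) {
            Ok((checksum, elapsed)) => println!("checksum = {}, elapsed = {:?}", checksum, elapsed),
            Err(e) if e.kind() == io::ErrorKind::NotFound => {
                eprintln!("no artefacts found; run with --generate first")
            }
            Err(e) => return Err(e),
        },
    }
    Ok(())
}
```

Keeping generation out of the measured run means a profiler attached to the second command sees only the operation under study.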
Minor comments
- `varuna_version_from_consensus`. This conversion was manually performed in several independent places (now updated to point to the function), giving rise to potential inconsistencies. As mentioned above, there is now a new `VarunaVersion::V3` for `ConsensusVersion::V16` onwards.
- […] `MAX_DEPLOYMENT_VARIABLES` and `MAX_DEPLOYMENT_CONSTRAINTS` from `1 << 21` to `1 << 23`, which was necessary to accommodate the multi-record program of interest. Work on better handling of large-deployment bounds is underway in e.g. "Increase deploy limits by 8x" (#2955).
- `rayon` was not a dependency in e.g. `snarkvm-console`. I have added it (and a corresponding `--serial` feature wired appropriately) in order to parallelise the setup of the BHP hashers (which amounts to a number of EC-point additions), but if there are reasons against this, simply let me know.

Results
The following table contains the time measurements on my laptop (Apple M4 Pro chip, 48 GB RAM) for reference:
For pure verification (first two horizontal blocks in the image, corresponding to the first two benchmark cases specified earlier), gains occur in all circuit/batch configurations. They are particularly dramatic (>94% speedup in parallel, 90% in serial) in the circuit configuration mimicking the multi-record program of interest.
For `check_transaction` on an execution of the multi-record program (third block in the image, third benchmark case above), the gains are also very substantial at ~77%. These come from the speedups in `batch_verify` (related to the previous comments) as well as validator optimisations outside of that.
Finally, in `prepare_advance_to_next_quorum_block`, the gains in question get somewhat diluted by the rest of the validator work, which has not been optimised, resulting in […] `transfer_public` transactions only.
It should be noted that the gains would likely be larger in non-test scenarios, where blocks can contain 50 transactions instead of the 8 used in the fourth benchmark and `check_transaction` work therefore accounts for a larger proportion of `prepare_advance_to_next_quorum_block` work than in our measurements.

What to run
All commands are indicated as executed from the root directory `snarkVM`. The `-p <crate_name>` part of the commands can be left out if in the relevant directory.

Criterion benchmark for custom test circuits:
(`algorithms/benches/snark/varuna.rs`, `fn snark_batch_verify_scaling`)
`cargo bench -p snarkvm-algorithms --bench varuna --features test -- "snark_batch_verify_scaling"`

Profiling `batch_verify` on test circuits:
(`algorithms/benches/snark/varuna_verifier.rs`)
To generate the proof and associated data (after setting the desired batch sizes in the file):
`cargo bench -p snarkvm-algorithms --bench varuna_verifier --features test -- --generate`
To get a verification-time measurement (after generating artefacts as above):
`cargo bench -p snarkvm-algorithms --bench varuna_verifier --features test`
To flamegraph `batch_verify` using previously generated artefacts:
`cargo flamegraph -p snarkvm-algorithms --bench varuna_verifier --features "test, serial" -- --generate`
The `serial` flag can be removed to activate parallelisation.
To clean the artefacts (which are never added to git in any case):
`cargo bench -p snarkvm-algorithms --bench varuna_verifier --features test -- --clean`

Profiling `check_transaction` on the multi-record program:
(`synthesizer/benches/check_transaction_multirecord/main.rs`)
To generate the deployment and execution transactions:
`cargo bench -p snarkvm-synthesizer --bench check_transaction_multirecord -- --generate`
To get a time measurement for `check_transaction` (after generating artefacts as above):
`cargo bench -p snarkvm-synthesizer --bench check_transaction_multirecord`
To flamegraph `check_transaction` using previously generated artefacts:
`cargo flamegraph -p snarkvm-synthesizer --bench check_transaction_multirecord --features serial`
The `serial` flag can be removed to activate parallelisation.
To clean the artefacts (which are never added to git in any case):
`cargo bench -p snarkvm-synthesizer --bench check_transaction_multirecord -- --clean`

Profiling `prepare_advance_to_next_quorum_block` on the multi-record program:
(`ledger/benches/prepare_advance_multirecord/main.rs`)
To generate the deployment and execution transactions:
`cargo bench -p snarkvm-ledger --bench prepare_advance_multirecord --features="test-helpers, rocks" -- --generate`
The variable `n_transactions` controls how many transactions are included in the block. If > 8, the test value of `ConsensusState::MAXIMUM_CONFIRMED_TRANSACTIONS` has to be increased accordingly.
To get a time measurement for `prepare_advance_to_next_quorum_block` (after generating artefacts as above):
`cargo bench -p snarkvm-ledger --bench prepare_advance_multirecord --features="test-helpers, rocks"`
To clean the artefacts (which are never added to git in any case):
`cargo bench -p snarkvm-ledger --bench prepare_advance_multirecord --features="test-helpers, rocks" -- --clean`
To run any of the above on `transfer_public` transactions instead of the multi-record program, include the command-line argument `--transfer_public` (note: `--clean` removes all artefacts for both types of programs; artefact generation is still separate).
To remove parallelisation from any of the above, include the command-line argument `--serial`. Note that deployment of `wrapper.aleo` (which occurs even without the `--generate` flag) can take several minutes in this mode.

Performed optimisations
Faster hash for sponge initialisation (functional change): When initialising the Poseidon sponge for Varuna proving/verification, we absorb some public parameters (basic circuit data) and the public inputs (cf. https://eprint.iacr.org/2023/691). This can involve a large number of field elements (as opposed to most absorptions corresponding to the interactive part of the AHP, which are substantially reduced thanks to batching), on which Poseidon spends a sizeable amount of time. We have modified this operation so that the (same) public parameters and inputs are digested with a fast hash (SHA-256), resulting in a single field element which is then absorbed by way of sponge initialisation. This is in principle compatible with potential recursive proving in the future, since the digest can be computed out-of-circuit by the proving and verifying parties and passed to the circuit as a single public input (as opposed to the sponge operations corresponding to AHP interaction, which need to happen in-circuit in recursive contexts).
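A rough sketch of why digesting first helps: the toy sponge below counts how many permutations are triggered when 1000 elements are absorbed directly versus when a single digest is absorbed. Everything here is an assumption made for illustration: the toy permutation is not Poseidon, `DefaultHasher` merely stands in for SHA-256 (the Rust standard library has none), and reducing the digest modulo a toy prime is a placeholder for whatever digest-to-field mapping the PR actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const RATE: usize = 2; // matches the rate mentioned in the PR description
const WIDTH: usize = RATE + 1; // rate plus a toy capacity of 1
const P: u64 = (1 << 61) - 1; // toy prime, not the real field modulus

struct ToySponge {
    state: [u64; WIDTH],
    pos: usize,
    permutations: usize,
}

impl ToySponge {
    fn new() -> Self {
        Self { state: [0; WIDTH], pos: 0, permutations: 0 }
    }

    /// Toy mixing step (NOT Poseidon); only the call count matters here.
    fn permute(&mut self) {
        self.permutations += 1;
        let old = self.state;
        for i in 0..WIDTH {
            let mixed = (old[i] as u128 + old[(i + 1) % WIDTH] as u128 + i as u128 + 1) % (P as u128);
            self.state[i] = ((mixed * mixed) % (P as u128)) as u64;
        }
    }

    /// Adds `x` into the next rate slot, permuting first if the rate part is full.
    fn absorb(&mut self, x: u64) {
        if self.pos == RATE {
            self.permute();
            self.pos = 0;
        }
        self.state[self.pos] = (self.state[self.pos] + x % P) % P;
        self.pos += 1;
    }

    /// Final permutation; returns one output element and the permutation count.
    fn finish(mut self) -> (u64, usize) {
        self.permute();
        (self.state[0], self.permutations)
    }
}

/// Stand-in for the fast hash: `DefaultHasher` replaces SHA-256 in this sketch.
fn fast_digest(inputs: &[u64]) -> u64 {
    let mut hasher = DefaultHasher::new();
    inputs.hash(&mut hasher);
    hasher.finish() % P
}

fn absorb_directly(inputs: &[u64]) -> (u64, usize) {
    let mut sponge = ToySponge::new();
    for &x in inputs {
        sponge.absorb(x);
    }
    sponge.finish()
}

fn absorb_digest(inputs: &[u64]) -> (u64, usize) {
    let mut sponge = ToySponge::new();
    sponge.absorb(fast_digest(inputs));
    sponge.finish()
}

fn main() {
    let inputs: Vec<u64> = (0..1000).collect();
    println!("direct absorption: {} permutations", absorb_directly(&inputs).1);
    println!("digest absorption: {} permutations", absorb_digest(&inputs).1);
}
```

In this toy model the permutation count drops from about n / RATE to one, at the price of one fast-hash pass over the same data.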
This affects proof generation and verification and is therefore guarded behind a new `VarunaVersion::V3`.
Improved Poseidon S-boxes (no functional changes): An important part of the Poseidon permutation involves raising each element of the sponge’s internal state to a fixed constant `alpha` (this transformation is known as the S-box). In the case of the Fiat-Shamir configuration used by Varuna, the constant has the value `alpha = 17`. The other two values present in the codebase are 3 and 5. In the original implementation, this S-box called the generic `pow` method from `Field`. Replacing that with a custom-coded method consisting of a suitable number of calls to `square_in_place` and `mul_assign` resulted in a substantial performance gain. While exploring this topic, the `pow` method itself was slightly optimised in a trivial way (switching from `BitIteratorBE::new` to `BitIteratorBE::new_without_leading_zeros`). Specifically, running the second benchmark case above before and after the change (commit `3ecf17ea3`) results in a `batch_verify` time of 144ms vs. 112ms.
Improved BHP performance through base preprocessing (no functional changes): We use the BHP hash in order to compress record ciphertexts into a checksum (computed in-circuit and matched against a natively computed value). For transitions outputting many records, such as the multi-record program, this takes a substantial amount of validator work. At its core, the BHP hash has (for each of our several configurations) a predefined basis consisting of a number of arrays, each containing 8 elliptic-curve points. The hash function splits the input into three-bit chunks, interprets each chunk as an index 0…7 and looks up the corresponding point in the associated 8-point basis vector. All looked-up points are then added together, resulting in the hash value.
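The lookup-and-sum structure just described can be sketched with a toy additive group (integers modulo a prime) in place of elliptic-curve points; the pseudo-random "bases" and all function names are invented for the example. The sketch also builds a pairwise-combined table, the simplest instance of the preprocessing discussed next, and checks that both paths give the same value.

```rust
const P: u64 = (1 << 61) - 1; // toy group: integers mod P under addition
const CHUNK_BITS: usize = 3;

fn add(a: u64, b: u64) -> u64 {
    (a + b) % P
}

/// Deterministic toy bases: one row of 8 "points" per 3-bit chunk position.
fn toy_bases(num_chunks: usize) -> Vec<[u64; 8]> {
    let mut seed: u64 = 0x9E37_79B9_7F4A_7C15;
    (0..num_chunks)
        .map(|_| {
            let mut row = [0u64; 8];
            for entry in row.iter_mut() {
                seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
                *entry = seed % P;
            }
            row
        })
        .collect()
}

/// Splits bits into 3-bit chunks (missing trailing bits count as false)
/// and reads each chunk as an index in 0..8, lowest bit first.
fn chunk_indices(bits: &[bool]) -> Vec<usize> {
    bits.chunks(CHUNK_BITS)
        .map(|chunk| chunk.iter().enumerate().fold(0usize, |acc, (i, &b)| acc | ((b as usize) << i)))
        .collect()
}

/// Plain path: one lookup and one addition per 3-bit chunk.
fn hash_plain(bases: &[[u64; 8]], bits: &[bool]) -> u64 {
    chunk_indices(bits).iter().zip(bases).fold(0, |acc, (&idx, row)| add(acc, row[idx]))
}

/// Preprocessing for N = 2: all 8 * 8 sums for each pair of consecutive rows.
fn combine_pairs(bases: &[[u64; 8]]) -> Vec<[u64; 64]> {
    bases
        .chunks_exact(2)
        .map(|pair| {
            let mut table = [0u64; 64];
            for hi in 0..8 {
                for lo in 0..8 {
                    table[lo | (hi << 3)] = add(pair[0][lo], pair[1][hi]);
                }
            }
            table
        })
        .collect()
}

/// Combined path: one lookup per pair of chunks, plain lookup for a trailing chunk.
fn hash_combined(bases: &[[u64; 8]], combined: &[[u64; 64]], bits: &[bool]) -> u64 {
    let indices = chunk_indices(bits);
    let mut acc = 0;
    for (k, pair) in indices.chunks(2).enumerate() {
        acc = match pair {
            &[lo, hi] => add(acc, combined[k][lo | (hi << 3)]),
            &[last] => add(acc, bases[2 * k][last]),
            _ => unreachable!(),
        };
    }
    acc
}

fn main() {
    let bits: Vec<bool> = (0..25).map(|i| i % 3 != 1).collect(); // 25 bits -> 9 chunks
    let bases = toy_bases(9);
    let combined = combine_pairs(&bases);
    assert_eq!(hash_plain(&bases, &bits), hash_combined(&bases, &combined, &bits));
    println!("plain and combined lookups agree: {}", hash_plain(&bases, &bits));
}
```

For the 25-bit input, the plain path performs nine lookups while the combined path performs five (four pairs plus one trailing chunk).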
An easy way to speed up this computation is to preprocess N chunks together. For instance, for N = 2, one can consider all 8 * 8 = 64 possible sums of elements of the two 8-point vectors corresponding to two consecutive chunks. The size of this combined preprocessed table is roughly 8 times that of the original basis, and its computation incurs more setup work (which is lazily done once for each BHP configuration when needed; the plain BHP bases are not stored to or read from disk). However, the extended preprocessed table reduces the hashing work (which simply consists of addition of curve points) by 50%, since now the hash can effectively look up (and add the looked-up values of) half as many 6-bit chunks as it did 3-bit chunks originally. This optimisation has been added, and the value N has been included as a customisable constant `BHP_NUM_COMBINED_CHUNKS` (note: this value does not have to evenly divide the number of chunks of the given BHP configuration; trailing chunks are handled appropriately). Notably, the implementation does not change the output value of the hash and is fully backwards compatible.
As part of the exploration to find a good value for `BHP_NUM_COMBINED_CHUNKS`, code has been added (behind the `dev-print` feature) to display the hasher’s setup time and the resulting hasher (key) size (the code contains a comment with the exact formula guiding key-size growth). The results are as follows (columns = different values of `BHP_NUM_COMBINED_CHUNKS`; rows = our four different BHP hashers):
In light of this, the chosen value is `BHP_NUM_COMBINED_CHUNKS = 5`. This results in a total key size of 0.8 GB for the four (combined) hashers (in exchange for ~5x faster BHP hashing, although this depends on the exact input length), whereas switching to the next value of 6 would result in a total of ~5.8 GB of hasher keys. Setup time is very low (taking into account it is only done once), at 0.256s in total for the four hashers. Note this time corresponds to `BHP::setup` done partly in parallel, which is something this PR also adds (without parallelisation, the 0.256s increases to ~2s).
Parallelisation of output verifications (suggested by @vicsn): When checking a transition, validators call `output.verify(...)` on each of its outputs. This was done sequentially before, and switching to parallel verification speeds up `check_transaction` (if idle cores are available) in flows where this step is expensive. For instance, verification of output records involves hashing the published ciphertext and matching it against the corresponding public circuit input. In the multi-record program of interest, where there are ~480 output records, this is a substantial amount of hashing work.

Discarded approaches
- […] `7e898fdc4` and did not bring any performance improvements. However, it made the const generics a fair bit more complex (one cannot define an array with size `RATE + CAPACITY` without using the nightly feature `generic_const_exprs`, so `RATE_PLUS_CAPACITY` had to be introduced…).
- […] `RATE` field elements absorbed and some of the test cases involving many absorptions (this is the case, for instance, when many circuit instances are batched), it was conceivable that increasing the rate (from the current 2 to the maximum available 8) would lead to fewer permutations (for the same amount of field elements absorbed), and therefore better performance. This was tested and had, in fact, a ~20% negative impact on performance (parts of the permutation itself, such as the state-by-matrix product, grow with the rate too).
- […] `absorb_sums` step within Varuna: The idea was roughly to digest separate groups of elements corresponding to each instance together in parallel, then absorb the digests into the sponge. This was discarded mainly because the cryptography of it was dubious and it would likely involve a non-Poseidon hash for the digest (or a fresh Poseidon sponge), which had numerous disadvantages.
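
The const-generics limitation mentioned in the first item can be reproduced on stable Rust. The sketch below is not the code from the discarded commit; the struct and method names are made up. It shows the usual workaround: a third const parameter carries the sum, and a compile-time assertion keeps it consistent with the other two.

```rust
/// On stable Rust, `[u64; RATE + CAPACITY]` is rejected for generic `RATE` and
/// `CAPACITY`, so the width is passed as a separate parameter instead.
struct SpongeState<const RATE: usize, const CAPACITY: usize, const RATE_PLUS_CAPACITY: usize> {
    state: [u64; RATE_PLUS_CAPACITY],
}

impl<const RATE: usize, const CAPACITY: usize, const RATE_PLUS_CAPACITY: usize>
    SpongeState<RATE, CAPACITY, RATE_PLUS_CAPACITY>
{
    /// Checked at compile time for every instantiation that calls `new`.
    const WIDTH_OK: () =
        assert!(RATE + CAPACITY == RATE_PLUS_CAPACITY, "RATE + CAPACITY must equal RATE_PLUS_CAPACITY");

    fn new() -> Self {
        let () = Self::WIDTH_OK; // referencing the constant forces its evaluation
        Self { state: [0; RATE_PLUS_CAPACITY] }
    }

    fn rate_part(&self) -> &[u64] {
        &self.state[..RATE]
    }

    fn capacity_part(&self) -> &[u64] {
        &self.state[RATE..]
    }
}

fn main() {
    let sponge = SpongeState::<2, 1, 3>::new();
    println!("rate slots: {}, capacity slots: {}", sponge.rate_part().len(), sponge.capacity_part().len());
    // `SpongeState::<2, 1, 4>::new()` would fail to compile because of `WIDTH_OK`.
}
```

The extra parameter has to be threaded through every type and impl that mentions the state, which is the added complexity the first item refers to.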