mourningdove007/zk-constraint-data

ZK Constraint Dataset

A dataset of zero-knowledge circuit examples for fine-tuning LLMs to identify insufficient constraints and other security vulnerabilities in ZK proof systems. Examples are framed as an auditor reviewing a piece of circuit / proof-system code and reporting the vulnerability and a fix.

Scope

Originally Circom-only, the dataset now generalizes across ZK DSLs and frameworks. The shared system prompt reflects this:

You are a Zero-Knowledge Proof security auditor. You review ZK circuits and proof-system code across DSLs and frameworks (such as Circom, Halo2, gnark, Arkworks, Cairo, Plonky3, and RISC Zero) to identify under-constrained signals, soundness and completeness violations, and other logical or cryptographic vulnerabilities.

Vulnerability classes covered include under-constrained / unconstrained signals, missing range and boolean checks, missing input bindings, comparator/field overflows, soundness and completeness breaks, nullifier/replay issues, and business-logic bypasses.

Files

  • train.jsonl: training examples (hand-written pedagogical circuits + audit-report-derived rows).
  • valid.jsonl: held-out validation examples.
  • synthetic.jsonl: additional synthetic examples.
  • zk_security.jsonl: examples derived from public security audit findings; every row carries a source object.

Data Sources

Audit-derived rows are built from the public zkbugs collection of ZK vulnerability reports and from public security advisories, issues, and pull requests. Findings are credited to their auditors (Veridise, ZKSecurity, Zellic, Trail of Bits, ABDK, yAcademy, Matter Labs, OpenZeppelin, HashCloak, Least Authority, NCC Group, Hexens, Inference, and independent researchers) in each row's source object (name, link, protocol), so readers can open the underlying report. Where the original codebase is available, the vulnerable code is used verbatim; otherwise the circuit is faithfully reconstructed from the finding's description, location, and proposed mitigation.

License

Original dataset structure, synthetic examples, and editorial content are released under MIT.

Audit-derived rows are based on code from upstream projects and inherit their respective licenses, as documented in the table below. Users must comply with the upstream license for any rows they use or redistribute. The only source-available (non-open-source) license among them is BSL-1.1 (see the note on Panther Protocol).

Upstream code provenance & licensing

Each audit-derived example is based on code from the upstream project listed below. The license shown is the project's declared license; review and comply with it before redistributing or training on the corresponding rows.

| Project | Upstream repository | License | Auditor(s) | Rows |
|---|---|---|---|---|
| Aptos Keyless | aptos-labs/keyless-zk-proofs | GPL-3.0 | koukyosyumei | 1 |
| Arianee | Arianee/arianee-sdk | MIT (package.json) | Veridise | 2 |
| arkworks r1cs-std | arkworks-rs/r1cs-std | MIT OR Apache-2.0 | arkworks | 1 |
| Blake3 Circom | banyancomputer/hot-proofs-blake3-circom | MIT | koukyosyumei | 1 |
| circom-bigint | 0xbok/circom-bigint | GPL-3.0 | Veridise | 1 |
| circomlib | iden3/circomlib | LGPL-3.0 | Kobi Gurkan, Veridise | 9 |
| Filecoin rust-fil-proofs | filecoin-project/rust-fil-proofs | MIT OR Apache-2.0 | Trapdoor Tech | 1 |
| gnark | Consensys/gnark | Apache-2.0 | Consensys | 1 |
| Herodotus | HerodotusDev/offchain-evm-headers-processor | GPL-3.0 | ABDK | 7 |
| iden3 circuits | iden3/circuits | GPL-3.0 | Trail of Bits | 1 |
| Lurk | lurk-lab/lurk-rs | MIT OR Apache-2.0 | Inference | 5 |
| MACI | privacy-scaling-explorations/maci | MIT | HashCloak | 1 |
| Neptune | lurk-lab/neptune | MIT OR Apache-2.0 | Inference | 1 |
| OpenVM | openvm-org/openvm | MIT OR Apache-2.0 | OpenVM | 2 |
| Panther Protocol | pantherfoundation/panther-core | BSL-1.1 at audited commit → since relicensed LGPL-3.0 (see note) | Veridise | 12 |
| Penumbra | penumbra-zone/penumbra | MIT OR Apache-2.0 | NCC Group, ZKSecurity | 4 |
| Polygon zkEVM | 0xPolygonHermez/zkevm-proverjs | AGPL-3.0 | Hexens | 2 |
| Rarimo Passport | rarimo/passport-zk-circuits | MIT | koukyosyumei | 1 |
| RISC Zero | risc0/risc0 | Apache-2.0 | RISC Zero | 3 |
| RLN | Rate-Limiting-Nullifier/circom-rln | MIT OR Apache-2.0 | Veridise | 1 |
| Scroll MPT | scroll-tech/mpt-circuit | MIT | Trail of Bits, Zellic | 7 |
| Scroll Poseidon | scroll-tech/poseidon-circuit | Apache-2.0 | Zellic | 1 |
| Scroll zkEVM | scroll-tech/zkevm-circuits | MIT | Trail of Bits, Zellic | 26 |
| Self Protocol | selfxyz/self | MIT (circuits package; monorepo is per-package) | ZKSecurity | 8 |
| Semaphore | semaphore-protocol/semaphore | MIT | Veridise | 1 |
| Sismo Hydra-S2 | sismo-core/hydra-s2-zkps | MIT | Veridise | 1 |
| SIV | siv-org/verifiable-private-overrides | ISC (package.json) | koukyosyumei | 2 |
| SP1 | succinctlabs/sp1 | MIT OR Apache-2.0 | Succinct | 6 |
| Spartan-ECDSA | personaelabs/spartan-ecdsa | MIT (package.json) | yAcademy | 3 |
| StarkEx Perpetual | starkware-libs/stark-perpetual | Apache-2.0 | ABDK | 1 |
| Summa | summa-dev/summa-solvency | Apache-2.0 | Summa | 1 |
| Tangle Network | tangle-network/protocol-solidity | MIT OR Apache-2.0 | Veridise | 1 |
| Telepathy | succinctlabs/telepathy-circuits | GPL-3.0 | Trail of Bits, Veridise | 6 |
| UniRep | Unirep/Unirep | MIT | Veridise | 2 |
| WORM Proof of Burn | worm-privacy/proof-of-burn | MIT | koukyosyumei | 1 |
| ZK Email | zkemail/zk-email-verify | MIT | Matter Labs, ZKSecurity | 5 |
| Zkopru | zkopru-network/zkopru | GPL-3.0 | Least Authority | 2 |
| zkSync Social Login | Moonsong-Labs/zksync-social-login-circuit | MIT (package.json) | OpenZeppelin | 2 |

Notes:

  • Licenses for the Circom projects were read from the LICENSE/package.json at the exact audited commit the code was taken from; for the other projects the value is GitHub's detected SPDX license on the default branch. Where a project's license changed over time, the table reflects the version relevant to the code we used.
  • ⚠️ Panther Protocol license change. At the audited commit the code we used was licensed under the Business Source License 1.1 (BSL-1.1), a source-available (not OSI-approved, not open-source) license with use restrictions. The project has since relicensed upstream to LGPL-3.0. The 12 Panther-derived rows reflect BSL-1.1-era code, so evaluate them against BSL-1.1 terms (the LGPL-3.0 relicensing does not necessarily apply retroactively to that commit). This is the dataset's only source-available (non-open-source) license.
  • MIT OR Apache-2.0 denotes dual-licensed projects (two LICENSE-* files); recipients may choose either license.
  • Sources without a clear, verifiable license were excluded from the dataset: Dark Forest (darkforest-v0.3, all rights reserved / review-only), Inference Labs Subnet (subnet-2-circom, no license declared), and Reclaim Chacha20 (circom-chacha20, repository no longer reachable). Rows derived from these projects have been removed from all *.jsonl files.
  • License identifiers are provided for convenience and are not legal advice.
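
Because rows inherit upstream licenses, a consumer may want to drop rows from particular projects before training or redistribution. The sketch below filters parsed rows by the protocol field of their source object. It assumes, without confirmation from the dataset, that protocol holds a project name comparable to the Project column above (for example "Panther Protocol"); inspect the actual values in the files before relying on it. The names `EXCLUDED_PROTOCOLS`, `filter_rows`, and `SAMPLE_LINES` are hypothetical, and the sample rows are invented for illustration.

```python
import json

# ASSUMPTION: source["protocol"] matches the table's Project column.
# Compare case-insensitively to tolerate minor spelling differences.
EXCLUDED_PROTOCOLS = {"panther protocol"}


def filter_rows(lines, excluded=EXCLUDED_PROTOCOLS):
    """Parse JSONL lines and keep rows whose source protocol is not excluded.

    Rows without a source object (e.g. purely synthetic rows) are kept.
    """
    kept = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        source = row.get("source")
        if not isinstance(source, dict):
            source = {}
        protocol = str(source.get("protocol", "")).strip().lower()
        if protocol in excluded:
            continue
        kept.append(row)
    return kept


# Invented sample rows; real rows also carry full "messages" content.
SAMPLE_LINES = [
    json.dumps({"messages": [], "source": {"name": "a", "link": "https://example.invalid/a", "protocol": "Panther Protocol"}}),
    json.dumps({"messages": [], "source": {"name": "b", "link": "https://example.invalid/b", "protocol": "Semaphore"}}),
    json.dumps({"messages": []}),  # synthetic-style row without a source
]

kept_rows = filter_rows(SAMPLE_LINES)
print(f"kept {len(kept_rows)} of {len(SAMPLE_LINES)} rows")
```

Filtering on the project name rather than on a license string keeps the decision explicit: the license table stays the reference, and the exclusion set records which projects a given user has chosen to leave out.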

Data Format

Each line in a JSONL file is one complete training example. Audit-derived rows include a source object; purely synthetic examples may omit it.

{
  "messages": [
    {"role": "system", "content": "You are a Zero-Knowledge Proof security auditor. You review ZK circuits and proof-system code across DSLs and frameworks (such as Circom, Halo2, gnark, Arkworks, Cairo, Plonky3, and RISC Zero) to identify under-constrained signals, soundness and completeness violations, and other logical or cryptographic vulnerabilities."},
    {"role": "user", "content": "Audit this circuit for vulnerabilities:\n\n```circom\n...```"},
    {"role": "assistant", "content": "Vulnerability: ...\n\nExplanation: ...\n\nFix: ...\n\n```circom\n...```"}
  ],
  "source": {
    "name": "...",
    "link": "https://...",
    "protocol": "..."
  }
}

Fenced code blocks in the user and assistant turns are tagged with the language of the example: circom for Circom circuits, and the relevant language (rust, go, cairo, ...) for non-Circom examples.

Related

Deploying

To deploy a new version of the dataset, we use the following command:

hf upload mourningdove/zk-constraint-data . --repo-type=dataset
