ZK Constraint Dataset

A dataset of zero-knowledge circuit examples for fine-tuning LLMs to identify insufficient constraints and other security vulnerabilities in ZK proof systems. Examples are framed as an auditor reviewing a piece of circuit / proof-system code and reporting the vulnerability and a fix.

Scope

Originally Circom-only, the dataset now generalizes across ZK DSLs and frameworks. The shared system prompt reflects this:

You are a Zero-Knowledge Proof security auditor. You review ZK circuits and proof-system code across DSLs and frameworks (such as Circom, Halo2, gnark, Arkworks, Cairo, Plonky3, and RISC Zero) to identify under-constrained signals, soundness and completeness violations, and other logical or cryptographic vulnerabilities.

Vulnerability classes covered include under-constrained / unconstrained signals, missing range and boolean checks, missing input bindings, comparator/field overflows, soundness and completeness breaks, nullifier/replay issues, and business-logic bypasses.

Files

train.jsonl: training examples (hand-written pedagogical circuits + audit-report-derived rows).
valid.jsonl: held-out validation examples.
synthetic.jsonl: additional synthetic examples.
zk_security.jsonl: examples derived from public security audit findings; every row carries a source object.

Data Sources

Audit-derived rows are built from the public zkbugs collection of ZK vulnerability reports and from public security advisories, issues, and pull requests. Findings are credited to their auditors (Veridise, ZKSecurity, Zellic, Trail of Bits, ABDK, yAcademy, Matter Labs, OpenZeppelin, HashCloak, Least Authority, NCC Group, Hexens, Inference, and independent researchers) in each row's source object. Where the original codebase is available the vulnerable code is used verbatim; otherwise the circuit is faithfully reconstructed from the finding's description, location, and proposed mitigation. Every audit-derived row includes a source object (name, link, protocol) so readers can open the underlying report.

License

Original dataset structure, synthetic examples, and editorial content are released under MIT.

Audit-derived rows are based on code from upstream projects and inherit their respective licenses as documented in the table below. Users must comply with the upstream license for any rows they use or redistribute. This dataset's only source-available (non-open-source) license is BSL-1.1 (see note on Panther Protocol).

Upstream code provenance & licensing

Each audit-derived example is based on code from the upstream project listed below. The license shown is the project's declared license; review and comply with it before redistributing or training on the corresponding rows.

Project	Upstream repository	License	Auditor(s)	Rows
Aptos Keyless	aptos-labs/keyless-zk-proofs	`GPL-3.0`	koukyosyumei	1
Arianee	Arianee/arianee-sdk	`MIT` (package.json)	Veridise	2
arkworks r1cs-std	arkworks-rs/r1cs-std	`MIT OR Apache-2.0`	arkworks	1
Blake3 Circom	banyancomputer/hot-proofs-blake3-circom	`MIT`	koukyosyumei	1
circom-bigint	0xbok/circom-bigint	`GPL-3.0`	Veridise	1
circomlib	iden3/circomlib	`LGPL-3.0`	Kobi Gurkan, Veridise	9
Filecoin rust-fil-proofs	filecoin-project/rust-fil-proofs	`MIT OR Apache-2.0`	Trapdoor Tech	1
gnark	Consensys/gnark	`Apache-2.0`	Consensys	1
Herodotus	HerodotusDev/offchain-evm-headers-processor	`GPL-3.0`	ABDK	7
iden3 circuits	iden3/circuits	`GPL-3.0`	Trail of Bits	1
Lurk	lurk-lab/lurk-rs	`MIT OR Apache-2.0`	Inference	5
MACI	privacy-scaling-explorations/maci	`MIT`	HashCloak	1
Neptune	lurk-lab/neptune	`MIT OR Apache-2.0`	Inference	1
OpenVM	openvm-org/openvm	`MIT OR Apache-2.0`	OpenVM	2
Panther Protocol	pantherfoundation/panther-core	`BSL-1.1` at audited commit → since relicensed `LGPL-3.0` (see note)	Veridise	12
Penumbra	penumbra-zone/penumbra	`MIT OR Apache-2.0`	NCC Group, ZKSecurity	4
Polygon zkEVM	0xPolygonHermez/zkevm-proverjs	`AGPL-3.0`	Hexens	2
Rarimo Passport	rarimo/passport-zk-circuits	`MIT`	koukyosyumei	1
RISC Zero	risc0/risc0	`Apache-2.0`	RISC Zero	3
RLN	Rate-Limiting-Nullifier/circom-rln	`MIT OR Apache-2.0`	Veridise	1
Scroll MPT	scroll-tech/mpt-circuit	`MIT`	Trail of Bits, Zellic	7
Scroll Poseidon	scroll-tech/poseidon-circuit	`Apache-2.0`	Zellic	1
Scroll zkEVM	scroll-tech/zkevm-circuits	`MIT`	Trail of Bits, Zellic	26
Self Protocol	selfxyz/self	`MIT` (circuits package; monorepo is per-package)	ZKSecurity	8
Semaphore	semaphore-protocol/semaphore	`MIT`	Veridise	1
Sismo Hydra-S2	sismo-core/hydra-s2-zkps	`MIT`	Veridise	1
SIV	siv-org/verifiable-private-overrides	`ISC` (package.json)	koukyosyumei	2
SP1	succinctlabs/sp1	`MIT OR Apache-2.0`	Succinct	6
Spartan-ECDSA	personaelabs/spartan-ecdsa	`MIT` (package.json)	yAcademy	3
StarkEx Perpetual	starkware-libs/stark-perpetual	`Apache-2.0`	ABDK	1
Summa	summa-dev/summa-solvency	`Apache-2.0`	Summa	1
Tangle Network	tangle-network/protocol-solidity	`MIT OR Apache-2.0`	Veridise	1
Telepathy	succinctlabs/telepathy-circuits	`GPL-3.0`	Trail of Bits, Veridise	6
UniRep	Unirep/Unirep	`MIT`	Veridise	2
WORM Proof of Burn	worm-privacy/proof-of-burn	`MIT`	koukyosyumei	1
ZK Email	zkemail/zk-email-verify	`MIT`	Matter Labs, ZKSecurity	5
Zkopru	zkopru-network/zkopru	`GPL-3.0`	Least Authority	2
zkSync Social Login	Moonsong-Labs/zksync-social-login-circuit	`MIT` (package.json)	OpenZeppelin	2

Notes:

Licenses for the Circom projects were read from the LICENSE/package.json at the exact audited commit the code was taken from; for the other projects the value is GitHub's detected SPDX license on the default branch. Where a project's license changed over time, the table reflects the version relevant to the code we used.
⚠️ Panther Protocol license change. At the audited commit the code we used was licensed under the Business Source License 1.1 (BSL-1.1), a source-available (not OSI-approved, not open-source) license with use restrictions. The project has since relicensed upstream to LGPL-3.0. The 12 Panther-derived rows reflect BSL-1.1-era code, so evaluate them against BSL-1.1 terms (the LGPL-3.0 relicensing does not necessarily apply retroactively to that commit). This is the dataset's only source-available (non-open-source) license.
MIT OR Apache-2.0 denotes dual-licensed projects (two LICENSE-* files); recipients may choose either license.
Sources without a clear, verifiable license were excluded from the dataset: Dark Forest (darkforest-v0.3, all rights reserved / review-only), Inference Labs Subnet (subnet-2-circom, no license declared), and Reclaim Chacha20 (circom-chacha20, repository no longer reachable). Rows derived from these projects have been removed from all *.jsonl files.
License identifiers are provided for convenience and are not legal advice.
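The licensing notes above imply that some users will want to drop rows from particular upstream projects, for example the BSL-1.1-era Panther Protocol rows, before training or redistributing. The following is a minimal sketch of such a filter. It assumes that the `protocol` field of each row's source object matches the "Project" column of the table; the README does not confirm this, so check the actual values in the data first. The names `EXCLUDED_PROTOCOLS`, `keep_row`, and `filter_rows` are hypothetical helpers, not part of the dataset.

```python
import json

# Assumption: source["protocol"] uses the same project names as the table above.
EXCLUDED_PROTOCOLS = {"Panther Protocol"}


def keep_row(row, excluded=EXCLUDED_PROTOCOLS):
    """Return True unless the row comes from an excluded upstream project."""
    source = row.get("source")
    if source is None:
        # No source object: not an audit-derived row, so it is kept.
        return True
    protocol = str(source.get("protocol", "")).strip().casefold()
    return protocol not in {name.casefold() for name in excluded}


def filter_rows(lines, excluded=EXCLUDED_PROTOCOLS):
    """Yield parsed rows from JSONL lines, skipping excluded projects."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        if keep_row(row, excluded):
            yield row


# Tiny in-memory demonstration with made-up rows.
sample_lines = [
    json.dumps({"messages": [], "source": {"name": "r1", "link": "https://example.com", "protocol": "Panther Protocol"}}),
    json.dumps({"messages": [], "source": {"name": "r2", "link": "https://example.com", "protocol": "Semaphore"}}),
    json.dumps({"messages": []}),
]
kept = list(filter_rows(sample_lines))
```

The same function can be applied to an open file handle, since iterating over a text file yields one line at a time.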

Data Format

Each line in a JSONL file is one complete training example. Audit-derived rows include a source object; purely synthetic examples may omit it.

{
  "messages": [
    {"role": "system", "content": "You are a Zero-Knowledge Proof security auditor. You review ZK circuits and proof-system code across DSLs and frameworks (such as Circom, Halo2, gnark, Arkworks, Cairo, Plonky3, and RISC Zero) to identify under-constrained signals, soundness and completeness violations, and other logical or cryptographic vulnerabilities."},
    {"role": "user", "content": "Audit this circuit for vulnerabilities:\n\n```circom\n...```"},
    {"role": "assistant", "content": "Vulnerability: ...\n\nExplanation: ...\n\nFix: ...\n\n```circom\n...```"}
  ],
  "source": {
    "name": "...",
    "link": "https://...",
    "protocol": "..."
  }
}

The fenced code block in the user and assistant turns is tagged with the example's language: circom for Circom examples, and the relevant tag (rust, go, cairo, ...) for non-Circom examples.
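The row layout described above can be checked mechanically. The sketch below parses one JSONL line, verifies the three roles and the optional source object, and extracts the language tags of the fenced code blocks. It assumes every row has exactly one system, one user, and one assistant turn in that order, as in the example; rows with a different shape would need a looser check. `validate_row` and `fence_languages` are hypothetical helper names.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep this block readable
EXPECTED_ROLES = ["system", "user", "assistant"]
SOURCE_KEYS = {"name", "link", "protocol"}
# An opening fence is three backticks followed by a language tag and a newline.
FENCE_TAG = re.compile(FENCE + r"([A-Za-z0-9_+-]+)\n")


def validate_row(line):
    """Parse one JSONL line and check it against the documented layout."""
    row = json.loads(line)
    roles = [message["role"] for message in row["messages"]]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected roles: {roles}")
    # "source" is optional, but when present it should carry all three keys.
    source = row.get("source")
    if source is not None and not SOURCE_KEYS.issubset(source):
        raise ValueError(f"source is missing keys: {SOURCE_KEYS - set(source)}")
    return row


def fence_languages(row):
    """Return the language tags of fenced blocks in user/assistant turns."""
    tags = []
    for message in row["messages"]:
        if message["role"] in ("user", "assistant"):
            tags.extend(FENCE_TAG.findall(message["content"]))
    return tags


# A made-up row in the documented shape.
example = json.dumps({
    "messages": [
        {"role": "system", "content": "You are a Zero-Knowledge Proof security auditor."},
        {"role": "user", "content": "Audit this circuit for vulnerabilities:\n\n" + FENCE + "circom\n..." + FENCE},
        {"role": "assistant", "content": "Vulnerability: ...\n\nFix: ...\n\n" + FENCE + "circom\n..." + FENCE},
    ],
    "source": {"name": "example report", "link": "https://example.com", "protocol": "example"},
})
row = validate_row(example)
print(fence_languages(row))  # ['circom', 'circom']
```

Running `validate_row` over every line of a file before training is a cheap way to catch rows that drifted from the format.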

Deploying

To deploy a new version of the dataset to the Hugging Face Hub, run the following command from the repository root:

hf upload mourningdove/zk-constraint-data . --repo-type=dataset
