Experimenting with different adder implementations#7993
Draft
tautschnig wants to merge 432 commits into
Draft
Conversation
e86df44 to
1a8044c
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #7993 +/- ##
===========================================
- Coverage 80.46% 80.32% -0.14%
===========================================
Files 1704 1704
Lines 188665 188973 +308
Branches 73 73
===========================================
- Hits 151805 151801 -4
- Misses 36860 37172 +312 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
1a8044c to
1fd90bb
Compare
7 tasks
1fd90bb to
76757cc
Compare
…ound Beame and Liew (CAV 2017 / JACM 2019) proved polynomial-size regular resolution refutations exist for commutativity and any degree-2 ring identity on array, diagonal, and Booth multipliers, and quasi-polynomial (n^O(log n)) for Wallace tree multipliers. This refutes Biere's 2016 conjecture of exponential hardness. Our empirical CDCL data (N2 experiment, tab:n2) is still valid but the framing must change: what we observe is NOT an inherent proof-complexity lower bound -- it is the gap between the *existence* of short proofs (proven) and *CDCL's ability to find them* on standard encodings. Beame and Liew themselves identify this as an open engineering problem, calling for 'guiding information to add, either to the formulas derived from the circuits or to CDCL SAT solver heuristics'. Our encoding- selection approach is precisely such guidance. Reframing applied in six places: 1. Abstract: 'SAT hardness' -> 'CDCL search difficulty'; adds reference to Beame-Liew upper bound. 2. §1 contributions (#2): 'causal hardness driver' -> 'causal driver of CDCL search difficulty' with explicit note that proof-theoretic hardness does not hold. 3. §3 Carry Propagation Hardness: new paragraph 'The gap is between proof existence and CDCL search' with Beame-Liew quotation; removes the old 'grows exponentially' claim and Haken-1985 reference as misleading analogies. 4. §3 N2 observation (1): 'Carry presence is the dominant causal factor' -> 'Carry presence dominates CDCL search difficulty'. 5. §3 N2 observation (4) and Scope of the claim: rewritten to cite Beame-Liew upper bound and frame our result as CDCL-search-relative, not proof-complexity-absolute. 6. Conclusion: reframes the carry-propagation finding as CDCL search difficulty; replaces the N1-as-future-work paragraph with 'can CDCL be steered to find Beame-Liew's polynomial proofs without encoding guidance' as the open question. Also adds to Related Work: new 'Resolution upper bounds for ring identities' paragraph introducing Beame-Liew's construction, citing their open problem, and positioning our work as addressing it via encoding engineering. New bib entries: beame2017towards (CAV), beame2019toward (JACM). Paper 1: 21 pages, builds clean. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…ing-learned-clauses
Adds two more bibliography entries in the 'escape Resolution hardness'
line of work, and extends the Related Work and Conclusion to
contextualize our encoding-selection approach alongside them.
New bib entries:
- danner2025sat: Julian Danner, 'SAT Solving Using XOR-OR-AND Normal
Forms and Cryptographic Fault Attacks', PhD thesis, University of
Passau, April 2025 (reviewers: Kreuzer, Biere). Dissertation is
citable; it develops CDXCL/CDXCLlite, a CDCL generalization to
XNF, implemented as Xorricane/2-Xornado, provably escaping
Resolution lower bounds on XOR-rich instances.
- pollitt2026factoring: Pollitt, Battleman, Fleury, Vizel, Heule,
Biere, Bryant, 'Factoring Learned Clauses', SAT 2026 (to appear).
Revives Extended Resolution in modern CDCL: CaDiCaL-ER (GlucosER-
style definition introduction during conflict analysis) and
CaDiCaL-FX (BVA extended to XOR/ITE gate detection on learned
clauses), with polynomial-size ER proofs for Tseitin formulas.
Paper changes:
- Related Work: The 'Resolution upper bounds for ring identities'
paragraph (which discusses Beame-Liew 2017/2019) is now followed
by a new 'Escaping resolution hardness in CDCL' paragraph that
positions three complementary approaches:
(i) encoding-level guidance (our paper),
(ii) solver-level generalization to stronger proof systems
(Danner 2025, and related work on CDCL + XOR reasoning
with bidirectional Res(XOR) simulation),
(iii) inprocessing-level definition introduction
(Pollitt et al. 2026 CaDiCaL-FX).
- Conclusion: updated fourth future-work item to acknowledge
that two parallel research lines (solver-level generalization
and inprocessing-level factoring) are also active, and that
encoding-, solver-, and inprocessing-level guidance are
combinable.
The Beame-Sun 'CDCL(XOR)' paper (sat-paper90.pdf) is discussed
without citation, as it is accepted but not yet published; we
credit the broad line of work instead.
Paper 1: 21 pages, builds clean.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Authors (alphabetical by surname): - Armin Biere (Amazon Web Services and University of Freiburg) - Martin Brain (City St George's, University of London) - Daniel Kroening (Amazon Web Services) - Norbert Manthey (Amazon Web Services) - Michael Tautschnig (Amazon Web Services and Queen Mary University of London) Replaces the 'Michael Tautschnig + TBD' placeholder. Paper 1: 21 pages, builds clean. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The revised research plan set reflects the discovery that Beame &
Liew (CAV 2017, JACM 2019) have polynomial-size regular resolution
proofs of degree-2 ring identities on array, diagonal, and Booth
multipliers - refuting the exponential-hardness conjecture that
underpinned the earlier N1 plan.
**N1 (rewritten)** - research-plans/N1-cdcl-vs-beame-liew-gap.md:
Replaces the exponential-lower-bound plan (now dead). New goal:
characterize why CDCL does not find Beame-Liew's polynomial
proofs on standard multiplier encodings, and under what formula-
level transformations it does. Three proposed approaches
(empirical, interpolation-theoretical, hybrid). Supersedes
N1-exponential-lower-bound.md, which is removed.
**N3 (new)** - research-plans/N3-beame-liew-implementation.md:
Plan to implement Beame-Liew's critical-strip construction as
an executable DRAT proof generator. Reviewer communication
indicated Beame's student attempted and failed; the failure
mode is undocumented. We propose phased implementation (single
strip -> multi-strip -> full refutation -> scaling experiments)
with heuristic extraction as a stretch goal. Orthogonal to N1
but could supply Approach A's gold-standard reference proof.
**N4 (new)** - research-plans/N4-multi-encoding.md:
Ablation design for a new encoding technique: encode the same
multiplication with two different encodings, shared inputs
and outputs, separate internal variables, output-equality
constraints. Implemented as 'secondary_encoding' string on
bv_utilst, wrapper at the top of unsigned_multiplier, env-var
CBMC_MULTI_ENCODING wiring in smt2_solver. Early sanity check
on comm_12 (algebraic disabled):
- shift-add alone: 533s (T/O-prone)
- comba-cs alone: 1.04s
- multi comba-cs + shift-add: 1.08s
suggests a portfolio-like effect: ~4% overhead vs the faster
single encoding, 500x speedup vs the slower one. Full
ablation sweep is running in the background; plan describes
three possible outcomes (positive / portfolio / negative)
and their writeup implications.
Implementation details:
- src/solvers/flattening/bv_utils.h: new member
'secondary_encoding' (std::string), getter/setter.
- src/solvers/flattening/bv_utils.cpp: wrapper at top of
unsigned_multiplier that, when secondary_encoding is
non-empty, computes both encodings and set_equal's outputs.
Handles shift-add/comba/comba-cs/dadda/dadda-cs/wallace/
booth/block4/sortnet as valid secondary names.
- src/solvers/flattening/boolbv.h: passthrough setter.
- src/solvers/smt2/smt2_solver.cpp: reads CBMC_MULTI_ENCODING
env var in configure_encodings.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirror the cross-cutting notation and clarity fixes made in the co-authoring repository (~/multiplier-encodings.git) so the two versions of Paper 1 stay close in substance even while Martin's intro rewrite diverges. Changes applied here: - 'bitwidth' -> 'bit-width' (15 occurrences) for notation consistency. - 'partial product matrix / array' -> 'partial-product grid' throughout. - Expanded Comba description in Background to explicitly explain why the column-wise + immediate-carry combination limits inprocessing. - Rewrote E0-E4 variant descriptions in §3 with explicit accumulation-rule pseudo-code, plus a preamble clarifying that all five variants share identical partial products and differ only in the accumulator. - Added kroening2016decision and barrett2021smt bib entries (as canonical references for bit-blasting). Paper 1: 21 pages, builds clean. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
N4 ablation sweep completed: 9 benchmarks x 4 primary encodings x 7 secondary encodings (including singleton baselines) = 216 measurements on CaDiCaL with algebraic pre-solver disabled, 60s timeout. Headline finding: multi-encoding exhibits PORTFOLIO-LIKE BEHAVIOR. The solver receives both encodings of the same multiplication with shared inputs and outputs, and performs approximately as a free portfolio of the two singletons: - Median multi/min(A,B) = 1.31x (30% overhead vs best singleton) - Geom mean multi/min(A,B) = 1.23x - Median multi/max(A,B) = 0.14x (7x faster than slower singleton) - 42% of multi runs within 1.25x of min(A,B) - 40% more than 1.5x slower than min(A,B) -- trade-off is real - 44 'rescue' cases where one singleton T/Os but multi solves New Paper 1 subsection 5.4 'Multi-Encoding Combinations' presents the implementation, the ablation design, and the portfolio finding with a headline table. Places the technique alongside the adaptive encoding-selection heuristic in §5 as a complementary strategy when pattern recognition is not reliable. Updates N4 research plan to 'Executed' status with the headline statistics and the verdict (Outcome N: portfolio-like). Raw data: data/multi-encoding-results.tsv (216 measurements). Paper 1: 22 pages, builds clean. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Applied ~17-20 word-form replacements per paper via a conservative
Python script (tmp/us-to-uk.py):
- -ize -> -ise family: normalize, characterize, mechanize,
generalize, optimize, recognize, minimize, maximize, formalize,
factorize, organize, emphasize, utilize, hypothesize, specialize,
initialize, standardize, realize, summarize.
- -ization -> -isation (where applicable).
- -yze -> -yse: analyze, catalyze.
- -or -> -our: behavior, favor (present in 1 location).
- -el -> -ell doubling: modeled -> modelled, labeled -> labelled
(not present in our text but included for safety).
- US catalog -> UK catalogue (1 location, already British from
earlier edits).
LaTeX environment and command contexts preserved (negative
lookbehind for backslash ensures e.g. \color{...} in todo macros
and \begin{itemize} are untouched). Technical CS terms kept:
'program' (software sense) stays in 'program verification';
'size' stays as is.
Also corrected the three 'Mechanized' (capitalised) missed by
the initial pass.
Paper 1: 22 pages, builds clean. Paper 2: 20 pages, builds clean.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors the senior-reviewer fix commit d4f06dc from multiplier-encodings.git. Same substantive changes: - Harmonise BVE story in §5.1 (three-regime breakdown). - Fix broken table/section references (Table 4 -> Table 10; Section 6 below -> Table tab:best). - Fix DRAT size for strength_chain_16 (10 MB -> 3.3 MB). - Remove companion-algebraic citations (Paper 2 not in submission state yet; do not cite). - Clarify multi-encoding T/O accounting. - Soften 'Proof-guided methodology' -> 'Proof-guided encoding design' with explicit scope. - Fold §6.5 Multi-Encoding into §5.2 Multi-Encoding as Hybrid Selection. - Remove stale 1728-configuration sweep. - Comment out §4 'Development effort' paragraph. - Add benchmark-pool intro to §7; add missing labels. Paper 1: 22 pages, builds clean. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors commit from multiplier-encodings.git. Body reduced from 19 pages to 16 pages by moving supplementary tables to the appendix and compressing prose in §3 (N2 variants and observations), §4 (automation paragraph), §6.1 (Booth), §7.10 (SMT-COMP), and §8 (Negative Results). No claims removed. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Same four fixes as in multiplier-encodings.git e775198: - Figure 1 Pass 1/Pass 2 labels moved below their arrows (were above, colliding with 'Carry bits:' and 'Result:'). - Adaptive-heuristic ablation table moved to a new appendix, with a compact summary retained in the body. - Conclusion + future-work condensed: single summary paragraph + single open-directions sentence, down from ~45 lines to ~25. (No duplicate subsection heading in this repo's version, so that fix is not mirrored.) Main body now ends on page 16-17 (21 pages total). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Same six fixes as multiplier-encodings commit 7xxx: 1. Reworded three Martin Brain 'private communication' references to 'earlier unpublished work by one of the authors' (Martin is now a co-author). 2. Redesigned Figure 1 with header-style section labels above each block group, inline 'sums'/'carries' labels, and Pass labels properly spaced above arrows. 3. Folded the one-sentence DRAT lead-in into the ablation paragraph. 4. Removed 'A methodological caveat: encoding flags must reach the solver' paragraph (internal engineering observation, no scientific value to this audience). 5. Merged three-paragraph §7 opening into a single paragraph. 6. Tightened paragraphs with 1-2 word orphan last lines. Main body ends on page 16; total 21 pages. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors multiplier-encodings.git commit: adds Wallace as an eighth encoding to the comparison. Key finding (Wallace does NOT track Dadda, differs by up to 14x on commutativity) applied to: - Contribution bullet (six- -> eight-encoding) - Section 2 Background (Wallace alongside Dadda) - Figure 2 scaling plot (seven encodings including Wallace) - Appendix D tab:proof-sizes (eight encodings including Wallace) Same data: - Wallace proof sizes on comm_11/13 are 94 MB / T/O (vs Dadda 6.5 MB / 98.6 MB) - dramatically worse - Wallace scaling times 18.6x slower than Dadda at BW=11 - Wallace tracks Dadda on non-commutativity benchmarks (mul_ineq, overflow, strength reduction: within 2x) Paper still 21 pages; body ends page 16. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors multiplier-encodings.git commit to Paper 1 plus related Paper 2 update. Paper 1 (doc/paper-bitblasting/paper.tex): Same 10 review items (M1-M10) plus cut/compress/polish as the multiplier-encodings commit. Body reduced from 19 to 14 pages. Paper 2 (doc/paper-algebraic/paper.tex): - Replace the short 'two classes' taxonomy (universal/equational vs existential) with the full 4-class taxonomy (universal vs existential, crossed with equational vs relational) that moved here from Paper 1. The full taxonomy is a more natural fit for Paper 2 since the algebraic techniques cover the universal equational quadrant specifically. - Remove duplicate 'bit-blasting companion paper' citation that now only needs to appear once. New SMT-COMP isolation data files: - doc/paper-bitblasting/data/smt-comp-combacs-nosimp.tsv (66 benchmarks, smt2_solver, combacs encoding, no simplify/algebraic) - doc/paper-bitblasting/data/smt-comp-shiftadd-nosimp.tsv (same, shift-add encoding) Both configurations solve 41/66 with identical PAR-2 (52.1 s); per-benchmark ratio on both-solved between 0.83x and 1.15x, confirming the encoding alone is neutral on a broad community sample. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors multiplier-encodings.git author-review tweaks 3-5 (the §2
grammar fix did not apply here because cbmc-github had different
phrasing already) and applies the same M1 treatment to the intro
that was applied to Paper 1 in the previous commit on the other
repo.
- Remove the shorter 2-class taxonomy from cbmc-github intro to
match multiplier-encodings (full 4-class taxonomy now lives in
Paper 2 intro; a single-sentence segue remains in Paper 1 §10).
- 'An influential line of work' -> 'Beame and Liew'.
- Fix double-hyphen ligature in \texttt{--refine-arithmetic}.
- Spell out Brent-Kung [BK] at first use.
Both papers verified: --refine-arithmetic renders as two hyphens
in the output, Brent-Kung is spelled out in §8.2 (not only in
Appendix A).
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors multiplier-encodings.git: adds Burch 1991 (early BDD-based multiplier verification), Kumar et al. 2023 (BDD with partial-product variables, O(n^3) size up to 1024 bits), and Dutertre 2020 (empirical evaluation of 16 SAT solvers on QF_BV CNFs finding CaDiCaL superior). Same integration points: Burch and Kumar in §9 'Algebraic verification'; Dutertre in §7.3 'Cross-Solver Analysis'. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Same fix as multiplier-encodings.git: restore the five per-row mechanistic explanations of which encoding wins on which pattern, plus the adaptive-selection consequence. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Captures the research-plan status (N1/N2/N3/N4), paper status, and recommended prioritisation after the senior-review passes and the Wallace integration. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…ommutativity
Implements the first prototype for research plan N3 (Beame-Liew
critical-strip construction).
What works:
- bench-multiplication/n3-beame-liew/generate_array_mul_comm.py:
self-contained DIMACS generator for a*b = b*a on an n-bit array
multiplier. Validated UNSAT at n = 1..11 by CaDiCaL and minisat.
- bench-multiplication/n3-beame-liew/beame_liew_phase1_v2.py:
emits a DRAT refutation by
(i) 2^(2n) leaf 'branch-cut' RUP lemmas (one per (a, b)
assignment),
(ii) a binary-resolution tree over the 2n input variables,
(iii) the empty clause.
The proof validates with drat-trim at every tested n (1..11).
- On the same CNFs, the prototype's DRAT proof is 4-10x smaller
than CaDiCaL's DRAT output at n <= 8 and remains smaller at
n = 9..10 (where CaDiCaL's baseline is itself incomplete at
the 60s timeout).
n BL DRAT CaDiCaL DRAT ratio
2 307 5,002 0.061x
4 9,987 41,560 0.240x
6 266,229 994,916 0.267x
8 6.1 MB 26.4 MB 0.230x
10 127 MB 213 MB 0.596x
What this is NOT:
This is not the polynomial-size O(n^6 log n) critical-strip
refutation from Beame & Liew 2019. The prototype is O(2^(2n)
* poly(n)), dominated by the flat enumeration. Phase 2-3 of
the N3 plan would replace the 2^(2n) leaves with a read-once
branching program of size poly(n) exploiting the
Delta = log(2n) strip decomposition.
Even so, the prototype establishes that:
(i) the Phase 1 infrastructure works and drat-trim accepts
the output;
(ii) a structured proof (enumerate first, resolve second) can
already beat CDCL's proof size at small n by a constant
factor, suggesting proof structure itself is a useful
piece of guidance.
Also updates research-plans/N3-beame-liew-implementation.md with
a Status section summarising the above.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Extends the N3 work from the previous Phase 1 commit with a structural Phase 2 prototype, a BDD sanity check, and an honest writeup of why the polynomial Beame-Liew bound remains out of reach within the time budget of one session. New artefacts under bench-multiplication/n3-beame-liew/: - generate_array_mul_comm_meta.py: CNF generator that tags every variable with its role (input bit, partial product, accumulator, carry, output, diff). Required for any strip-level analysis. - beame_liew_phase2_v2.py: produces a column-structured DRAT refutation. For each output bit k in [0, 2n) the generator emits a Phase-1-style resolution tree terminating in the unit -diff[k], then UP on the diff-OR clause gives the empty clause. Validates with drat-trim at n = 2..7. The proof is strictly LARGER than Phase 1's flat enumeration by ~2n (empirically 5.6x at n=2 -> 15.8x at n=7), confirming that per-column decomposition alone cannot beat flat enumeration: each column sub-proof has a different suffix literal and cannot share the input-enumeration tree. - phase2_bdd_probe.py: builds BDDs for c[k] and d[k] using the dd Python package and confirms structural equality as a sanity check that the array multiplier commutativity CNF is correctly encoded. Reports per-bit BDD DAG sizes. Empirical BDD sizes match Bryant 1991: peak BDD DAG size 462 at n=6, 4,419 at n=8, 41,148 at n=10. A BDD-based DRAT translation cannot yield a polynomial proof either. Key technical findings, documented in the README and in the N3 plan's Status section: 1. Unit propagation alone cannot derive -diff[k] from partial-input strip refutations (UP does not resolve biconditionals). 2. Per-column DRAT lemmas cannot share the 4^n-leaf input enumeration because their suffix literals differ. 3. BDDs of multiplier outputs are exponential (Bryant 1991). 4. The polynomial Beame-Liew bound requires extension variables in the DRAT proof (the branching program branches on internal tableau variables and encodes shared structure via RAT steps, not RUP). Phase 3 work -- branching program construction + RAT translation via Krajicek's Prop. 2.6 -- remains open. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Two quick follow-up investigations after the Phase 2 prototype:
Option B (trimmed-core comparison):
- Extracted CaDiCaL's DRAT proof core using drat-trim -l
- Compared against Phase 1's proof size
- Finding: even after trimming, CaDiCaL's core is 2.5-4x larger
than Phase 1 at n >= 5 (0.31x-0.40x ratio, stable across n)
- Conclusion: Phase 1's advantage is structural (ordered
enumeration + explicit resolution tree), not a CDCL-noise
artefact that trimming could fix.
- Raw data in data-options-ba.tsv; writeup in options-B-A.md.
Option A (RAT extension-variable test):
- Wrote rat_test.py: 4-clause UNSAT with a RAT-introduced
extension variable x3 := x1 AND x2 used in deriving empty.
- drat-trim reports s VERIFIED for the proof.
- Confirms drat-trim accepts RAT-format clauses introducing
extension variables, so the toolchain supports the proof
format needed for a future polynomial Phase 3 construction.
- (The trimmed report shows '0 RAT lemmas in core' because RUP
derivations succeed for these particular lemmas on a small
formula; RUP is a special case of RAT, so this is expected
drat-trim behaviour.)
Together these clarify (a) that Phase 1's empirical win is robust
to trimming, and (b) that the proof format permits the
extension-variable encoding that a genuine polynomial Phase 3
proof would need.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Downloaded the 29-page arxiv v2 of Beame-Liew 'Towards Verifying Nonlinear Integer Arithmetic' (1705.04302v2) which contains the full critical-strip construction (§3, §4) referenced in the N3 research plan. Needed for Phase 3 implementation. Key construction details (§3.3): - The refutation is a read-once branching program (Proposition 2.1 translates BP to ordered resolution directly). - For each k in [0, 2n], the BP branches on the 'first-disagreeing' output bit being bit k. - Each strip phi_Strip(k) is the sub-formula with tableau variables in columns [k-Delta, k] where Delta = log n. - The per-strip BP branches on inputs row-by-row, merges on 'Cut(j)' states of size O(log k). - Size bound: O(k^5 log k) per strip (Corollary 3.3), O(k^6 log k) total. No RAT / extension variables are needed -- the BP branches on existing CNF variables, and merging at Cut(j) nodes gives the polynomial size. Option A's RAT-readiness is therefore not needed for the actual construction. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Implements extraction of phi_Strip(k) from the full commutativity CNF per Beame-Liew 2017 §3.3 Definition. A clause is in the strip iff it references a column-indexed variable (pp, acc, cry, or diff) whose column is in [k-Delta, k]. Plus: - tableau symmetry clauses pp_c[i][j] = pp_d[j][i] from the preprocessing step in the paper; - ZERO-constant unit clauses (without which the carry-chain starts from an unconstrained value and the strip becomes trivially SAT). Validation: at n=4, 6, 8 for k=1..2n-1, the extracted phi_Strip(k) is UNSAT (verified by CaDiCaL). This matches Beame-Liew's Lemma 3.1. Note: early attempts omitted (a) non-tableau carry/acc clauses and (b) ZERO-constant units, producing SAT strips. Both were bugs in the extraction, not in the paper. Step 1 complete. Step 2 (per-strip BP construction with Cut(j) merging) is the next milestone. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…nential Built BDDs for all strip variables using the dd package to measure the BP size we'd get if the BP branched in input-variable order. Results confirm that naive BDD-based BPs are exponential for middle-k strips, matching Bryant 1991: n=4, k=5: peak BDD 55 n=6, k=7: peak BDD 462 n=8, k=9: peak BDD 3,537 n=8, k=11: peak BDD 4,419 The Beame-Liew O(n^6 log n) polynomial bound relies on a SPECIFIC variable ordering (outputs o^yx first, then tableau row-by-row, with merging at Cut(j) states). Our input-variable ordering does not achieve polynomial size. Files: - phase3_bdd_size.py: measures BDD sizes per strip variable. - options-B-A.md updated with the strip BDD-size table and an honest explanation of why Step 2 requires the paper's exact variable ordering (not yet implemented). The polynomial construction remains the open Phase 3 work. phase3_bp.py was removed (it contained scaffolding for the wrong approach; will rebuild when implementing the paper's BP order). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Rather than hand-implementing Beame-Liew's exact BP construction (which requires weeks of careful work on the Cut(j) merging structure), prototype the strip decomposition by: 1. Extract phi_Strip(k) for each k in [1, 2n-1]. 2. Run CaDiCaL on each strip to get a DRAT refutation. 3. Compose via top-level case split on k. Empirical result (bench-multiplication/n3-beame-liew/data-phase3-perstrip.tsv): n PerStrip Phase 1 CaDiCaL full PS/P1 PS/CaD 4 137 KB 10 KB 42 KB 13.68x 3.29x 6 1.2 MB 266 KB 995 KB 4.58x 1.23x 7 4.6 MB 1.3 MB 5.2 MB 3.56x 0.88x 8 22 MB 6.1 MB 26 MB 3.55x 0.82x Per-strip CaDiCaL beats full-CNF CaDiCaL at n >= 7 (0.82x at n=8, ratio improves with n). So structurally decomposing the proof into strips helps a SAT solver, but per-strip is still strictly worse than Phase 1's ordered case-analysis enumeration (3.55x larger at n=7, 8). Obtaining Beame-Liew's O(n^6 log n) polynomial would require hand-implementing the BP with the paper's specific variable ordering and Cut(j) merging. That is documented as future work in options-B-A.md; not implemented in this session. Summary update in options-B-A.md enumerates what's committed across Phase 1/2/3 and what remains. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…nance
Following Armin's review (widen corpus before revisiting
remaining work), acquired the full SMT-LIB 2024 QF_BV release
(Zenodo 10.5281/zenodo.11061097): 46,191 benchmarks, of which
15,435 contain bvmul and are declared unsat.
Headline finding: triaging the full corpus surfaced the 'float'
family (FP-as-BV, Haller-Griggio-Brain-Kroening FMCAD 2012),
absent from our 66-benchmark sample, where our algebraic solver
DOMINATES due to wide-multiplier structure (53x53->106-bit
significand products):
On 75 float bvmul-unsat benchmarks @ 60s (low-contention):
ours 68 solved
Bitwuzla 51 solved
cvc5 6 solved
18 CONFIRMED unique wins (standalone: neither Bitwuzla nor
cvc5 solves within 60s; all declared unsat; our verdict
agrees with Bitwuzla everywhere both solve -> sound).
This adds 18 to the paper's previous 4 wins-beyond-all-solvers.
Methodological note recorded in RESULTS.md: a first pass at
36-way parallelism (= core count) manufactured spurious
timeouts via contention (newton.3.3.i 'won' at 36-way but
Bitwuzla solves it standalone in 31s). All win claims
re-confirmed at 1-6 way parallelism. Evaluation should run at
<= cores/4 way.
Committed (repo kept lean; full 35GB corpus stays in staging):
bench-multiplication/float-fp2bv/
pow5.smt2, mul_03_30_4.smt2 -- two smallest clean wins
MANIFEST.tsv -- sha256+size for all 75 (re-fetch)
results-{ours,bitwuzla,cvc5}.tsv -- full per-benchmark evidence
run-any.sh, run-corpus.sh -- parallel evaluation harness
RESULTS.md -- writeup + reproduction steps
Also updated doc/paper-algebraic/benchmark-sources.md with the
acquisition status and the AIG->SMT-LIB converter blocker for
the Tier B Konrad gate-level archives.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The float-family discovery reshapes the priority list:
- Add Item 11 'Exploit the float (FP-as-BV) family' as NEW HIGH
PRIORITY. 18 unique wins-beyond-all-solvers on a canonical
third-party benchmark family (quadruples headline wins 4->22).
Four sub-tasks: adopt into eval corpus, diagnose 7 misses,
write into paper, probe rest of FP. Sub-tasks 1+3 need no new
code (data already collected).
- Rewrite Recommended sequencing into four tiers:
Tier 1 (now): Item 11.1+11.3 -- adopt float + write paper
result. ~2.5 days, biggest headline gain, zero new code.
Tier 2: Item 11.2 (diagnose misses) + Item 10 Phase 1
(fragment characterisation) + Item 3 (Sage2 scalability,
now better quantified: 24 unique fails in stratified run).
Tier 3: Item 10 Phase 2 gate extensions (SBIF, vanishing
removal), prioritised by the miss diagnosis.
Tier 4: Item 7 gate-level (Konrad AIGs downloaded, awaiting
an AIG->SMT-LIB converter).
- Update top-of-file status with the +18 float wins.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
A co-author asked whether FP operations actually traverse our
algebraic path (we use float_utilst, not float_bvt). Checking
this disproved the previous commit's float-family claim:
1. The float benchmarks are PURE QF_BV -- no Float sort, no fp.*
operators (the fp.iN tokens are just variable names declared
(_ BitVec N)). So no FP lowering runs; float_utilst/float_bvt
is irrelevant here.
2. Ablation (DISABLE_ALGEBRAIC=1) shows the algebraic path is
NOT responsible for the float 'wins' -- and is net overhead:
pow5 1.28s ON == 1.28s OFF
test_v7_..s703 31.2s ON -> 15.3s OFF (FASTER without algebra)
newton.3.3 8.3s ON -> 6.7s OFF
All still solve unsat with algebra OFF via bit-blasting +
MiniSAT (pow5: 171 SAT conflicts). The wins over Bitwuzla/cvc5
are a SAT-backend artifact, not evidence for the algebraic
method.
Corrections:
- RESULTS.md: prominent CORRECTION block + retraction note on
the headline. Data retained for the record.
- benchmark-sources.md: headline corrected.
- remaining-work.md: top status corrected (count stays 4, not
22); Item 11 marked RETRACTED with the ablation evidence.
New Item 12 (HIGH PRIORITY, Tier 0): ablation re-audit of ALL
claimed algebraic wins. The float episode showed we lacked
ablation discipline -- a 'win' only supports the thesis if it
survives DISABLE_ALGEBRAIC=1. Demonstrated the audit
distinguishes genuine from artifact:
SABER n16-bw13: 0.02s ON vs 60s TIMEOUT OFF -> GENUINE
assoc_64: 0.01s ON vs 0.01s OFF -> ARTIFACT
Recommended-sequencing reordered: Item 12 is now Tier 0,
gating credibility of all headline numbers before submission.
The corpus widening itself (full QF_BV acquired + triaged,
harness built) remains valid; only the float attribution is
retracted.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…S BUGS
Ran every paper claim algebra-ON vs DISABLE_ALGEBRAIC=1, single-
threaded, cross-checked against z3/Bitwuzla/declared-status.
GENUINE (survive ablation, sound):
- 4 Brain wins-beyond-all-solvers: NEEDS_ALGEBRA, sound
- 5 SMT-COMP unlocks (cohencu_0..3, geo3.c_5): NEEDS_ALGEBRA
- 41/66 = 31 bit-blast + 10 algebra (restate in paper as 10/31)
- SABER: 26/31 NEEDS_ALGEBRA + 1 ALGEBRA_FASTER (cleanest)
- DSP mac_equiv: ALGEBRA_FASTER (0.14s vs 45.8s)
No unsoundness in the UNSAT direction.
CRITICAL FINDING - 2 soundness bugs (algebra says SUCCESSFUL,
bit-blast+z3 say FAILED; the dangerous direction):
- assoc.c (BW=9): assertion operands cast to signedbv[32] (C
integer promotion); extractor reasons in 9-bit ring, ignores
the cast, proves a false-at-32-bit identity.
- mul_zero_factor.c (BW=8): {a*b=0, a!=0, b!=0} is SAT over
Z/2^8 (zero divisors: 16*16=0) but algebra refutes it
(assumes integral-domain/field; no zero divisors).
div_roundtrip = ARTIFACT (bit-blast also fast).
Added Item 13 (CRITICAL, blocks submission) with fix directions
for both bugs + verification plan + Lean reconciliation note.
Item 13 is now Tier 0 ahead of everything. Custom-suite
'39/39 validates correctness' claim must be withdrawn.
Artifacts: doc/paper-algebraic/ablation-audit-2026-06-01.md;
bench-multiplication/ablation-audit/{smtcomp66,brain-wins,saber,
custom-c}.tsv + ablation{,-cbmc}.sh harness.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The Item 12 ablation audit found two cases where the algebraic
pre-solver reports VERIFICATION SUCCESSFUL when the correct answer
(per bit-blasting and z3) is FAILED -- claiming no bug where a bug
exists. Both are fixed by conservative guards in
boolbvt::try_algebraic_solve that fall back to bit-blasting on the
unsound patterns; genuine algebraic wins are unaffected.
Bug A -- widening cast of a defined intermediate (assoc.c).
C integer promotion encodes `(a*b)*c == a*(b*c)` on 9-bit operands
as cast(ab,signedbv[32])*cast(c,signedbv[32]) == ..., where ab is
the SSA symbol defined by `ab = a*b` truncated to 9 bits. The
single-ring extractor pinned the ring at 9 bits (from the defining
equations) and silently absorbed the 32-bit cast, treating ab as
the exact 9-bit product inside a 32-bit multiply -- so it "proved"
the (9-bit-only) associativity identity. Fix: drop any
(dis)equality in which a widening typecast of a defined-intermediate
symbol is a DIRECT operand of arithmetic (* + - unary-). This is
precise: widening a primary INPUT (mul_overflow's cast(a,32)*cast(b,32))
is a faithful zero/sign extension and is kept; a widen-then-narrow
(matrix_mul's cast(cast(c00,32),8)) is not an arithmetic operand and
is kept; SMT-LIB benchmarks carry no ID_typecast and are untouched.
Bug B -- zero divisors over ZMod(2^d) (mul_zero_factor.c).
A disequality x!=0 is refuted via the Rabinowitsch trick x*e-1=0,
which asserts x is a UNIT. Over a field that equals x!=0, but over
ZMod(2^d) a non-zero element need not be invertible, so
{a*b=0, a!=0, b!=0} (SAT via 16*16=0 mod 256) was wrongly refuted.
Fix: detect the zero-divisor shape -- a product (directly, or via an
SSA symbol definition) constrained to 0 whose two factors are each
separately constrained non-zero -- and skip algebraic solving.
Genuinely-unsat refutations (e.g. cohencu) carry no such pattern.
Verification (ablation re-audit, algebra ON vs DISABLE_ALGEBRAIC=1):
- assoc.c (BW=9) and mul_zero_factor.c (BW=8): now FAILED, matching
bit-blast + z3.
- No regression: SMT-COMP 66 identical (31 ARTIFACT / 25 BOTH_TO /
10 NEEDS_ALGEBRA, all 10 unlocks preserved); SABER identical
(26 NEEDS_ALGEBRA + 1 ALGEBRA_FASTER); 4 Brain wins still
NEEDS_ALGEBRA; custom C wins (mul_overflow, comm_64, mac_equiv,
matrix_mul, hash_mul, distrib, mul_square) still SUCCESSFUL and
fast (algebra preserved).
- groebner unit tests 11/11 pass; smt2_solver regression passes.
Added regression tests (expected FAILED) so neither bug can
silently return:
regression/cbmc/algebraic-soundness-assoc/
regression/cbmc/algebraic-soundness-zero-divisor/
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
… 14) After the two targeted Item 13 fixes (assoc, mul_zero) were committed, a broad scan ran the solver (algebra ON) on 3,677 SAT bvmul benchmarks from the full QF_BV corpus and flagged any returned `unsat` (the dangerous direction). Result: 76 wrong `unsat`. Splitting by DISABLE_ALGEBRAIC=1: - 41 ALGEBRA-CAUSED (Sage2 x40, sage x1): genuine unsoundness in our algebraic pre-solver, NOT fixed by the two targeted guards. - 35 PRE-EXISTING (all float): wrong `unsat` even with algebra OFF -> a pre-existing CBMC bit-blaster/SMT2 front-end issue, outside the algebraic pre-solver; flagged separately. Root cause is systemic: the Rabinowitsch unit-trick (diff != 0 |-> diff*e-1 = 0, "diff is a unit") is unsound over ZMod(2^d) since non-zero != unit, so UNSAT_rabinowitsch does not imply UNSAT_original. Used at >=3 sites (__rabinowitsch, __rab, __rab_disj), plus a second distinct mechanism (is_zero()/gb-on- equalities) accounts for 4 residual that stay wrong even with all Rabinowitsch sites gated. Feasibility (experimental DISABLE_RABINOWITSCH gate, reverted): gating the 3 Rabinowitsch sites fixes 37/41; SABER + 4 Brain wins survive (sound vanishing path); but cohencu_0..3 + geo3.c_5 regress to timeout (they depend on Rabinowitsch); 4 residual remain. So a naive sound-only mode regains soundness at the cost of ~5 SMT-COMP unlocks and still needs the second mechanism fixed. Docs: - ablation-audit-2026-06-01.md: ADDENDUM with the full scan, root-cause analysis, feasibility data, design options. - remaining-work.md: Item 13 -> PARTIAL (down payment); new Item 14 (CRITICAL, blocks submission) with the systemic analysis, three design options (sound-only ideal-membership; unit-aware Rabinowitsch; Lean reconciliation), and Tier-0 re-prioritisation. - bench-multiplication/ablation-audit/sat-scan-*.txt: the 41 algebra-unsound and 35 pre-existing benchmark lists. The two committed targeted fixes remain correct and tested but are only a down payment. The algebraic pre-solver is unsound by default over ZMod(2^d) on ~41 real benchmarks; the paper's soundness claim cannot stand until Item 14 is resolved. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Answers the concern: we have a Lean soundness development yet the audit found unsoundness. The proofs are correct; both bugs live OUTSIDE the verified boundary at the encoding seam: - Bug A: the Lean model BVExpr is single-width; cast is same-width identity; the model cannot represent a width-changing cast. The C++ documented this (ASSUMES: casts do not change outer width) but the 'MAINTAINED BY' predicate (set_bitwidth ) did the opposite -- it absorbed wider casts into the narrow ring. Documented assumption, not enforced. - Bug B: the Rabinowitsch disequality encoding (diff!=0 |-> diff*e-1=0) has NO // PROOF: annotation, NO TRACEABILITY row, and is NOT in the BVExpr model (no disequality constructor / no e-variable). It is false over ZMod(2^d) (asserts diff is a unit). GroebnerSoundness.lean even contains two_not_isUnit but filed it under 'incompleteness', never stating the soundness bridge that would have required an IsUnit hypothesis. Why traceability missed it: the bi-directional audit checked that every // PROOF: ref resolves and every row maps to code -- but only over the ANNOTATED set. An unannotated soundness-critical step is invisible to both directions. Item 14 fix will also close the gap: (1) mechanise the disequality-encoding soundness obligation (which fails to prove the unconditional Rabinowitsch step -> forces ideal-membership / IsUnit side-condition); (2) enforce the single-width assumption in code; (3) add a coverage check so every soundness-critical step must be annotated (fail CI otherwise). doc/paper-algebraic/proof-gap-analysis-2026-06-01.md Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…tsch)
The algebraic pre-solver wrongly reported unsat on 41 real SMT-LIB
SAT benchmarks (broad-scan finding behind Item 13). Root cause: the
Rabinowitsch unit-trick is unsound over ZMod(2^d). Encoding diff != 0
as diff*e - 1 = 0 asserts diff is a UNIT; over ZMod(2^d) a non-zero
element need not be invertible (e.g. 2 mod 4), so UNSAT of the
Rabinowitsch system does NOT imply UNSAT of the original disequality
query. Diagnosis showed 38/41 via the three Rabinowitsch sites and a
further 3 (Sage2/bench_{12880,13209,5552}) via a tree-walk leaf
equality whose bit-decomposition operators (bitand/lshr) overconstrain
the equality system F.
Fix (src/solvers/flattening/boolbv.cpp::try_algebraic_solve):
- Remove the Rabinowitsch unit-trick at all three sites (the
__rabinowitsch push to the combined system, the single __rab
per-disequality pass, and the disjunctive __rab_disj pass).
- Promote the sound ideal-membership refutation to default-on: after
Buchberger, reduce each diff = lhs-rhs modulo the Groebner basis of
the equalities F; remainder 0 => diff in <F> => diff = 0 on every
common zero of F => the disequality is UNSAT. Sound over any
commutative ring (no unit assumption). The per-disequality vanishing
test is retained (also sound).
- Add a contains_bit_decomp filter so tree-walk leaf equalities
containing bitand/bitor/bitxor/lshr/ashr/bitnot are not fed into F
(sound: underconstraining F never turns SAT into UNSAT).
- Annotate every UNSAT-conclusion site with a // PROOF: reference.
Proof gap closed:
- formal-proofs/DisequalityRefutation.lean (zero sorry):
diseq_refutation_sound / diseq_refutation_unsat (ideal membership =>
sound refutation, any commutative ring) and
rabinowitsch_unsound_over_zmod (concrete Z_4 witness: 2 != 0 yet no
e with 2e = 1), plus ideal_membership_declines_witness.
- scripts/check_proof_traceability.py: fails if any // PROOF:
reference does not resolve to a real Lean theorem, or if any UNSAT
conclusion in try_algebraic_solve lacks a // PROOF: annotation. On
first run it caught four pre-existing DRIFTED references (renamed
theorems / wrong file) that the manual audit had missed; those are
corrected here (smt2_parser, poly_ring, poly_extract, boolbv).
- TRACEABILITY.md rows added.
Verification:
- Broad SAT scan (3,677 SAT bvmul benchmarks): 0 non-float wrong-unsat
(was 41). The 35 float wrong-unsat are pre-existing (reproduce with
DISABLE_ALGEBRAIC=1) -- separate CBMC bit-blaster/SMT2 issue.
- SABER 4/4 and Brain 4/4 preserved; cohencu_0/1 preserved via the
sound ideal-membership path (Bitwuzla confirms unsat).
- Lost: cohencu_2, cohencu_3, geo3.c_5 (3 SMT-COMP unlocks) -- the
accepted soundness cost. SMT-COMP NEEDS_ALGEBRA 10 -> 6.
- groebner unit tests 11/11; assoc/mul_zero still FAILED; new
regression test regression/smt2_solver/algebraic-rabinowitsch-
soundness passes; coverage check passes; build clean, no warnings.
Outstanding (Item 14b, docs): the paper still presents and claims
soundness for the Rabinowitsch trick (paper.tex Theorem proof sketch,
contributions table, disequality prose) and must be rewritten.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…er rewrite Traced where the (now-removed) Rabinowitsch trick came from, to inform the Item 14b paper rewrite. It was not invented here: it is the classical Nullstellensatz device (f != 0 |-> exists y. y*f = 1), adopted directly from the explicit future-work suggestion of the paper we build on, Song et al. 2024 (Equational Bit-Vector Solving via Strong Groebner Bases), Section 7: "extend our methods to more bit-vector arithmetics, such as bitwise operations and inequalities by, e.g., the Rabinowitsch trick [19, Ch. 4.2, Prop. 8], [29]." Both references they cite are FIELD settings where the trick is sound: [19] Cox-Little-O'Shea, Ideals, Varieties, and Algorithms, Ch. 4 (Nullstellensatz over an algebraically closed field); [29] Hader-Kaufmann-Kovacs, SMT solving over finite field arithmetic. Consequence for the paper: this is a scholarly CORRECTION, not just a disclaimer. We are the first to implement that suggested extension in the finite-RING setting ZMod(2^d), and we show it is unsound there (ZMod(2^d) is not a field; the trick encodes "f is a unit", and a nonzero non-unit such as 2 in Z_4 breaks it -- see DisequalityRefutation.lean::rabinowitsch_unsound_over_zmod). The cited field results remain correct; we caution against importing them to finite rings, and use the sound ideal-membership / vanishing refutation that the SCA multiplier-verification literature uses. - Added references: cox2015ideals, hader2023finitefield. - remaining-work.md Item 14b: provenance + recommended framing. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…e width bug
Replaces the incomplete ideal-membership-only disequality handling with
Song et al.'s ring-aware encoding z*(a-b) - 2^{d-1} = 0, which is
equisatisfiable with a != b over ZMod(2^d) (Song et al. Prop. 5/6) and
hence sound AND complete for the refutation (UNSAT) direction. Added at
the single- and disjunctive-disequality sites in try_algebraic_solve;
the vanishing test remains as a fast sound pre-check.
Root-caused and fixed a PRE-EXISTING soundness bug exposed by the
encoding (and also latent in the previous ideal-membership path):
extract_predicate could emit the spurious constant `1` (i.e. 1 = 0) for
a width-inconsistent signed comparison. A signed `0 <s x` is
canonicalised to `x >=s 1`, sign-XOR-transformed into a lower bound
whose constant (2^31+1) was computed at width 32 while the ring width
was d=16; 2^31+1 >> 2^16 tripped the "bound impossible" branch,
poisoning F so the main Groebner check reported a spurious UNSAT.
Fix: a faithful d-bit constant always satisfies C < 2^d, so if
C >= 2^d the operands are width-inconsistent and we return nullopt
(defer to bit-blasting) instead of emitting an unsound constant. Never
fires on well-formed predicates (no completeness loss).
Completeness improvement: has_constant now refutes on ANY non-zero
constant, not only odd/unit ones. A constant evaluates to itself under
every assignment but every ideal element must evaluate to 0, so a
non-zero constant (even an even one) admits no solution. This is needed
for completeness of the 2^{d-1} encoding, whose refutation witness may
be even.
Lean (formal-proofs/DisequalityRefutation.lean, zero sorry):
- nonzero_constant_no_solution: any non-zero constant in the ideal =>
no solution (generalises the odd/unit case; backs has_constant).
- song_encoding_equisat: Song Prop. 5/6, u != 0 iff exists z, z*u =
2^{d-1}, with the forward direction proved by the 2-adic valuation
argument (u = 2^a * odd, a < d) reusing the odd=>unit fact.
Verification:
- Broad SAT scan (3,677 SAT bvmul benchmarks): 0 non-float wrong-unsat
with full Song enabled. The 35 float wrong-unsat are pre-existing
(reproduce with DISABLE_ALGEBRAIC=1), a separate bit-blaster issue.
- SABER, Brain, and cohencu_0/1 (Bitwuzla-confirmed unsat) preserved.
cohencu_2/3 + geo3.c_5 remain undecided (time out) -- they are
unverifiable (z3/cvc5/Bitwuzla all time out at 300s; declared
unknown) and their old "unsat" came from the unsound unit-trick, so
not claiming them is correct.
- groebner unit tests 11/11 (the even-constant case now correctly
UNSAT); cbmc + smt2_solver regression tests pass; coverage check
passes; build clean, no warnings.
Soundness is established; this is a sound UNSAT pre-solver (not a
standalone decision procedure) -- non-refuted queries still defer to
bit-blasting, which provides overall completeness for QF_BV.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
3-assert ddmin minimisation (from Sage2/bench_2155, declared sat) of the spurious-constant bug fixed in 2ff48a05bf, with a README explaining the root cause and fix. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…nowitsch framing)
Updates the algebraic paper to reflect the implemented soundness fix
(Item 14). The textbook field Rabinowitsch trick is unsound over
ZMod(2^d); we adopt Song et al.'s ring-aware z*(a-b) - 2^{d-1} = 0
encoding, which is equisatisfiable with a != b (Song Prop. 5/6).
- Replaced the false "Soundness" theorem/proof (which claimed the
unit-trick sound) with Theorem (Song Prop. 5/6): u != 0 iff exists
z, z*u = 2^{d-1}, plus the non-zero-constant refutation criterion,
citing the mechanised DisequalityRefutation.lean theorems.
- Added a remark documenting that the textbook Rabinowitsch trick is
field-only (sound over fields per cox2015ideals / hader2023finitefield
/ Ozdemir23cvc5; unsound over the finite ring ZMod(2^d)), and that a
prototype using it was unsound on real benchmarks.
- Contributions table + list, the Buchberger algorithm pseudocode
(now refutes on ANY non-zero constant, not only odd), the two worked
examples (commutativity, cohencu), the evaluation prose, the Lean
soundness-contributions item, and the conclusion all updated from
"Rabinowitsch" to Song's 2^{d-1} encoding.
- Related work: new "Scaling to large bitwidths" paragraph citing
Hofstadler et al. (multimodular homomorphic images) and the SCA
multiplier line (kaufmann2019, ritirc2017/2018), with an honest
limitation: multimodular CRT does not transfer to native ZMod(2^d)
reasoning (prime-power modulus; Boolean-gate model equivalence).
- Bib: added cox2015ideals, hader2023finitefield (earlier), and
kaufmann2019, ritirc2017, ritirc2018, hofstadler2026.
- Corrected the now-stale Lean theorem count (135 -> over 150) to
reflect the added DisequalityRefutation theorems.
Paper compiles cleanly (no undefined refs/citations).
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…w=16)
The cohencu-style-quadratic smt2_solver test expected UNSAT, but at
bw=32 that UNSAT was only reachable via the textbook Rabinowitsch
unit-trick removed in Item 14 (unsound over Z_{2^d}). With Song et
al.'s sound z*(a-b) - 2^{d-1} encoding the refutation no longer
completes within the regression budget at bw=32 (a completeness
limitation relative to the earlier unsound shortcut), so the test hung.
The query is genuinely UNSAT at any bitwidth (substituting the two
constraints makes (z*z+12) - (6z+12y) vanish identically). Reducing to
bw=16 keeps deterministic, oracle-verified (z3 + cvc5 both UNSAT)
coverage of the F4 tail-reduction / non-leading-term path while
remaining within budget. Comment updated to describe Song's encoding
and the bw=32 limitation.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…qual() Commit d06d50c678 added redundant clauses (eq[i] OR eq[i+1]) for 10-13 bit equality checks, on the premise that "if bit i differs, the adjacent bit must be equal". That premise is false: two adjacent bits can both differ (e.g. op0=0b00, op1=0b11 has eq[0]=eq[1]=false), so the clause over-constrains the SAT instance and makes satisfiable formulas containing 10-13 bit equality checks spuriously UNSAT. Found via a corpus soundness sweep: 98/98 declared-sat mcm/ benchmarks and a family of float/ benchmarks returned unsat where z3, cvc5 and upstream all return sat; git bisect identified d06d50c678 as the first bad commit. Removing the clause flips all of them back to sat. Regression test distilled (via ddsmt) from mcm/06. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The assert command defers all assertions into deferred_assertions, which are only encoded (solver.set_to_true) at solving time. check-sat flushes them, but check-sat-assuming never did, so the entire regular assertion stack was silently ignored under check-sat-assuming: e.g. (assert false) (check-sat-assuming ()) wrongly returned sat. Flush deferred_assertions in check-sat-assuming exactly as check-sat does, before adding the temporary assumptions. Covered by the existing regression test basic-bv1/check-sat-assuming1 (previously failing). Minimal reproducer distilled via ddsmt. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…nctions Deferred bit-blasting collects assertions and replays them in finish_eager_conversion() when the algebraic pre-solver does not refute. The replay happened after functions.finish_eager_conversion(), which generates the congruence axioms for uninterpreted-function applications from the set of applications seen so far. A term occurring only inside a deferred assertion (e.g. m(zero_extend(x))) was therefore registered too late to receive its congruence axioms, dropping constraints and yielding spurious sat. For example: (assert (= #x00000001 (m #x00000000))) (assert (= #x00000000 ((_ zero_extend 31) x))) ; forces x = 0 (assert (= #x00000000 (m ((_ zero_extend 31) x)))) is unsat (m(0) = 1 and, by congruence, m(zero_extend(x)) = m(0) = 0) but was reported sat. Move the replay before functions.finish_eager_conversion() (array axioms in SUB::finish_eager_conversion already run after the replay, so they were unaffected). The algebraic solve still runs first so a successful refutation still short-circuits the replay. This fixes the bug at the source; DISABLE_DEFER_BITBLAST=1 is no longer needed. Regression distilled via ddsmt from uninterpreted-functions/uf1. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…type
The Level-3 candidate-assignment hint in try_algebraic_solve rebuilt each
variable's symbol as unsignedbv_typet(bw), where bw is the ring bitwidth
of the Groebner extractor. For a variable whose real width differs from
bw this created a width-inconsistent entry in boolbv_map: the subsequent
normal conversion of an assertion mentioning that symbol then tripped the
get_literals invariant (literal_map.size() == width) and aborted.
This crashed cbmc (SIGABRT) on Initialization6 and havoc_slice/test_struct_*
whenever the algebraic layer was active.
The hint is a soft, gate-retractable guidance clause, so it is sound to
reconstruct the symbol with its real type (collected from the typed
equation expressions) and to skip any variable whose width does not match
the ring width. Now no width-inconsistent map entry is ever created.
Verified: Initialization6 and havoc_slice/test_struct_{a,c,d} run to
completion again with the algebraic layer enabled; algebraic-soundness
tests and the full smt2_solver suite still pass.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…ttribution Adds soundness-sweep-findings-2026-06-03.md documenting the corpus-scale differential sweep, the four bugs it found (with ddsmt-minimised reproducers and git-bisect root causes), and the fixes (8a6f7b2ef1, 9c937299ab, c1ceecbdd0, 7fab6c5d8f). Key correction: the float/mcm wrong-unsat results were previously described in ablation-audit-2026-06-01.md as "pre-existing CBMC bit-blaster" issues. A merge-base baseline build returns the correct sat, and git bisect attributes them to branch commit d06d50c678 (an unsound adjacent-equality SAT-encoding optimisation), now fixed. Added correction notes at both spots in the audit doc. The algebraic pre-solver itself remains sound: 0 wrong answers attributable to it across the full bvmul sat corpus (4,361) plus a 4,000-file cross-corpus sample. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…suite + per-tweak ablation Re-ran the evaluation with the now-sound binary (after the four soundness/robustness fixes) and refreshed paper.tex: - Random-polynomial: replaced the 210-benchmark random sample (not reproducible -- the harness used shuf) with the ENTIRE suite, all 3498 benchmarks across both seeds and eight categories. Solved counts: shift-add 1972, Comba/CS 1970, algebraic 2128, Bitwuzla 3021, cvc5 2678 (union 3104). Wins beyond all solvers: 59 (was 4 on the sample); over bit-blasting: 257 (was 18). Per-category table updated to full-suite denominators. Removed the stale per-benchmark timing table (and the 0.03-0.20 vs 0.3-3.7s discrepancy). Soundness: 0 sat-vs-unsat disagreements and 0 verdicts contradicting declared :status across all 3498 x 5 configurations. - SMT-COMP sample (66): refreshed to shift-add 37/66 (PAR-2 57.1) and algebraic 43/66 (PAR-2 44.8); 0 disagreements; still overtakes cvc5 and unlocks 8 benchmarks bit-blasting cannot decide. - Added a leave-one-out per-tweak ablation (Table tab:tweak-ablation) on a 26-benchmark cross-suite subset: Tseitin propagation (4 wienand), the vanishing test (3), and the boolean tree walk (geo3.c_5) are load-bearing for solvability; the finer tweaks (F4 interreduction, deferred bit-blasting, if-case elimination, nonzero fast-path, GB ordering) affect only running time. - SABER 31/31, custom 39/39, DSP 5/5 re-confirmed (0 disagreements). Raw results added under doc/paper-algebraic/data/. Paper compiles cleanly (46 pages, no undefined refs). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…tive, validation methodology)
Reconciles the paper with the post-fix code and resolves reviewer-flagged
inconsistencies:
- Stale numbers: intro and conclusion "4 wins" -> 59 (full-suite result);
conclusion "18 Lean modules" -> 20 (matches the status summary and the
actual module count).
- "No correctness regressions / 691 tests": corrected to the accurate CORE
count (1167), states zero soundness/correctness regressions with the
algebraic layers enabled, and notes the two remaining failures are
non-algebraic timeouts (reproduce with the layer disabled). Added a
validation-methodology paragraph (mechanised proof + corpus-scale
differential sweep vs z3/cvc5/baseline) and disclosed that the sweep
found and fixed integration bugs (the get_literals width abort).
- odd/unit-constant -> any non-zero constant: swept the prose and
Algorithm-1 narrative; reframed the StrongGB negative results
(naive_completeness_is_false, F={C 2}) as the motivation for
generalising has_constant from odd-only to any non-zero constant
(sound by nonzero_constant_no_solution), under which F={C 2} is now
decided UNSAT; residual incompleteness (function-vanishing, budget)
retained. Updated StrongGB.lean header to match.
- Defer.lean: documented the faithfulness ASSUMES (verdict depends on the
committed set, not replay order) and that a real bug (UF-congruence
replay ordering, found by differential testing, not by the proof)
violated it; the fix restores it. Added as the third
ASSUMES/MAINTAINED-BY story in the conclusion and proofs section.
- Theorem 1 (vanishing): made domain width n vs codomain width m explicit
and scoped the small-n case to the mixed-width remark.
- SABER N=256: reconciled 11.25 s (Table saber-scaling) vs 13.4 s
(Table saber-defer) as uninstrumented vs RSS-instrumented runs.
- Clarified the 150 theorems / 65 mapping entries / 37 source entities
denominators.
Paper compiles cleanly (47 pages, no undefined refs). Lean edits are
docstring-only.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
These are patch/merge backup files (*.orig) that were committed by accident; they are duplicates/leftovers of their live tracked counterparts (or have no counterpart) and should never have been tracked. Removing them; the live files are unaffected. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Prevents recurrence of the accidentally-committed build-cov/build-debug trees, *.gcov/*.gcno coverage output, and *.orig patch backups. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Updates REPRODUCE.md to match the refreshed paper: the full 3498-benchmark random-polynomial run (replacing the old 210 sample), the leave-one-out per-tweak ablation, the corpus differential soundness sweep, the refreshed SMT-COMP/SABER/custom numbers, the complete DISABLE_* ablation-variable table, and pointers to the new data/*.tsv files and the soundness-sweep findings doc. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…smoke vs z3) Wires two guards recommended in the paper: - proof-traceability lint: runs scripts/check_proof_traceability.py (no build) so every PROOF: reference resolves and every UNSAT conclusion in try_algebraic_solve stays annotated; - soundness smoke: builds smt2_solver and runs the algebraic regression queries (adjacent-equality, deferred-replay, rabinowitsch, cohencu-F4, check-sat-assuming), asserting each verdict matches the expected result AND agrees with z3 -- the wrong-verdict cross-check the standard output-matching regression harness does not provide. New script scripts/algebraic_soundness_smoke.sh (also runnable locally). Triggers on changes to the algebraic source, boolbv/bv_utils/smt2, the proofs, and the smt2_solver regressions. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The deferred-bit-blasting replay in finish_eager_conversion called SUB::set_to directly, bypassing the boolbv_set_equality_to_true optimisation that the eager (non-deferred) tail of set_to applies to equalities. For queries with large equalities and no algebraic content (e.g. union_large_array's 100k-element array union) this fully bit-blasted the equality on replay, producing a far harder SAT instance than the eager path and timing out (>300s vs <1s eager). git bisect identified the deferred-bit-blasting default-on commit (d6f3c405a4) as the first bad commit for this regression. Replay now inlines the post-deferral tail of set_to (the boolbv_set_equality_to_true fast path) so deferred replay reproduces the eager clause structure, not just the eager models. union_large_array returns to ~1s; algebraic regression, SABER, and the smt2_solver suite are unaffected. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…rray) as found+fixed Updates the validation paragraph: the union_large_array CORE timeout was root-caused (deferred replay bypassed the boolbv_set_equality_to_true eager optimisation, fully bit-blasting a 100k-element array-union equality; bisected to the deferral-default commit) and fixed by applying the equality optimisation on replay. Only a pre-existing thorough-tagged constant-array timeout remains (also times out on the baseline). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…2 45.8)
Re-measured the two cbmc rows of the SMT-COMP sample under the paper's
6GB memory limit (the prior 43/44.8 used 12GB): shift-add 37/66 (57.1),
algebraic 42/66 (45.8) -- one benchmark needs >6GB. Numbers confirmed
identical before and after the deferred-replay equality fix (no large
arrays in this sample). Caption now states the 6GB limit; raw data in
data/smtcomp-sample-{shiftadd,algebraic}-6gb.tsv.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…ding) - Combined-effect: replace the misleading 'unlock eight (37/66 -> 42/66)' with the accurate net: algebra decides 8 that shift-add cannot while shift-add decides 3 the algebraic config times out on (net +5). - Conclusion: 'a third bug' -> 'further integration bugs ... most tellingly the deferred-replay issue', consistent with the three differential-found fixes documented in Threats. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
76757cc to
00f468b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Depends-on: #8990
The experiments were performed using the following example:
See https://tinyurl.com/adder-comparison
(https://docs.google.com/spreadsheets/d/1ckrIwxe5QVUu12rs3OaY3o7CQ-UqFnEWvcEOf8cw580/) for results.