Skip to content

Experimenting with different adder implementations#7993

Draft
tautschnig wants to merge 432 commits into
diffblue:developfrom
tautschnig:features/adder
Draft

Experimenting with different adder implementations#7993
tautschnig wants to merge 432 commits into
diffblue:developfrom
tautschnig:features/adder

Conversation

@tautschnig

@tautschnig tautschnig commented Nov 2, 2023

Copy link
Copy Markdown
Collaborator

Depends-on: #8990

The experiments were performed using the following example:

 #ifndef SIZE
 #define SIZE 100
 #endif

 #ifndef UNSAT
 int main()
 {
   int a[SIZE], b[SIZE];
   for(int i=0; i < SIZE; ++i)
     __CPROVER_assert(a[i] + b[i] > a[i], "");
 }
 #else
 #include <limits.h>

 int main()
 {
   int a[SIZE], b[SIZE];
   for(int i=0; i < SIZE; ++i)
   {
     __CPROVER_assert(
       b[i] <= 0 ||
       (a[i] >= (INT_MAX >> 28) || b[i] >= (INT_MAX >> 28)) ||
       (a[i] <= (INT_MIN >> 1) || b[i] <= (INT_MIN >> 1)) ||
       a[i] + b[i] > a[i], "");
   }
 }
 #endif

See https://tinyurl.com/adder-comparison
(https://docs.google.com/spreadsheets/d/1ckrIwxe5QVUu12rs3OaY3o7CQ-UqFnEWvcEOf8cw580/) for results.

  • Each commit message has a non-empty body, explaining why the change was made.
  • Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
  • n/a The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
  • Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
  • My commit message includes data points confirming performance improvements (if claimed).
  • My PR is restricted to a single feature or bugfix.
  • n/a White-space or formatting changes outside the feature-related changed lines are in commits of their own.

@codecov

codecov Bot commented Nov 3, 2023

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0.97087% with 306 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.32%. Comparing base (6df204e) to head (1fd90bb).

Files with missing lines Patch % Lines
src/solvers/flattening/bv_utils.cpp 0.97% 306 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7993      +/-   ##
===========================================
- Coverage    80.46%   80.32%   -0.14%     
===========================================
  Files         1704     1704              
  Lines       188665   188973     +308     
  Branches        73       73              
===========================================
- Hits        151805   151801       -4     
- Misses       36860    37172     +312     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

tautschnig and others added 24 commits June 8, 2026 19:11
…ound

Beame and Liew (CAV 2017 / JACM 2019) proved polynomial-size
regular resolution refutations exist for commutativity and any
degree-2 ring identity on array, diagonal, and Booth multipliers,
and quasi-polynomial (n^O(log n)) for Wallace tree multipliers.
This refutes Biere's 2016 conjecture of exponential hardness.

Our empirical CDCL data (N2 experiment, tab:n2) is still valid
but the framing must change: what we observe is NOT an inherent
proof-complexity lower bound -- it is the gap between the
*existence* of short proofs (proven) and *CDCL's ability to
find them* on standard encodings. Beame and Liew themselves
identify this as an open engineering problem, calling for
'guiding information to add, either to the formulas derived from
the circuits or to CDCL SAT solver heuristics'. Our encoding-
selection approach is precisely such guidance.

Reframing applied in six places:
1. Abstract: 'SAT hardness' -> 'CDCL search difficulty';
   adds reference to Beame-Liew upper bound.
2. §1 contributions (#2): 'causal hardness driver' ->
   'causal driver of CDCL search difficulty' with explicit
   note that proof-theoretic hardness does not hold.
3. §3 Carry Propagation Hardness: new paragraph 'The gap is
   between proof existence and CDCL search' with Beame-Liew
   quotation; removes the old 'grows exponentially' claim and
   Haken-1985 reference as misleading analogies.
4. §3 N2 observation (1): 'Carry presence is the dominant
   causal factor' -> 'Carry presence dominates CDCL search
   difficulty'.
5. §3 N2 observation (4) and Scope of the claim: rewritten
   to cite Beame-Liew upper bound and frame our result as
   CDCL-search-relative, not proof-complexity-absolute.
6. Conclusion: reframes the carry-propagation finding as CDCL
   search difficulty; replaces the N1-as-future-work paragraph
   with 'can CDCL be steered to find Beame-Liew's polynomial
   proofs without encoding guidance' as the open question.

Also adds to Related Work: new 'Resolution upper bounds for ring
identities' paragraph introducing Beame-Liew's construction,
citing their open problem, and positioning our work as
addressing it via encoding engineering.

New bib entries: beame2017towards (CAV), beame2019toward (JACM).

Paper 1: 21 pages, builds clean.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…ing-learned-clauses

Adds two more bibliography entries in the 'escape Resolution hardness'
line of work, and extends the Related Work and Conclusion to
contextualize our encoding-selection approach alongside them.

New bib entries:
- danner2025sat: Julian Danner, 'SAT Solving Using XOR-OR-AND Normal
  Forms and Cryptographic Fault Attacks', PhD thesis, University of
  Passau, April 2025 (reviewers: Kreuzer, Biere). Dissertation is
  citable; it develops CDXCL/CDXCLlite, a CDCL generalization to
  XNF, implemented as Xorricane/2-Xornado, provably escaping
  Resolution lower bounds on XOR-rich instances.
- pollitt2026factoring: Pollitt, Battleman, Fleury, Vizel, Heule,
  Biere, Bryant, 'Factoring Learned Clauses', SAT 2026 (to appear).
  Revives Extended Resolution in modern CDCL: CaDiCaL-ER (GlucosER-
  style definition introduction during conflict analysis) and
  CaDiCaL-FX (BVA extended to XOR/ITE gate detection on learned
  clauses), with polynomial-size ER proofs for Tseitin formulas.

Paper changes:
- Related Work: The 'Resolution upper bounds for ring identities'
  paragraph (which discusses Beame-Liew 2017/2019) is now followed
  by a new 'Escaping resolution hardness in CDCL' paragraph that
  positions three complementary approaches:
  (i) encoding-level guidance (our paper),
  (ii) solver-level generalization to stronger proof systems
      (Danner 2025, and related work on CDCL + XOR reasoning
      with bidirectional Res(XOR) simulation),
  (iii) inprocessing-level definition introduction
      (Pollitt et al. 2026 CaDiCaL-FX).
- Conclusion: updated fourth future-work item to acknowledge
  that two parallel research lines (solver-level generalization
  and inprocessing-level factoring) are also active, and that
  encoding-, solver-, and inprocessing-level guidance are
  combinable.

The Beame-Sun 'CDCL(XOR)' paper (sat-paper90.pdf) is discussed
without citation, as it is accepted but not yet published; we
credit the broad line of work instead.

Paper 1: 21 pages, builds clean.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Authors (alphabetical by surname):
- Armin Biere (Amazon Web Services and University of Freiburg)
- Martin Brain (City St George's, University of London)
- Daniel Kroening (Amazon Web Services)
- Norbert Manthey (Amazon Web Services)
- Michael Tautschnig (Amazon Web Services and Queen Mary
  University of London)

Replaces the 'Michael Tautschnig + TBD' placeholder.

Paper 1: 21 pages, builds clean.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The revised research plan set reflects the discovery that Beame &
Liew (CAV 2017, JACM 2019) have polynomial-size regular resolution
proofs of degree-2 ring identities on array, diagonal, and Booth
multipliers - refuting the exponential-hardness conjecture that
underpinned the earlier N1 plan.

**N1 (rewritten)** - research-plans/N1-cdcl-vs-beame-liew-gap.md:
  Replaces the exponential-lower-bound plan (now dead). New goal:
  characterize why CDCL does not find Beame-Liew's polynomial
  proofs on standard multiplier encodings, and under what formula-
  level transformations it does. Three proposed approaches
  (empirical, interpolation-theoretical, hybrid). Supersedes
  N1-exponential-lower-bound.md, which is removed.

**N3 (new)** - research-plans/N3-beame-liew-implementation.md:
  Plan to implement Beame-Liew's critical-strip construction as
  an executable DRAT proof generator. Reviewer communication
  indicated Beame's student attempted and failed; the failure
  mode is undocumented. We propose phased implementation (single
  strip -> multi-strip -> full refutation -> scaling experiments)
  with heuristic extraction as a stretch goal. Orthogonal to N1
  but could supply Approach A's gold-standard reference proof.

**N4 (new)** - research-plans/N4-multi-encoding.md:
  Ablation design for a new encoding technique: encode the same
  multiplication with two different encodings, shared inputs
  and outputs, separate internal variables, output-equality
  constraints. Implemented as 'secondary_encoding' string on
  bv_utilst, wrapper at the top of unsigned_multiplier, env-var
  CBMC_MULTI_ENCODING wiring in smt2_solver. Early sanity check
  on comm_12 (algebraic disabled):
    - shift-add alone:       533s (T/O-prone)
    - comba-cs alone:        1.04s
    - multi comba-cs + shift-add:  1.08s
  suggests a portfolio-like effect: ~4% overhead vs the faster
  single encoding, 500x speedup vs the slower one. Full
  ablation sweep is running in the background; plan describes
  three possible outcomes (positive / portfolio / negative)
  and their writeup implications.

Implementation details:
- src/solvers/flattening/bv_utils.h: new member
  'secondary_encoding' (std::string), getter/setter.
- src/solvers/flattening/bv_utils.cpp: wrapper at top of
  unsigned_multiplier that, when secondary_encoding is
  non-empty, computes both encodings and set_equal's outputs.
  Handles shift-add/comba/comba-cs/dadda/dadda-cs/wallace/
  booth/block4/sortnet as valid secondary names.
- src/solvers/flattening/boolbv.h: passthrough setter.
- src/solvers/smt2/smt2_solver.cpp: reads CBMC_MULTI_ENCODING
  env var in configure_encodings.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirror the cross-cutting notation and clarity fixes made in the
co-authoring repository (~/multiplier-encodings.git) so the two
versions of Paper 1 stay close in substance even while Martin's
intro rewrite diverges.

Changes applied here:

- 'bitwidth' -> 'bit-width' (15 occurrences) for notation
  consistency.
- 'partial product matrix / array' -> 'partial-product grid'
  throughout.
- Expanded Comba description in Background to explicitly explain
  why the column-wise + immediate-carry combination limits
  inprocessing.
- Rewrote E0-E4 variant descriptions in §3 with explicit
  accumulation-rule pseudo-code, plus a preamble clarifying that
  all five variants share identical partial products and differ
  only in the accumulator.
- Added kroening2016decision and barrett2021smt bib entries (as
  canonical references for bit-blasting).

Paper 1: 21 pages, builds clean.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
N4 ablation sweep completed: 9 benchmarks x 4 primary encodings x
7 secondary encodings (including singleton baselines) = 216
measurements on CaDiCaL with algebraic pre-solver disabled,
60s timeout.

Headline finding: multi-encoding exhibits PORTFOLIO-LIKE BEHAVIOR.
The solver receives both encodings of the same multiplication
with shared inputs and outputs, and performs approximately as a
free portfolio of the two singletons:

- Median multi/min(A,B) = 1.31x  (30% overhead vs best singleton)
- Geom mean multi/min(A,B) = 1.23x
- Median multi/max(A,B) = 0.14x  (7x faster than slower singleton)
- 42% of multi runs within 1.25x of min(A,B)
- 40% more than 1.5x slower than min(A,B) -- trade-off is real
- 44 'rescue' cases where one singleton T/Os but multi solves

New Paper 1 subsection 5.4 'Multi-Encoding Combinations' presents
the implementation, the ablation design, and the portfolio finding
with a headline table. Places the technique alongside the adaptive
encoding-selection heuristic in §5 as a complementary strategy
when pattern recognition is not reliable.

Updates N4 research plan to 'Executed' status with the headline
statistics and the verdict (Outcome N: portfolio-like).

Raw data: data/multi-encoding-results.tsv (216 measurements).

Paper 1: 22 pages, builds clean.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Applied ~17-20 word-form replacements per paper via a conservative
Python script (tmp/us-to-uk.py):

- -ize -> -ise family: normalize, characterize, mechanize,
  generalize, optimize, recognize, minimize, maximize, formalize,
  factorize, organize, emphasize, utilize, hypothesize, specialize,
  initialize, standardize, realize, summarize.
- -ization -> -isation (where applicable).
- -yze -> -yse: analyze, catalyze.
- -or -> -our: behavior, favor (present in 1 location).
- -el -> -ell doubling: modeled -> modelled, labeled -> labelled
  (not present in our text but included for safety).
- US catalog -> UK catalogue (1 location, already British from
  earlier edits).

LaTeX environment and command contexts preserved (negative
lookbehind for backslash ensures e.g. \color{...} in todo macros
and \begin{itemize} are untouched). Technical CS terms kept:
'program' (software sense) stays in 'program verification';
'size' stays as is.

Also corrected the three 'Mechanized' (capitalised) missed by
the initial pass.

Paper 1: 22 pages, builds clean. Paper 2: 20 pages, builds clean.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors the senior-reviewer fix commit d4f06dc from
multiplier-encodings.git. Same substantive changes:

- Harmonise BVE story in §5.1 (three-regime breakdown).
- Fix broken table/section references (Table 4 -> Table 10;
  Section 6 below -> Table tab:best).
- Fix DRAT size for strength_chain_16 (10 MB -> 3.3 MB).
- Remove companion-algebraic citations (Paper 2 not in submission
  state yet; do not cite).
- Clarify multi-encoding T/O accounting.
- Soften 'Proof-guided methodology' -> 'Proof-guided encoding
  design' with explicit scope.
- Fold §6.5 Multi-Encoding into §5.2 Multi-Encoding as Hybrid
  Selection.
- Remove stale 1728-configuration sweep.
- Comment out §4 'Development effort' paragraph.
- Add benchmark-pool intro to §7; add missing labels.

Paper 1: 22 pages, builds clean.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors commit from multiplier-encodings.git. Body reduced from
19 pages to 16 pages by moving supplementary tables to the appendix
and compressing prose in §3 (N2 variants and observations), §4
(automation paragraph), §6.1 (Booth), §7.10 (SMT-COMP), and §8
(Negative Results). No claims removed.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Same four fixes as in multiplier-encodings.git e775198:

- Figure 1 Pass 1/Pass 2 labels moved below their arrows (were
  above, colliding with 'Carry bits:' and 'Result:').
- Adaptive-heuristic ablation table moved to a new appendix, with
  a compact summary retained in the body.
- Conclusion + future-work condensed: single summary paragraph +
  single open-directions sentence, down from ~45 lines to ~25.

(No duplicate subsection heading in this repo's version, so that
fix is not mirrored.)

Main body now ends on page 16-17 (21 pages total).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Same six fixes as multiplier-encodings commit 7xxx:

1. Reworded three Martin Brain 'private communication' references
   to 'earlier unpublished work by one of the authors' (Martin is
   now a co-author).
2. Redesigned Figure 1 with header-style section labels above
   each block group, inline 'sums'/'carries' labels, and Pass
   labels properly spaced above arrows.
3. Folded the one-sentence DRAT lead-in into the ablation
   paragraph.
4. Removed 'A methodological caveat: encoding flags must reach
   the solver' paragraph (internal engineering observation, no
   scientific value to this audience).
5. Merged three-paragraph §7 opening into a single paragraph.
6. Tightened paragraphs with 1-2 word orphan last lines.

Main body ends on page 16; total 21 pages.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors multiplier-encodings.git commit: adds Wallace as an eighth
encoding to the comparison. Key finding (Wallace does NOT track
Dadda, differs by up to 14x on commutativity) applied to:

- Contribution bullet (six- -> eight-encoding)
- Section 2 Background (Wallace alongside Dadda)
- Figure 2 scaling plot (seven encodings including Wallace)
- Appendix D tab:proof-sizes (eight encodings including Wallace)

Same data:
- Wallace proof sizes on comm_11/13 are 94 MB / T/O (vs Dadda
  6.5 MB / 98.6 MB) - dramatically worse
- Wallace scaling times 18.6x slower than Dadda at BW=11
- Wallace tracks Dadda on non-commutativity benchmarks
  (mul_ineq, overflow, strength reduction: within 2x)

Paper still 21 pages; body ends page 16.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors multiplier-encodings.git commit to Paper 1 plus related
Paper 2 update.

Paper 1 (doc/paper-bitblasting/paper.tex):
Same 10 review items (M1-M10) plus cut/compress/polish as the
multiplier-encodings commit. Body reduced from 19 to 14 pages.

Paper 2 (doc/paper-algebraic/paper.tex):
- Replace the short 'two classes' taxonomy (universal/equational
  vs existential) with the full 4-class taxonomy (universal vs
  existential, crossed with equational vs relational) that moved
  here from Paper 1. The full taxonomy is a more natural fit for
  Paper 2 since the algebraic techniques cover the universal
  equational quadrant specifically.
- Remove duplicate 'bit-blasting companion paper' citation that
  now only needs to appear once.

New SMT-COMP isolation data files:
- doc/paper-bitblasting/data/smt-comp-combacs-nosimp.tsv (66
  benchmarks, smt2_solver, combacs encoding, no simplify/algebraic)
- doc/paper-bitblasting/data/smt-comp-shiftadd-nosimp.tsv (same,
  shift-add encoding)

Both configurations solve 41/66 with identical PAR-2 (52.1 s);
per-benchmark ratio on both-solved between 0.83x and 1.15x,
confirming the encoding alone is neutral on a broad community
sample.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors multiplier-encodings.git author-review tweaks 3-5 (the §2
grammar fix did not apply here because cbmc-github had different
phrasing already) and applies the same M1 treatment to the intro
that was applied to Paper 1 in the previous commit on the other
repo.

- Remove the shorter 2-class taxonomy from cbmc-github intro to
  match multiplier-encodings (full 4-class taxonomy now lives in
  Paper 2 intro; a single-sentence segue remains in Paper 1 §10).
- 'An influential line of work' -> 'Beame and Liew'.
- Fix double-hyphen ligature in \texttt{--refine-arithmetic}.
- Spell out Brent-Kung [BK] at first use.

Both papers verified: --refine-arithmetic renders as two hyphens
in the output, Brent-Kung is spelled out in §8.2 (not only in
Appendix A).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Mirrors multiplier-encodings.git: adds Burch 1991 (early
BDD-based multiplier verification), Kumar et al. 2023 (BDD with
partial-product variables, O(n^3) size up to 1024 bits), and
Dutertre 2020 (empirical evaluation of 16 SAT solvers on QF_BV
CNFs finding CaDiCaL superior). Same integration points: Burch
and Kumar in §9 'Algebraic verification'; Dutertre in §7.3
'Cross-Solver Analysis'.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Same fix as multiplier-encodings.git: restore the five per-row
mechanistic explanations of which encoding wins on which pattern,
plus the adaptive-selection consequence.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Captures the research-plan status (N1/N2/N3/N4), paper status, and
recommended prioritisation after the senior-review passes and the
Wallace integration.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…ommutativity

Implements the first prototype for research plan N3 (Beame-Liew
critical-strip construction).

What works:

- bench-multiplication/n3-beame-liew/generate_array_mul_comm.py:
  self-contained DIMACS generator for a*b = b*a on an n-bit array
  multiplier. Validated UNSAT at n = 1..11 by CaDiCaL and minisat.

- bench-multiplication/n3-beame-liew/beame_liew_phase1_v2.py:
  emits a DRAT refutation by
    (i) 2^(2n) leaf 'branch-cut' RUP lemmas (one per (a, b)
        assignment),
    (ii) a binary-resolution tree over the 2n input variables,
    (iii) the empty clause.
  The proof validates with drat-trim at every tested n (1..11).

- On the same CNFs, the prototype's DRAT proof is 4-10x smaller
  than CaDiCaL's DRAT output at n <= 8 and remains smaller at
  n = 9..10 (where CaDiCaL's baseline is itself incomplete at
  the 60s timeout).

  n   BL DRAT  CaDiCaL DRAT  ratio
  2   307      5,002         0.061x
  4   9,987    41,560        0.240x
  6   266,229  994,916       0.267x
  8   6.1 MB   26.4 MB       0.230x
  10  127 MB   213 MB        0.596x

What this is NOT:

This is not the polynomial-size O(n^6 log n) critical-strip
refutation from Beame & Liew 2019. The prototype is O(2^(2n)
* poly(n)), dominated by the flat enumeration. Phase 2-3 of
the N3 plan would replace the 2^(2n) leaves with a read-once
branching program of size poly(n) exploiting the
Delta = log(2n) strip decomposition.

Even so, the prototype establishes that:
(i) the Phase 1 infrastructure works and drat-trim accepts
    the output;
(ii) a structured proof (enumerate first, resolve second) can
     already beat CDCL's proof size at small n by a constant
     factor, suggesting proof structure itself is a useful
     piece of guidance.

Also updates research-plans/N3-beame-liew-implementation.md with
a Status section summarising the above.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Extends the N3 work from the previous Phase 1 commit with a
structural Phase 2 prototype, a BDD sanity check, and an honest
writeup of why the polynomial Beame-Liew bound remains out of
reach within the time budget of one session.

New artefacts under bench-multiplication/n3-beame-liew/:

- generate_array_mul_comm_meta.py: CNF generator that tags every
  variable with its role (input bit, partial product, accumulator,
  carry, output, diff). Required for any strip-level analysis.

- beame_liew_phase2_v2.py: produces a column-structured DRAT
  refutation. For each output bit k in [0, 2n) the generator
  emits a Phase-1-style resolution tree terminating in the unit
  -diff[k], then UP on the diff-OR clause gives the empty
  clause. Validates with drat-trim at n = 2..7.

  The proof is strictly LARGER than Phase 1's flat enumeration by
  ~2n (empirically 5.6x at n=2 -> 15.8x at n=7), confirming that
  per-column decomposition alone cannot beat flat enumeration:
  each column sub-proof has a different suffix literal and cannot
  share the input-enumeration tree.

- phase2_bdd_probe.py: builds BDDs for c[k] and d[k] using the dd
  Python package and confirms structural equality as a sanity
  check that the array multiplier commutativity CNF is correctly
  encoded. Reports per-bit BDD DAG sizes.

  Empirical BDD sizes match Bryant 1991: peak BDD DAG size 462
  at n=6, 4,419 at n=8, 41,148 at n=10. A BDD-based DRAT
  translation cannot yield a polynomial proof either.

Key technical findings, documented in the README and in the N3
plan's Status section:

1. Unit propagation alone cannot derive -diff[k] from partial-input
   strip refutations (UP does not resolve biconditionals).

2. Per-column DRAT lemmas cannot share the 4^n-leaf input
   enumeration because their suffix literals differ.

3. BDDs of multiplier outputs are exponential (Bryant 1991).

4. The polynomial Beame-Liew bound requires extension variables
   in the DRAT proof (the branching program branches on internal
   tableau variables and encodes shared structure via RAT steps,
   not RUP).

Phase 3 work -- branching program construction + RAT translation
via Krajicek's Prop. 2.6 -- remains open.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Two quick follow-up investigations after the Phase 2 prototype:

Option B (trimmed-core comparison):
  - Extracted CaDiCaL's DRAT proof core using drat-trim -l
  - Compared against Phase 1's proof size
  - Finding: even after trimming, CaDiCaL's core is 2.5-4x larger
    than Phase 1 at n >= 5 (0.31x-0.40x ratio, stable across n)
  - Conclusion: Phase 1's advantage is structural (ordered
    enumeration + explicit resolution tree), not a CDCL-noise
    artefact that trimming could fix.
  - Raw data in data-options-ba.tsv; writeup in options-B-A.md.

Option A (RAT extension-variable test):
  - Wrote rat_test.py: 4-clause UNSAT with a RAT-introduced
    extension variable x3 := x1 AND x2 used in deriving empty.
  - drat-trim reports s VERIFIED for the proof.
  - Confirms drat-trim accepts RAT-format clauses introducing
    extension variables, so the toolchain supports the proof
    format needed for a future polynomial Phase 3 construction.
  - (The trimmed report shows '0 RAT lemmas in core' because RUP
    derivations succeed for these particular lemmas on a small
    formula; RUP is a special case of RAT, so this is expected
    drat-trim behaviour.)

Together these clarify (a) that Phase 1's empirical win is robust
to trimming, and (b) that the proof format permits the
extension-variable encoding that a genuine polynomial Phase 3
proof would need.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Downloaded the 29-page arxiv v2 of Beame-Liew 'Towards Verifying
Nonlinear Integer Arithmetic' (1705.04302v2) which contains the
full critical-strip construction (§3, §4) referenced in the N3
research plan. Needed for Phase 3 implementation.

Key construction details (§3.3):
- The refutation is a read-once branching program (Proposition 2.1
  translates BP to ordered resolution directly).
- For each k in [0, 2n], the BP branches on the 'first-disagreeing'
  output bit being bit k.
- Each strip phi_Strip(k) is the sub-formula with tableau variables
  in columns [k-Delta, k] where Delta = log n.
- The per-strip BP branches on inputs row-by-row, merges on 'Cut(j)'
  states of size O(log k).
- Size bound: O(k^5 log k) per strip (Corollary 3.3), O(k^6 log k)
  total.

No RAT / extension variables are needed -- the BP branches on
existing CNF variables, and merging at Cut(j) nodes gives the
polynomial size. Option A's RAT-readiness is therefore not needed
for the actual construction.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Implements extraction of phi_Strip(k) from the full commutativity
CNF per Beame-Liew 2017 §3.3 Definition. A clause is in the strip
iff it references a column-indexed variable (pp, acc, cry, or
diff) whose column is in [k-Delta, k]. Plus:
- tableau symmetry clauses pp_c[i][j] = pp_d[j][i] from the
  preprocessing step in the paper;
- ZERO-constant unit clauses (without which the carry-chain starts
  from an unconstrained value and the strip becomes trivially SAT).

Validation: at n=4, 6, 8 for k=1..2n-1, the extracted
phi_Strip(k) is UNSAT (verified by CaDiCaL). This matches
Beame-Liew's Lemma 3.1.

Note: early attempts omitted (a) non-tableau carry/acc clauses and
(b) ZERO-constant units, producing SAT strips. Both were bugs in
the extraction, not in the paper.

Step 1 complete. Step 2 (per-strip BP construction with Cut(j)
merging) is the next milestone.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…nential

Built BDDs for all strip variables using the dd package to
measure the BP size we'd get if the BP branched in input-variable
order. Results confirm that naive BDD-based BPs are exponential
for middle-k strips, matching Bryant 1991:

  n=4, k=5: peak BDD 55
  n=6, k=7: peak BDD 462
  n=8, k=9: peak BDD 3,537
  n=8, k=11: peak BDD 4,419

The Beame-Liew O(n^6 log n) polynomial bound relies on a SPECIFIC
variable ordering (outputs o^yx first, then tableau row-by-row,
with merging at Cut(j) states). Our input-variable ordering does
not achieve polynomial size.

Files:
- phase3_bdd_size.py: measures BDD sizes per strip variable.
- options-B-A.md updated with the strip BDD-size table and an
  honest explanation of why Step 2 requires the paper's exact
  variable ordering (not yet implemented).

The polynomial construction remains the open Phase 3 work.
phase3_bp.py was removed (it contained scaffolding for the wrong
approach; will rebuild when implementing the paper's BP order).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Rather than hand-implementing Beame-Liew's exact BP construction
(which requires weeks of careful work on the Cut(j) merging
structure), prototype the strip decomposition by:
  1. Extract phi_Strip(k) for each k in [1, 2n-1].
  2. Run CaDiCaL on each strip to get a DRAT refutation.
  3. Compose via top-level case split on k.

Empirical result (bench-multiplication/n3-beame-liew/data-phase3-perstrip.tsv):

  n   PerStrip    Phase 1     CaDiCaL full   PS/P1   PS/CaD
  4   137 KB      10 KB       42 KB          13.68x  3.29x
  6   1.2 MB      266 KB      995 KB         4.58x   1.23x
  7   4.6 MB      1.3 MB      5.2 MB         3.56x   0.88x
  8   22 MB       6.1 MB      26 MB          3.55x   0.82x

Per-strip CaDiCaL beats full-CNF CaDiCaL at n >= 7 (0.82x at n=8,
ratio improves with n). So structurally decomposing the proof
into strips helps a SAT solver, but per-strip is still strictly
worse than Phase 1's ordered case-analysis enumeration (3.55x
larger at n=7, 8).

Obtaining Beame-Liew's O(n^6 log n) polynomial would require
hand-implementing the BP with the paper's specific variable
ordering and Cut(j) merging. That is documented as future work
in options-B-A.md; not implemented in this session.

Summary update in options-B-A.md enumerates what's committed
across Phase 1/2/3 and what remains.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
tautschnig and others added 29 commits June 8, 2026 19:11
…nance

Following Armin's review (widen corpus before revisiting
remaining work), acquired the full SMT-LIB 2024 QF_BV release
(Zenodo 10.5281/zenodo.11061097): 46,191 benchmarks, of which
15,435 contain bvmul and are declared unsat.

Headline finding: triaging the full corpus surfaced the 'float'
family (FP-as-BV, Haller-Griggio-Brain-Kroening FMCAD 2012),
absent from our 66-benchmark sample, where our algebraic solver
DOMINATES due to wide-multiplier structure (53x53->106-bit
significand products):

  On 75 float bvmul-unsat benchmarks @ 60s (low-contention):
    ours     68 solved
    Bitwuzla 51 solved
    cvc5      6 solved
    18 CONFIRMED unique wins (standalone: neither Bitwuzla nor
       cvc5 solves within 60s; all declared unsat; our verdict
       agrees with Bitwuzla everywhere both solve -> sound).

This adds 18 to the paper's previous 4 wins-beyond-all-solvers.

Methodological note recorded in RESULTS.md: a first pass at
36-way parallelism (= core count) manufactured spurious
timeouts via contention (newton.3.3.i 'won' at 36-way but
Bitwuzla solves it standalone in 31s). All win claims
re-confirmed at 1-6 way parallelism. Evaluation should run at
<= cores/4 way.

Committed (repo kept lean; full 35GB corpus stays in staging):
  bench-multiplication/float-fp2bv/
    pow5.smt2, mul_03_30_4.smt2  -- two smallest clean wins
    MANIFEST.tsv                 -- sha256+size for all 75 (re-fetch)
    results-{ours,bitwuzla,cvc5}.tsv -- full per-benchmark evidence
    run-any.sh, run-corpus.sh    -- parallel evaluation harness
    RESULTS.md                   -- writeup + reproduction steps

Also updated doc/paper-algebraic/benchmark-sources.md with the
acquisition status and the AIG->SMT-LIB converter blocker for
the Tier B Konrad gate-level archives.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The float-family discovery reshapes the priority list:

- Add Item 11 'Exploit the float (FP-as-BV) family' as NEW HIGH
  PRIORITY. 18 unique wins-beyond-all-solvers on a canonical
  third-party benchmark family (quadruples headline wins 4->22).
  Four sub-tasks: adopt into eval corpus, diagnose 7 misses,
  write into paper, probe rest of FP. Sub-tasks 1+3 need no new
  code (data already collected).

- Rewrite Recommended sequencing into four tiers:
    Tier 1 (now): Item 11.1+11.3 -- adopt float + write paper
      result. ~2.5 days, biggest headline gain, zero new code.
    Tier 2: Item 11.2 (diagnose misses) + Item 10 Phase 1
      (fragment characterisation) + Item 3 (Sage2 scalability,
      now better quantified: 24 unique fails in stratified run).
    Tier 3: Item 10 Phase 2 gate extensions (SBIF, vanishing
      removal), prioritised by the miss diagnosis.
    Tier 4: Item 7 gate-level (Konrad AIGs downloaded, awaiting
      an AIG->SMT-LIB converter).

- Update top-of-file status with the +18 float wins.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
A co-author asked whether FP operations actually traverse our
algebraic path (we use float_utilst, not float_bvt). Checking
this disproved the previous commit's float-family claim:

1. The float benchmarks are PURE QF_BV -- no Float sort, no fp.*
   operators (the fp.iN tokens are just variable names declared
   (_ BitVec N)). So no FP lowering runs; float_utilst/float_bvt
   is irrelevant here.

2. Ablation (DISABLE_ALGEBRAIC=1) shows the algebraic path is
   NOT responsible for the float 'wins' -- and is net overhead:
     pow5            1.28s ON == 1.28s OFF
     test_v7_..s703  31.2s ON ->  15.3s OFF (FASTER without algebra)
     newton.3.3       8.3s ON ->   6.7s OFF
   All still solve unsat with algebra OFF via bit-blasting +
   MiniSAT (pow5: 171 SAT conflicts). The wins over Bitwuzla/cvc5
   are a SAT-backend artifact, not evidence for the algebraic
   method.

Corrections:
- RESULTS.md: prominent CORRECTION block + retraction note on
  the headline. Data retained for the record.
- benchmark-sources.md: headline corrected.
- remaining-work.md: top status corrected (count stays 4, not
  22); Item 11 marked RETRACTED with the ablation evidence.

New Item 12 (HIGH PRIORITY, Tier 0): ablation re-audit of ALL
claimed algebraic wins. The float episode showed we lacked
ablation discipline -- a 'win' only supports the thesis if it
survives DISABLE_ALGEBRAIC=1. Demonstrated the audit
distinguishes genuine from artifact:
    SABER n16-bw13: 0.02s ON vs 60s TIMEOUT OFF  -> GENUINE
    assoc_64:       0.01s ON vs 0.01s OFF         -> ARTIFACT
Recommended-sequencing reordered: Item 12 is now Tier 0,
gating credibility of all headline numbers before submission.

The corpus widening itself (full QF_BV acquired + triaged,
harness built) remains valid; only the float attribution is
retracted.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…S BUGS

Ran every paper claim algebra-ON vs DISABLE_ALGEBRAIC=1, single-
threaded, cross-checked against z3/Bitwuzla/declared-status.

GENUINE (survive ablation, sound):
  - 4 Brain wins-beyond-all-solvers: NEEDS_ALGEBRA, sound
  - 5 SMT-COMP unlocks (cohencu_0..3, geo3.c_5): NEEDS_ALGEBRA
  - 41/66 = 31 bit-blast + 10 algebra (restate in paper as 10/31)
  - SABER: 26/31 NEEDS_ALGEBRA + 1 ALGEBRA_FASTER (cleanest)
  - DSP mac_equiv: ALGEBRA_FASTER (0.14s vs 45.8s)
  No unsoundness in the UNSAT direction.

CRITICAL FINDING - 2 soundness bugs (algebra says SUCCESSFUL,
bit-blast+z3 say FAILED; the dangerous direction):
  - assoc.c (BW=9): assertion operands cast to signedbv[32] (C
    integer promotion); extractor reasons in 9-bit ring, ignores
    the cast, proves a false-at-32-bit identity.
  - mul_zero_factor.c (BW=8): {a*b=0, a!=0, b!=0} is SAT over
    Z/2^8 (zero divisors: 16*16=0) but algebra refutes it
    (assumes integral-domain/field; no zero divisors).
  div_roundtrip = ARTIFACT (bit-blast also fast).

Added Item 13 (CRITICAL, blocks submission) with fix directions
for both bugs + verification plan + Lean reconciliation note.
Item 13 is now Tier 0 ahead of everything. Custom-suite
'39/39 validates correctness' claim must be withdrawn.

Artifacts: doc/paper-algebraic/ablation-audit-2026-06-01.md;
bench-multiplication/ablation-audit/{smtcomp66,brain-wins,saber,
custom-c}.tsv + ablation{,-cbmc}.sh harness.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The Item 12 ablation audit found two cases where the algebraic
pre-solver reports VERIFICATION SUCCESSFUL when the correct answer
(per bit-blasting and z3) is FAILED -- claiming no bug where a bug
exists. Both are fixed by conservative guards in
boolbvt::try_algebraic_solve that fall back to bit-blasting on the
unsound patterns; genuine algebraic wins are unaffected.

Bug A -- widening cast of a defined intermediate (assoc.c).
C integer promotion encodes `(a*b)*c == a*(b*c)` on 9-bit operands
as cast(ab,signedbv[32])*cast(c,signedbv[32]) == ..., where ab is
the SSA symbol defined by `ab = a*b` truncated to 9 bits. The
single-ring extractor pinned the ring at 9 bits (from the defining
equations) and silently absorbed the 32-bit cast, treating ab as
the exact 9-bit product inside a 32-bit multiply -- so it "proved"
the (9-bit-only) associativity identity. Fix: drop any
(dis)equality in which a widening typecast of a defined-intermediate
symbol is a DIRECT operand of arithmetic (* + - unary-). This is
precise: widening a primary INPUT (mul_overflow's cast(a,32)*cast(b,32))
is a faithful zero/sign extension and is kept; a widen-then-narrow
(matrix_mul's cast(cast(c00,32),8)) is not an arithmetic operand and
is kept; SMT-LIB benchmarks carry no ID_typecast and are untouched.

Bug B -- zero divisors over ZMod(2^d) (mul_zero_factor.c).
A disequality x!=0 is refuted via the Rabinowitsch trick x*e-1=0,
which asserts x is a UNIT. Over a field that equals x!=0, but over
ZMod(2^d) a non-zero element need not be invertible, so
{a*b=0, a!=0, b!=0} (SAT via 16*16=0 mod 256) was wrongly refuted.
Fix: detect the zero-divisor shape -- a product (directly, or via an
SSA symbol definition) constrained to 0 whose two factors are each
separately constrained non-zero -- and skip algebraic solving.
Genuinely-unsat refutations (e.g. cohencu) carry no such pattern.

Verification (ablation re-audit, algebra ON vs DISABLE_ALGEBRAIC=1):
- assoc.c (BW=9) and mul_zero_factor.c (BW=8): now FAILED, matching
  bit-blast + z3.
- No regression: SMT-COMP 66 identical (31 ARTIFACT / 25 BOTH_TO /
  10 NEEDS_ALGEBRA, all 10 unlocks preserved); SABER identical
  (26 NEEDS_ALGEBRA + 1 ALGEBRA_FASTER); 4 Brain wins still
  NEEDS_ALGEBRA; custom C wins (mul_overflow, comm_64, mac_equiv,
  matrix_mul, hash_mul, distrib, mul_square) still SUCCESSFUL and
  fast (algebra preserved).
- groebner unit tests 11/11 pass; smt2_solver regression passes.

Added regression tests (expected FAILED) so neither bug can
silently return:
  regression/cbmc/algebraic-soundness-assoc/
  regression/cbmc/algebraic-soundness-zero-divisor/

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
… 14)

After the two targeted Item 13 fixes (assoc, mul_zero) were
committed, a broad scan ran the solver (algebra ON) on 3,677 SAT
bvmul benchmarks from the full QF_BV corpus and flagged any
returned `unsat` (the dangerous direction). Result: 76 wrong
`unsat`. Splitting by DISABLE_ALGEBRAIC=1:

- 41 ALGEBRA-CAUSED (Sage2 x40, sage x1): genuine unsoundness in
  our algebraic pre-solver, NOT fixed by the two targeted guards.
- 35 PRE-EXISTING (all float): wrong `unsat` even with algebra
  OFF -> a pre-existing CBMC bit-blaster/SMT2 front-end issue,
  outside the algebraic pre-solver; flagged separately.

Root cause is systemic: the Rabinowitsch unit-trick
(diff != 0 |-> diff*e-1 = 0, "diff is a unit") is unsound over
ZMod(2^d) since non-zero != unit, so UNSAT_rabinowitsch does not
imply UNSAT_original. Used at >=3 sites (__rabinowitsch, __rab,
__rab_disj), plus a second distinct mechanism (is_zero()/gb-on-
equalities) accounts for 4 residual that stay wrong even with all
Rabinowitsch sites gated.

Feasibility (experimental DISABLE_RABINOWITSCH gate, reverted):
gating the 3 Rabinowitsch sites fixes 37/41; SABER + 4 Brain wins
survive (sound vanishing path); but cohencu_0..3 + geo3.c_5
regress to timeout (they depend on Rabinowitsch); 4 residual
remain. So a naive sound-only mode regains soundness at the cost
of ~5 SMT-COMP unlocks and still needs the second mechanism
fixed.

Docs:
- ablation-audit-2026-06-01.md: ADDENDUM with the full scan,
  root-cause analysis, feasibility data, design options.
- remaining-work.md: Item 13 -> PARTIAL (down payment); new
  Item 14 (CRITICAL, blocks submission) with the systemic
  analysis, three design options (sound-only ideal-membership;
  unit-aware Rabinowitsch; Lean reconciliation), and Tier-0
  re-prioritisation.
- bench-multiplication/ablation-audit/sat-scan-*.txt: the 41
  algebra-unsound and 35 pre-existing benchmark lists.

The two committed targeted fixes remain correct and tested but
are only a down payment. The algebraic pre-solver is unsound by
default over ZMod(2^d) on ~41 real benchmarks; the paper's
soundness claim cannot stand until Item 14 is resolved.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Answers the concern: we have a Lean soundness development yet the
audit found unsoundness. The proofs are correct; both bugs live
OUTSIDE the verified boundary at the encoding seam:

- Bug A: the Lean model BVExpr is single-width; cast is same-width
  identity; the model cannot represent a width-changing cast. The
  C++ documented this (ASSUMES: casts do not change outer width)
  but the 'MAINTAINED BY' predicate (set_bitwidth )
  did the opposite -- it absorbed wider casts into the narrow
  ring. Documented assumption, not enforced.

- Bug B: the Rabinowitsch disequality encoding (diff!=0 |->
  diff*e-1=0) has NO // PROOF: annotation, NO TRACEABILITY row,
  and is NOT in the BVExpr model (no disequality constructor / no
  e-variable). It is false over ZMod(2^d) (asserts diff is a
  unit). GroebnerSoundness.lean even contains two_not_isUnit but
  filed it under 'incompleteness', never stating the soundness
  bridge that would have required an IsUnit hypothesis.

Why traceability missed it: the bi-directional audit checked that
every // PROOF: ref resolves and every row maps to code -- but
only over the ANNOTATED set. An unannotated soundness-critical
step is invisible to both directions.

Item 14 fix will also close the gap: (1) mechanise the
disequality-encoding soundness obligation (which fails to prove
the unconditional Rabinowitsch step -> forces ideal-membership /
IsUnit side-condition); (2) enforce the single-width assumption
in code; (3) add a coverage check so every soundness-critical
step must be annotated (fail CI otherwise).

doc/paper-algebraic/proof-gap-analysis-2026-06-01.md

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…tsch)

The algebraic pre-solver wrongly reported unsat on 41 real SMT-LIB
SAT benchmarks (broad-scan finding behind Item 13). Root cause: the
Rabinowitsch unit-trick is unsound over ZMod(2^d). Encoding diff != 0
as diff*e - 1 = 0 asserts diff is a UNIT; over ZMod(2^d) a non-zero
element need not be invertible (e.g. 2 mod 4), so UNSAT of the
Rabinowitsch system does NOT imply UNSAT of the original disequality
query. Diagnosis showed 38/41 via the three Rabinowitsch sites and a
further 3 (Sage2/bench_{12880,13209,5552}) via a tree-walk leaf
equality whose bit-decomposition operators (bitand/lshr) overconstrain
the equality system F.

Fix (src/solvers/flattening/boolbv.cpp::try_algebraic_solve):
- Remove the Rabinowitsch unit-trick at all three sites (the
  __rabinowitsch push to the combined system, the single __rab
  per-disequality pass, and the disjunctive __rab_disj pass).
- Promote the sound ideal-membership refutation to default-on: after
  Buchberger, reduce each diff = lhs-rhs modulo the Groebner basis of
  the equalities F; remainder 0 => diff in <F> => diff = 0 on every
  common zero of F => the disequality is UNSAT. Sound over any
  commutative ring (no unit assumption). The per-disequality vanishing
  test is retained (also sound).
- Add a contains_bit_decomp filter so tree-walk leaf equalities
  containing bitand/bitor/bitxor/lshr/ashr/bitnot are not fed into F
  (sound: underconstraining F never turns SAT into UNSAT).
- Annotate every UNSAT-conclusion site with a // PROOF: reference.

Proof gap closed:
- formal-proofs/DisequalityRefutation.lean (zero sorry):
  diseq_refutation_sound / diseq_refutation_unsat (ideal membership =>
  sound refutation, any commutative ring) and
  rabinowitsch_unsound_over_zmod (concrete Z_4 witness: 2 != 0 yet no
  e with 2e = 1), plus ideal_membership_declines_witness.
- scripts/check_proof_traceability.py: fails if any // PROOF:
  reference does not resolve to a real Lean theorem, or if any UNSAT
  conclusion in try_algebraic_solve lacks a // PROOF: annotation. On
  first run it caught four pre-existing DRIFTED references (renamed
  theorems / wrong file) that the manual audit had missed; those are
  corrected here (smt2_parser, poly_ring, poly_extract, boolbv).
- TRACEABILITY.md rows added.

Verification:
- Broad SAT scan (3,677 SAT bvmul benchmarks): 0 non-float wrong-unsat
  (was 41). The 35 float wrong-unsat are pre-existing (reproduce with
  DISABLE_ALGEBRAIC=1) -- separate CBMC bit-blaster/SMT2 issue.
- SABER 4/4 and Brain 4/4 preserved; cohencu_0/1 preserved via the
  sound ideal-membership path (Bitwuzla confirms unsat).
- Lost: cohencu_2, cohencu_3, geo3.c_5 (3 SMT-COMP unlocks) -- the
  accepted soundness cost. SMT-COMP NEEDS_ALGEBRA 10 -> 6.
- groebner unit tests 11/11; assoc/mul_zero still FAILED; new
  regression test regression/smt2_solver/algebraic-rabinowitsch-
  soundness passes; coverage check passes; build clean, no warnings.

Outstanding (Item 14b, docs): the paper still presents and claims
soundness for the Rabinowitsch trick (paper.tex Theorem proof sketch,
contributions table, disequality prose) and must be rewritten.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…er rewrite

Traced where the (now-removed) Rabinowitsch trick came from, to inform
the Item 14b paper rewrite. It was not invented here: it is the
classical Nullstellensatz device (f != 0 |-> exists y. y*f = 1),
adopted directly from the explicit future-work suggestion of the paper
we build on, Song et al. 2024 (Equational Bit-Vector Solving via Strong
Groebner Bases), Section 7:

  "extend our methods to more bit-vector arithmetics, such as bitwise
   operations and inequalities by, e.g., the Rabinowitsch trick
   [19, Ch. 4.2, Prop. 8], [29]."

Both references they cite are FIELD settings where the trick is sound:
[19] Cox-Little-O'Shea, Ideals, Varieties, and Algorithms, Ch. 4
(Nullstellensatz over an algebraically closed field); [29]
Hader-Kaufmann-Kovacs, SMT solving over finite field arithmetic.

Consequence for the paper: this is a scholarly CORRECTION, not just a
disclaimer. We are the first to implement that suggested extension in
the finite-RING setting ZMod(2^d), and we show it is unsound there
(ZMod(2^d) is not a field; the trick encodes "f is a unit", and a
nonzero non-unit such as 2 in Z_4 breaks it -- see
DisequalityRefutation.lean::rabinowitsch_unsound_over_zmod). The cited
field results remain correct; we caution against importing them to
finite rings, and use the sound ideal-membership / vanishing refutation
that the SCA multiplier-verification literature uses.

- Added references: cox2015ideals, hader2023finitefield.
- remaining-work.md Item 14b: provenance + recommended framing.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…e width bug

Replaces the incomplete ideal-membership-only disequality handling with
Song et al.'s ring-aware encoding z*(a-b) - 2^{d-1} = 0, which is
equisatisfiable with a != b over ZMod(2^d) (Song et al. Prop. 5/6) and
hence sound AND complete for the refutation (UNSAT) direction. Added at
the single- and disjunctive-disequality sites in try_algebraic_solve;
the vanishing test remains as a fast sound pre-check.

Root-caused and fixed a PRE-EXISTING soundness bug exposed by the
encoding (and also latent in the previous ideal-membership path):
extract_predicate could emit the spurious constant `1` (i.e. 1 = 0) for
a width-inconsistent signed comparison. A signed `0 <s x` is
canonicalised to `x >=s 1`, sign-XOR-transformed into a lower bound
whose constant (2^31+1) was computed at width 32 while the ring width
was d=16; 2^31+1 >> 2^16 tripped the "bound impossible" branch,
poisoning F so the main Groebner check reported a spurious UNSAT.
Fix: a faithful d-bit constant always satisfies C < 2^d, so if
C >= 2^d the operands are width-inconsistent and we return nullopt
(defer to bit-blasting) instead of emitting an unsound constant. Never
fires on well-formed predicates (no completeness loss).

Completeness improvement: has_constant now refutes on ANY non-zero
constant, not only odd/unit ones. A constant evaluates to itself under
every assignment but every ideal element must evaluate to 0, so a
non-zero constant (even an even one) admits no solution. This is needed
for completeness of the 2^{d-1} encoding, whose refutation witness may
be even.

Lean (formal-proofs/DisequalityRefutation.lean, zero sorry):
- nonzero_constant_no_solution: any non-zero constant in the ideal =>
  no solution (generalises the odd/unit case; backs has_constant).
- song_encoding_equisat: Song Prop. 5/6, u != 0 iff exists z, z*u =
  2^{d-1}, with the forward direction proved by the 2-adic valuation
  argument (u = 2^a * odd, a < d) reusing the odd=>unit fact.

Verification:
- Broad SAT scan (3,677 SAT bvmul benchmarks): 0 non-float wrong-unsat
  with full Song enabled. The 35 float wrong-unsat are pre-existing
  (reproduce with DISABLE_ALGEBRAIC=1), a separate bit-blaster issue.
- SABER, Brain, and cohencu_0/1 (Bitwuzla-confirmed unsat) preserved.
  cohencu_2/3 + geo3.c_5 remain undecided (time out) -- they are
  unverifiable (z3/cvc5/Bitwuzla all time out at 300s; declared
  unknown) and their old "unsat" came from the unsound unit-trick, so
  not claiming them is correct.
- groebner unit tests 11/11 (the even-constant case now correctly
  UNSAT); cbmc + smt2_solver regression tests pass; coverage check
  passes; build clean, no warnings.

Soundness is established; this is a sound UNSAT pre-solver (not a
standalone decision procedure) -- non-refuted queries still defer to
bit-blasting, which provides overall completeness for QF_BV.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
3-assert ddmin minimisation (from Sage2/bench_2155, declared sat) of the
spurious-constant bug fixed in 2ff48a05bf, with a README explaining the
root cause and fix.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…nowitsch framing)

Updates the algebraic paper to reflect the implemented soundness fix
(Item 14). The textbook field Rabinowitsch trick is unsound over
ZMod(2^d); we adopt Song et al.'s ring-aware z*(a-b) - 2^{d-1} = 0
encoding, which is equisatisfiable with a != b (Song Prop. 5/6).

- Replaced the false "Soundness" theorem/proof (which claimed the
  unit-trick sound) with Theorem (Song Prop. 5/6): u != 0 iff exists
  z, z*u = 2^{d-1}, plus the non-zero-constant refutation criterion,
  citing the mechanised DisequalityRefutation.lean theorems.
- Added a remark documenting that the textbook Rabinowitsch trick is
  field-only (sound over fields per cox2015ideals / hader2023finitefield
  / Ozdemir23cvc5; unsound over the finite ring ZMod(2^d)), and that a
  prototype using it was unsound on real benchmarks.
- Contributions table + list, the Buchberger algorithm pseudocode
  (now refutes on ANY non-zero constant, not only odd), the two worked
  examples (commutativity, cohencu), the evaluation prose, the Lean
  soundness-contributions item, and the conclusion all updated from
  "Rabinowitsch" to Song's 2^{d-1} encoding.
- Related work: new "Scaling to large bitwidths" paragraph citing
  Hofstadler et al. (multimodular homomorphic images) and the SCA
  multiplier line (kaufmann2019, ritirc2017/2018), with an honest
  limitation: multimodular CRT does not transfer to native ZMod(2^d)
  reasoning (prime-power modulus; Boolean-gate model equivalence).
- Bib: added cox2015ideals, hader2023finitefield (earlier), and
  kaufmann2019, ritirc2017, ritirc2018, hofstadler2026.
- Corrected the now-stale Lean theorem count (135 -> over 150) to
  reflect the added DisequalityRefutation theorems.

Paper compiles cleanly (no undefined refs/citations).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…w=16)

The cohencu-style-quadratic smt2_solver test expected UNSAT, but at
bw=32 that UNSAT was only reachable via the textbook Rabinowitsch
unit-trick removed in Item 14 (unsound over Z_{2^d}). With Song et
al.'s sound z*(a-b) - 2^{d-1} encoding the refutation no longer
completes within the regression budget at bw=32 (a completeness
limitation relative to the earlier unsound shortcut), so the test hung.

The query is genuinely UNSAT at any bitwidth (substituting the two
constraints makes (z*z+12) - (6z+12y) vanish identically). Reducing to
bw=16 keeps deterministic, oracle-verified (z3 + cvc5 both UNSAT)
coverage of the F4 tail-reduction / non-leading-term path while
remaining within budget. Comment updated to describe Song's encoding
and the bw=32 limitation.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…qual()

Commit d06d50c678 added redundant clauses (eq[i] OR eq[i+1]) for 10-13
bit equality checks, on the premise that "if bit i differs, the adjacent
bit must be equal". That premise is false: two adjacent bits can both
differ (e.g. op0=0b00, op1=0b11 has eq[0]=eq[1]=false), so the clause
over-constrains the SAT instance and makes satisfiable formulas
containing 10-13 bit equality checks spuriously UNSAT.

Found via a corpus soundness sweep: 98/98 declared-sat mcm/ benchmarks
and a family of float/ benchmarks returned unsat where z3, cvc5 and
upstream all return sat; git bisect identified d06d50c678 as the first
bad commit. Removing the clause flips all of them back to sat.

Regression test distilled (via ddsmt) from mcm/06.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The assert command defers all assertions into deferred_assertions, which
are only encoded (solver.set_to_true) at solving time. check-sat flushes
them, but check-sat-assuming never did, so the entire regular assertion
stack was silently ignored under check-sat-assuming: e.g.
  (assert false) (check-sat-assuming ())
wrongly returned sat. Flush deferred_assertions in check-sat-assuming
exactly as check-sat does, before adding the temporary assumptions.

Covered by the existing regression test basic-bv1/check-sat-assuming1
(previously failing). Minimal reproducer distilled via ddsmt.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…nctions

Deferred bit-blasting collects assertions and replays them in
finish_eager_conversion() when the algebraic pre-solver does not refute.
The replay happened after functions.finish_eager_conversion(), which
generates the congruence axioms for uninterpreted-function applications
from the set of applications seen so far. A term occurring only inside a
deferred assertion (e.g. m(zero_extend(x))) was therefore registered too
late to receive its congruence axioms, dropping constraints and yielding
spurious sat. For example:

  (assert (= #x00000001 (m #x00000000)))
  (assert (= #x00000000 ((_ zero_extend 31) x)))   ; forces x = 0
  (assert (= #x00000000 (m ((_ zero_extend 31) x))))

is unsat (m(0) = 1 and, by congruence, m(zero_extend(x)) = m(0) = 0) but
was reported sat.

Move the replay before functions.finish_eager_conversion() (array axioms
in SUB::finish_eager_conversion already run after the replay, so they were
unaffected). The algebraic solve still runs first so a successful
refutation still short-circuits the replay. This fixes the bug at the
source; DISABLE_DEFER_BITBLAST=1 is no longer needed. Regression distilled
via ddsmt from uninterpreted-functions/uf1.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…type

The Level-3 candidate-assignment hint in try_algebraic_solve rebuilt each
variable's symbol as unsignedbv_typet(bw), where bw is the ring bitwidth
of the Groebner extractor. For a variable whose real width differs from
bw this created a width-inconsistent entry in boolbv_map: the subsequent
normal conversion of an assertion mentioning that symbol then tripped the
get_literals invariant (literal_map.size() == width) and aborted.

This crashed cbmc (SIGABRT) on Initialization6 and havoc_slice/test_struct_*
whenever the algebraic layer was active.

The hint is a soft, gate-retractable guidance clause, so it is sound to
reconstruct the symbol with its real type (collected from the typed
equation expressions) and to skip any variable whose width does not match
the ring width. Now no width-inconsistent map entry is ever created.

Verified: Initialization6 and havoc_slice/test_struct_{a,c,d} run to
completion again with the algebraic layer enabled; algebraic-soundness
tests and the full smt2_solver suite still pass.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…ttribution

Adds soundness-sweep-findings-2026-06-03.md documenting the corpus-scale
differential sweep, the four bugs it found (with ddsmt-minimised
reproducers and git-bisect root causes), and the fixes
(8a6f7b2ef1, 9c937299ab, c1ceecbdd0, 7fab6c5d8f).

Key correction: the float/mcm wrong-unsat results were previously
described in ablation-audit-2026-06-01.md as "pre-existing CBMC
bit-blaster" issues. A merge-base baseline build returns the correct
sat, and git bisect attributes them to branch commit d06d50c678 (an
unsound adjacent-equality SAT-encoding optimisation), now fixed.
Added correction notes at both spots in the audit doc.

The algebraic pre-solver itself remains sound: 0 wrong answers
attributable to it across the full bvmul sat corpus (4,361) plus a
4,000-file cross-corpus sample.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…suite + per-tweak ablation

Re-ran the evaluation with the now-sound binary (after the four
soundness/robustness fixes) and refreshed paper.tex:

- Random-polynomial: replaced the 210-benchmark random sample (not
  reproducible -- the harness used shuf) with the ENTIRE suite, all
  3498 benchmarks across both seeds and eight categories. Solved
  counts: shift-add 1972, Comba/CS 1970, algebraic 2128, Bitwuzla
  3021, cvc5 2678 (union 3104). Wins beyond all solvers: 59 (was 4 on
  the sample); over bit-blasting: 257 (was 18). Per-category table
  updated to full-suite denominators. Removed the stale per-benchmark
  timing table (and the 0.03-0.20 vs 0.3-3.7s discrepancy).
  Soundness: 0 sat-vs-unsat disagreements and 0 verdicts contradicting
  declared :status across all 3498 x 5 configurations.

- SMT-COMP sample (66): refreshed to shift-add 37/66 (PAR-2 57.1) and
  algebraic 43/66 (PAR-2 44.8); 0 disagreements; still overtakes cvc5
  and unlocks 8 benchmarks bit-blasting cannot decide.

- Added a leave-one-out per-tweak ablation (Table tab:tweak-ablation)
  on a 26-benchmark cross-suite subset: Tseitin propagation (4
  wienand), the vanishing test (3), and the boolean tree walk
  (geo3.c_5) are load-bearing for solvability; the finer tweaks
  (F4 interreduction, deferred bit-blasting, if-case elimination,
  nonzero fast-path, GB ordering) affect only running time.

- SABER 31/31, custom 39/39, DSP 5/5 re-confirmed (0 disagreements).

Raw results added under doc/paper-algebraic/data/. Paper compiles
cleanly (46 pages, no undefined refs).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…tive, validation methodology)

Reconciles the paper with the post-fix code and resolves reviewer-flagged
inconsistencies:

- Stale numbers: intro and conclusion "4 wins" -> 59 (full-suite result);
  conclusion "18 Lean modules" -> 20 (matches the status summary and the
  actual module count).
- "No correctness regressions / 691 tests": corrected to the accurate CORE
  count (1167), states zero soundness/correctness regressions with the
  algebraic layers enabled, and notes the two remaining failures are
  non-algebraic timeouts (reproduce with the layer disabled). Added a
  validation-methodology paragraph (mechanised proof + corpus-scale
  differential sweep vs z3/cvc5/baseline) and disclosed that the sweep
  found and fixed integration bugs (the get_literals width abort).
- odd/unit-constant -> any non-zero constant: swept the prose and
  Algorithm-1 narrative; reframed the StrongGB negative results
  (naive_completeness_is_false, F={C 2}) as the motivation for
  generalising has_constant from odd-only to any non-zero constant
  (sound by nonzero_constant_no_solution), under which F={C 2} is now
  decided UNSAT; residual incompleteness (function-vanishing, budget)
  retained. Updated StrongGB.lean header to match.
- Defer.lean: documented the faithfulness ASSUMES (verdict depends on the
  committed set, not replay order) and that a real bug (UF-congruence
  replay ordering, found by differential testing, not by the proof)
  violated it; the fix restores it. Added as the third
  ASSUMES/MAINTAINED-BY story in the conclusion and proofs section.
- Theorem 1 (vanishing): made domain width n vs codomain width m explicit
  and scoped the small-n case to the mixed-width remark.
- SABER N=256: reconciled 11.25 s (Table saber-scaling) vs 13.4 s
  (Table saber-defer) as uninstrumented vs RSS-instrumented runs.
- Clarified the 150 theorems / 65 mapping entries / 37 source entities
  denominators.

Paper compiles cleanly (47 pages, no undefined refs). Lean edits are
docstring-only.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
These are patch/merge backup files (*.orig) that were committed by
accident; they are duplicates/leftovers of their live tracked
counterparts (or have no counterpart) and should never have been
tracked. Removing them; the live files are unaffected.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Prevents recurrence of the accidentally-committed build-cov/build-debug
trees, *.gcov/*.gcno coverage output, and *.orig patch backups.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Updates REPRODUCE.md to match the refreshed paper: the full
3498-benchmark random-polynomial run (replacing the old 210 sample),
the leave-one-out per-tweak ablation, the corpus differential
soundness sweep, the refreshed SMT-COMP/SABER/custom numbers, the
complete DISABLE_* ablation-variable table, and pointers to the new
data/*.tsv files and the soundness-sweep findings doc.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…smoke vs z3)

Wires two guards recommended in the paper:
- proof-traceability lint: runs scripts/check_proof_traceability.py (no
  build) so every PROOF: reference resolves and every UNSAT conclusion in
  try_algebraic_solve stays annotated;
- soundness smoke: builds smt2_solver and runs the algebraic regression
  queries (adjacent-equality, deferred-replay, rabinowitsch, cohencu-F4,
  check-sat-assuming), asserting each verdict matches the expected result
  AND agrees with z3 -- the wrong-verdict cross-check the standard
  output-matching regression harness does not provide.

New script scripts/algebraic_soundness_smoke.sh (also runnable locally).
Triggers on changes to the algebraic source, boolbv/bv_utils/smt2, the
proofs, and the smt2_solver regressions.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The deferred-bit-blasting replay in finish_eager_conversion called
SUB::set_to directly, bypassing the boolbv_set_equality_to_true
optimisation that the eager (non-deferred) tail of set_to applies to
equalities. For queries with large equalities and no algebraic content
(e.g. union_large_array's 100k-element array union) this fully
bit-blasted the equality on replay, producing a far harder SAT instance
than the eager path and timing out (>300s vs <1s eager). git bisect
identified the deferred-bit-blasting default-on commit (d6f3c405a4) as
the first bad commit for this regression.

Replay now inlines the post-deferral tail of set_to (the
boolbv_set_equality_to_true fast path) so deferred replay reproduces the
eager clause structure, not just the eager models. union_large_array
returns to ~1s; algebraic regression, SABER, and the smt2_solver suite
are unaffected.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…rray) as found+fixed

Updates the validation paragraph: the union_large_array CORE timeout was
root-caused (deferred replay bypassed the boolbv_set_equality_to_true
eager optimisation, fully bit-blasting a 100k-element array-union
equality; bisected to the deferral-default commit) and fixed by applying
the equality optimisation on replay. Only a pre-existing thorough-tagged
constant-array timeout remains (also times out on the baseline).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…2 45.8)

Re-measured the two cbmc rows of the SMT-COMP sample under the paper's
6GB memory limit (the prior 43/44.8 used 12GB): shift-add 37/66 (57.1),
algebraic 42/66 (45.8) -- one benchmark needs >6GB. Numbers confirmed
identical before and after the deferred-replay equality fix (no large
arrays in this sample). Caption now states the 6GB limit; raw data in
data/smtcomp-sample-{shiftadd,algebraic}-6gb.tsv.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…ding)

- Combined-effect: replace the misleading 'unlock eight (37/66 -> 42/66)'
  with the accurate net: algebra decides 8 that shift-add cannot while
  shift-add decides 3 the algebraic config times out on (net +5).
- Conclusion: 'a third bug' -> 'further integration bugs ... most
  tellingly the deferred-replay issue', consistent with the three
  differential-found fixes documented in Threats.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant