Skip to content

perf: route BitVec.flattenList through the chunked Array merge core#14097

Draft
kim-em wants to merge 7 commits into
leanprover:masterfrom
kim-em:flattenList-chunked-array-core
Draft

perf: route BitVec.flattenList through the chunked Array merge core#14097
kim-em wants to merge 7 commits into
leanprover:masterfrom
kim-em:flattenList-chunked-array-core

Conversation

@kim-em

@kim-em kim-em commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

This PR makes the compiled implementation of BitVec.flattenList share the chunked Array (Nat × Nat) balanced-merge core (mergePass/treeMerge) that BitVec.ofBoolListLE uses, replacing the divide-and-conquer flattenList.toNatAux as the @[csimp] target. The new BitVec.Internal.flattenListImpl packs chunkCap n = max 1 (64 / n) width-n values per ~64-bit chunk (packChunkBV, collectChunksBV), then tree-merges. This recovers O(1) stack usage and removes the per-node take/drop sublist allocation of the previous O(log L)-depth divide-and-conquer worker; for small widths the packing also collapses the bottom log₂ (chunkCap n) levels of the merge tree into single machine-word operations.

Correctness reuses the existing Array-core lemmas (treeMerge_eq_flattenList, flattenList_append, flattenList_pack): a chunk-local flattenList_collectChunksBV identity composes with a toNat_flattenList_eq bridge reconciling the head-high orientation of BitVec.flattenList with the head-low Nat × Nat spec (mirroring the reverse already used by ofBoolListBE). The now-superseded divide-and-conquer flattenListFast/flattenList.toNatAux and the flattenList_eq_flattenListFast theorem are removed.

Blocked by #13576 (fix: tail-recursive BitVec.ofBoolListLE/ofBoolListBE to avoid stack overflow): this branch is stacked on it and reuses its BitVec.Internal Array merge core, so the diff shown here includes that PR's changes until it merges. Review/merge #13576 first; this PR should then be retargeted/rebased onto the result.

🤖 Prepared with Claude Code

kim-em and others added 6 commits June 18, 2026 03:57
…ck overflow

This PR adds tail-recursive replacements for `BitVec.ofBoolListLE` and
`BitVec.ofBoolListBE`, registered via `@[csimp]`, to avoid stack overflow
on lists with ~1M elements.

The reference definitions in `Init.Data.BitVec.Basic` recurse via `concat`,
which is clean for proofs but allocates O(n) stack frames. The new
implementations in `Init.Data.BitVec.Impl` pack bits in 64-bit chunks
(`packChunk`, `collectChunks`) and combine them via a balanced tree merge
(`mergePass`, `treeMerge`), giving O(n log n) work and O(1) stack usage.

Correctness is established via a list-level spec function `flattenList`
giving the intended Nat semantics of `(value, width)` pairs, with
`flattenList_append`, `flattenList_mergePassList` (key bit-packing
identity), and a chunk-local `testBit_flattenList_collectChunks_aux`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
….Internal`

Codex review feedback applied:
- Mark `packChunk`, `collectChunks`, `mergePass`, `treeMerge` as `private` —
  these have unreachable-fuel branches that make them bad public contracts.
- Mark proof-only scaffolding (`flattenList`, `totalWidth`, `WellFormedList`,
  `mergePassList`) as `private` to avoid exposing them as API.
- Extract `half_le_pow_of_le_double` (the `arr.size ≤ 2^(k+1) → (arr.size+1)/2 ≤ 2^k`
  bound used in the `treeMerge` halving step) into a standalone lemma. This
  isolates the omega/`Int.pow_succ` workaround to one place and turns
  `treeMerge_go_eq_flattenList`'s arithmetic step into a one-liner.

Public surface is now just `ofBoolListLEImpl`, `ofBoolListBEImpl`, and the two
`@[csimp]` theorems.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the file-wide `public section` with `public` modifiers on just the
two `ofBoolListLEImpl`/`ofBoolListBEImpl` defs and the two `@[csimp]` theorems.
Everything else (the chunked-encoding helpers `packChunk`, `collectChunks`,
`mergePass`, `treeMerge`, the proof-only `flattenList`/`totalWidth`/
`WellFormedList`/`mergePassList`, and all auxiliary lemmas) is now file-local.

Also drop redundant imports `Init.Data.Nat.Lemmas` (transitive via
`Init.Data.Array.Lemmas`) and `Init.Data.List.Lemmas` (transitive via
`Init.Data.List.Nat.TakeDrop`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR marks `BitVec.ofBoolListLE` and `BitVec.ofBoolListBE` as
`noncomputable`, so the `@[csimp]` lemmas pointing them at their
tail-recursive implementations always take effect and there is no risk
of compiling against the non-tail-recursive reference definitions. This
also keeps the `elab/csimpCore.lean` test invariant that `@[csimp]` is
only applied to `noncomputable` defs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tenList` identities

This PR rewrites the correctness proofs in `Init.Data.BitVec.Impl` so that
every step is an unconditional `Nat` identity, removing the `WellFormedList`
invariant entirely. The key observation is that `flattenList_pack` — merging
two adjacent `(value, width)` fields — is pure `|||`/`<<<` algebra and needs
no `value < 2^width` hypothesis. With that hypothesis gone, `flattenList_mergePassList`,
`treeMerge_eq_flattenList`, and `flattenList_collectChunks` no longer need
well-formedness either, so `WellFormedList`, `flattenList_lt`,
`mergePassList_wellFormed`, `collectChunks_wellFormed`, and the bespoke
`testBit`-indexing lemmas all become dead code.

The runtime definitions (`packChunk`, `collectChunks`, `mergePass`, `treeMerge`,
`ofBoolListLEImpl`, `ofBoolListBEImpl`) are unchanged; benchmarked speed and all
test results are identical. `Impl.lean` shrinks from 680 to 450 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR shortens several proofs in `Init.Data.BitVec.Impl` without touching the
runtime definitions: it replaces hand-rolled `toList_size_one`/`toList_size_zero`
proofs with the existing `Array.size_eq_one_iff`/`List.length_eq_zero_iff`,
collapses the `cases r` branches of `packChunk_used`/`packChunk_rest`, drops the
trivial `mergePassList_nil`/`mergePassList_singleton` lemmas in favour of inline
`simp`, and golfs `mergePass_size`, `totalWidth_map_leaf`, and
`getLsbD_ofBoolListLEImpl`. `Impl.lean` shrinks from 450 to 426 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kim-em kim-em added the changelog-library Library label Jun 18, 2026
@github-actions github-actions Bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Jun 18, 2026
@mathlib-lean-pr-testing

Copy link
Copy Markdown

Mathlib CI status (docs):

  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase 1fb7efad0255f2f227a68e7023c4c5a6a8e42265 --onto 4792cd22887c8b529a351f6563b693426ff2a8f8. You can force Mathlib CI using the force-mathlib-ci label. (2026-06-18 04:50:22)

@leanprover-bot

Copy link
Copy Markdown
Collaborator

Reference manual CI status:

  • ❗ Reference manual CI will not be attempted unless your PR branches off the nightly-with-manual branch. Try git rebase 1fb7efad0255f2f227a68e7023c4c5a6a8e42265 --onto 803553a556fd82fa1060efb0c43eda542130cb16. You can force reference manual CI using the force-manual-ci label. (2026-06-18 04:50:23)

@kim-em kim-em force-pushed the flattenList-chunked-array-core branch from cac13d7 to 2083268 Compare June 18, 2026 05:37
This PR makes the compiled implementation of `BitVec.flattenList` share the chunked
`Array (Nat × Nat)` balanced-merge core (`mergePass`/`treeMerge`) that
`BitVec.ofBoolListLE` already uses, replacing the divide-and-conquer
`flattenList.toNatAux` as the `@[csimp]` target. The new `BitVec.Internal.flattenListImpl`
packs `chunkCap n = max 1 (64 / n)` width-`n` values per ~64-bit chunk (`packChunkBV`,
`collectChunksBV`), then tree-merges, giving `O(1)` stack usage and avoiding the
`O(log L)`-depth recursion and the per-node `take`/`drop` sublist allocation of the
previous divide-and-conquer worker. For small widths the packing also collapses the
bottom `log₂ (chunkCap n)` levels of the merge tree into single machine-word operations.

Correctness reuses the existing `Array`-core lemmas (`treeMerge_eq_flattenList`,
`flattenList_append`, `flattenList_pack`): a chunk-local `flattenList_collectChunksBV`
identity composes with a `toNat_flattenList_eq` bridge that reconciles the head-high
orientation of `BitVec.flattenList` with the head-low `Nat × Nat` spec (mirroring the
reverse already used by `ofBoolListBE`). The divide-and-conquer `flattenListFast` is kept
as a reference implementation; `flattenList_eq_flattenListFast` remains as a plain theorem.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kim-em kim-em force-pushed the flattenList-chunked-array-core branch from 2083268 to 6aace2f Compare June 18, 2026 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog-library Library toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants