soundness#235

Draft
a10y wants to merge 1 commit into develop from aduffy/soundness

a10y commented Jun 24, 2026

Contributor

Fix OOB read in Decompressor::decompress_into

The bug

Decompression maps each code byte to a symbol with self.symbols.get_unchecked(code) /
self.lengths.get_unchecked(code). Those slices are only n_symbols long, but nothing
guaranteed the incoming code was in range. The escape code is 255, so any non-escape
code in [n_symbols, 254] — from corrupt input, a truncated stream, or a mismatched
symbol table — sailed past the escape handling and indexed the tables out of bounds.
That's an out-of-bounds read, i.e. UB, reachable from safe callers passing attacker- or
corruption-controlled bytes.

The fix

A code is valid iff code < n_symbols. The key idea: validate before every table
access, so get_unchecked is only ever reached with an in-bounds index. No table
padding, no per-call copy, no change to the Decompressor type.

  • Fast 8-byte loop (escape_mask == 0): a single branchless SWAR check,
    any_byte_ge(next_block, n_symbols), validates all eight codes at once before the
    unchecked stores. This is the vectorized check.
  • Escape path: the leading codes (positions 0..first_escape_pos) are validated in
    one masked SWAR call. The raw escaped bytes are written directly (not table lookups),
    so they need no check.
  • Byte-at-a-time fallbacks: a scalar code >= n_symbols check before the access.
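The byte-at-a-time rule can be sketched as follows. This is a minimal illustration, not the code in this PR: decode_scalar, the ESCAPE constant name, and the u64/u8 table element types are assumptions, and it returns None where the PR panics so that the behaviour is easy to assert on. Note that the escaped literal is copied straight from the input and never touches the tables, which is why it needs no range check.

```rust
/// Escape code; the PR description states it is 255.
const ESCAPE: u8 = 255;

/// Hypothetical scalar decoder: every table access is preceded by a
/// `code < n_symbols` check, so `get_unchecked` only sees in-bounds indices.
/// Returns `None` on an invalid code or a truncated escape.
fn decode_scalar(symbols: &[u64], lengths: &[u8], input: &[u8]) -> Option<Vec<u8>> {
    // Taking the minimum keeps this sketch sound even if the tables differ in length.
    let n_symbols = symbols.len().min(lengths.len());
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let code = input[i];
        if code == ESCAPE {
            // Escaped literal: the next byte is written verbatim, no table lookup.
            out.push(*input.get(i + 1)?);
            i += 2;
            continue;
        }
        let code = code as usize;
        if code >= n_symbols {
            return None; // invalid code: rejected before any table access
        }
        // SAFETY: code < n_symbols <= symbols.len() and code < n_symbols <= lengths.len().
        let (sym, len) = unsafe {
            (*symbols.get_unchecked(code), *lengths.get_unchecked(code))
        };
        let bytes = sym.to_le_bytes();
        // A length above 8 would hit a safe slice panic here, not UB.
        out.extend_from_slice(&bytes[..len as usize]);
        i += 1;
    }
    Some(out)
}
```

With a two-entry table, codes 0 and 1 decode normally, while any code in [2, 254] is rejected before the lookup.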

On a violation we set a flag, break 'decode, and panic after the decode region
rather than returning a length derived from invalid data. We never index out of bounds on
the way there. (The optimizer collapses the flag + final assert into a direct branch to a
single cold panic site — there is no per-iteration flag overhead.)
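The control flow described above looks roughly like this. It is a sketch under assumptions: count_valid_or_panic is a hypothetical name, escapes and the actual symbol stores are omitted, and only the flag, the labeled block, and the single deferred panic are shown.

```rust
/// Hypothetical illustration of the flag + `break 'decode` + deferred panic shape.
/// Returns the number of codes processed; panics if any code is out of range.
fn count_valid_or_panic(codes: &[u8], n_symbols: usize) -> usize {
    let mut invalid = false;
    let mut processed = 0usize;

    'decode: {
        for &code in codes {
            if code as usize >= n_symbols {
                // Record the violation and leave the whole decode region
                // without touching the tables.
                invalid = true;
                break 'decode;
            }
            // The real decoder would do its unchecked table lookup and store here.
            processed += 1;
        }
    }

    // Single panic site after the decode region: no length derived from
    // invalid data is ever returned.
    assert!(!invalid, "invalid code in compressed input");
    processed
}
```

Keeping the panic outside the loop gives every decode route one shared exit, which is the shape the description says the optimizer reduces to a direct branch.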

Decompressor::new now also asserts symbols.len() == lengths.len(). Both tables are
indexed by the same code, so equal length is what lets the single code < n_symbols
bound make both get_unchecked calls sound.
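A minimal sketch of that invariant, using a hypothetical Tables type as a stand-in for the real Decompressor (the field types and the lookup method are assumptions, not the crate's API):

```rust
/// Hypothetical stand-in for the decompressor's two parallel tables.
struct Tables<'a> {
    symbols: &'a [u64],
    lengths: &'a [u8],
}

impl<'a> Tables<'a> {
    /// Establishes the invariant `symbols.len() == lengths.len()` once, up front.
    fn new(symbols: &'a [u64], lengths: &'a [u8]) -> Self {
        assert_eq!(
            symbols.len(),
            lengths.len(),
            "symbol and length tables must have equal length"
        );
        Self { symbols, lengths }
    }

    /// One bound check covers both tables because of the constructor invariant.
    fn lookup(&self, code: u8) -> Option<(u64, u8)> {
        let code = code as usize;
        if code >= self.symbols.len() {
            return None;
        }
        // SAFETY: code < symbols.len(), and new() guarantees
        // symbols.len() == lengths.len(), so both indices are in bounds.
        Some(unsafe {
            (
                *self.symbols.get_unchecked(code),
                *self.lengths.get_unchecked(code),
            )
        })
    }
}
```

Without the assertion in new, the single comparison in lookup would only justify the first get_unchecked.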

Why any_byte_ge is right

It's a branchless SWAR primitive using the hasmore/hasless bit-twiddling identities,
split at threshold 128 so each arm needs only one loop-invariant broadcast constant
(keeping register pressure low). It is covered by an exhaustive unit test that compares
it against a scalar reference over every threshold in 0..=257 crossed with uniform blocks
for all 256 byte values and single-notable-byte blocks in every lane position. If the SWAR
ever disagreed with "does any byte reach the threshold", that test fails — which is exactly
the property the soundness argument relies on (it must never report a block as valid when it
contains an out-of-range code).
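For readers unfamiliar with the identities, here is one way such a primitive can be written. It is a sketch built from the standard hasmore/hasless formulas, not a copy of the function in this PR: the usize threshold type, the constant names, and the exact arm boundaries are assumptions. The low arm has the same add / or / test-against-0x80 shape as the assembly quoted under Verification.

```rust
/// 0x01 in every byte lane; multiplying by a value below 256 broadcasts it.
const LOW: u64 = 0x0101_0101_0101_0101;
/// The top bit of every byte lane.
const HIGH: u64 = 0x8080_8080_8080_8080;

/// True if any of the eight bytes in `block` is >= `threshold`.
fn any_byte_ge(block: u64, threshold: usize) -> bool {
    match threshold {
        // Every byte is >= 0.
        0 => true,
        // hasmore: adding (128 - t) to a byte below 128 sets its top bit iff
        // byte >= t; bytes already >= 128 are caught by OR-ing in `block`.
        1..=128 => {
            let bias = LOW * ((128 - threshold) as u64);
            ((block.wrapping_add(bias) | block) & HIGH) != 0
        }
        // hasless on the complement: byte >= t  <=>  !byte < 256 - t.
        129..=255 => {
            let n = LOW * ((256 - threshold) as u64);
            ((!block).wrapping_sub(n) & block & HIGH) != 0
        }
        // No byte can reach 256 or more.
        _ => false,
    }
}

/// Scalar reference used to cross-check the SWAR version.
fn any_byte_ge_scalar(block: u64, threshold: usize) -> bool {
    block.to_le_bytes().iter().any(|&b| b as usize >= threshold)
}
```

Carries and borrows between lanes can only originate in a lane that already satisfies the predicate, so they may change which lane is flagged but never the any-byte answer. Comparing the two functions over thresholds 0..=257, uniform blocks, and single-byte blocks in each lane reproduces the style of exhaustive test described above.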

Verification

  • cargo asm (ARM64/Apple Silicon) — the hot-loop check vectorizes as intended: all
    eight codes in ~3 ALU ops + a test, one hoisted constant, then a branch to the cold panic
    site:
    add x2, x17, x11            ; block + bias   (bias hoisted out of the loop)
    orr x2, x2, x17             ; | block
    tst x2, #0x8080808080808080 ; & HIGH
    b.eq <stores>               ; valid -> proceed, else -> panic
  • Miri — clean on all four invalid-code routes plus the roundtrips. The cases that were
    previously UB now panic with no UB reported.
  • Tests — exhaustive any_byte_ge unit test; four integration tests that drive the panic
    through each decode route (fast loop, escape-prefix, byte loop, tail loop); all pre-existing
    roundtrip tests; clippy (-D warnings) and rustfmt clean.

Performance (clean A/B vs develop, criterion baselines)

  • Normal trained-table decompression (cf8): ~+1–3%, close to the noise floor. This is
    the genuine cost of adding a per-block validation to the hot path.
  • All-escape (empty symbol table, pathological): ~+13%.

The all-escape regression is not the check's runtime cost — that regime never executes the
check (every byte is an escape). The larger loop body leads LLVM to factor the loop latch into
a shared block (one extra unconditional jump per iteration) instead of develop's inline,
fused cmp/ccmp/b.lo latch. It's a code-layout artifact. I tried four ways to recover it
(drop the prefix check, #[inline(never)], a 1-constant SWAR, inline panic! in place of the
flag); all land at the same numbers, so it's inherent to growing the loop body. Accepted as
the cost of soundness on a degenerate regime — real, trained-table decompression is unaffected.

Reviewing the diff

src/lib.rs shows a large line count, but most of it is whitespace: the new
'decode { … } block re-indents the decode region. Use git diff -w to see the ~40 lines of
actual logic change.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
codspeed-hq Bot commented Jun 24, 2026

Merging this PR will degrade performance by 13.73%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

❌ 1 regressed benchmark
✅ 29 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode        Benchmark              BASE      HEAD      Efficiency
Simulation  decompress-into-reuse  834.4 ns  967.2 ns  -13.73%


Comparing aduffy/soundness (2caf476) with develop (240b7b0)
