perf(sort): reduce peak memory usage to be closer to samtools

## Problem

`fgumi sort` uses significantly more peak memory than samtools for equivalent workloads, despite both tools using spill-to-disk external merge sort with the same memory limit (`--max-memory 800M`).

Benchmark data (60 M record BAM, `--max-memory 800M`, coordinate sort):

| Threads | fgumi peak RSS | samtools peak RSS |
|---------|---------------|------------------|
| t1      | 17.5 GB       | 1.1 GB           |
| t2      | 16.2 GB       | 1.8 GB           |
| t4      | 15.0 GB       | 3.7 GB           |
| t8      | 8.0 GB        | 7.2 GB           |

The same pattern holds for all four sort orders. fgumi respects the spill threshold correctly (output is verified correct) but retains far more memory than the limit implies.

## Root Causes Identified

### 1. Unbounded `BTreeMap` reorder buffer in `io_writer_loop`

`src/lib/sort/bgzf_io.rs` — `io_writer_loop` uses a `BTreeMap<u64, Vec<u8>>` to reorder out-of-order compressed blocks before writing. There is no bound on how many blocks can accumulate in this map. If compression workers produce blocks faster than the writer can flush them, the map grows without limit, each entry holding a full BGZF block (~64 KB). At high thread counts with many in-flight compress jobs this can hold hundreds of blocks simultaneously.

**Fix**: cap the reorder buffer depth (e.g. `num_workers * 2` outstanding blocks), or use a fixed-size ring buffer that back-pressures the compress queue when the reorder buffer is full.

### 2. `RecordBuffer` allocation not counted against `--max-memory`

The sort memory limit governs `TemplateRecordBuffer` / `RawRecordBuffer` (the in-memory sort buffer before spilling). However, several other large allocations are outside this accounting:
- Per-worker decompressed block queues (`ArrayQueue` of `Vec<u8>` blocks)
- Compressed block queues and in-flight compress job data
- The I/O writer's reorder `BTreeMap`
- The `PooledBamWriter` / `PooledChunkWriter` staging buffers

At t4 with `--max-memory 800M`, these untracked allocations can add several GB on top of the nominal limit.

**Fix**: include pool queue capacities and writer buffer sizes in the effective memory budget, or expose them as tunable parameters with sensible defaults.

### 3. `mimalloc` arena retention

fgumi uses `mimalloc` as its global allocator. mimalloc retains freed memory in per-thread arenas rather than returning it to the OS immediately. After a large spill phase (which allocates and frees many large `Vec<u8>` buffers), the RSS reported by `ps` remains elevated even though the logical heap is smaller. This inflates peak RSS measurements.

This is inherent to mimalloc's design and provides a real throughput benefit (fewer `mmap`/`munmap` calls). However it makes RSS comparisons with samtools (which uses the system allocator) misleading. The actual logical memory in use is lower than RSS suggests.

**Mitigation**: document the RSS inflation in `--help` and the README. Optionally add a `--allocator system` escape hatch for memory-constrained environments.

## How samtools stays low

samtools pre-allocates a single contiguous buffer at startup and packs all `bam1_t` records into it. Its memory usage is strictly `sizeof(bam1_t) + l_data` per record, with no fragmentation. fgumi's streaming architecture (individual `Vec<u8>` per record, pool queues, async compression) is harder to pre-allocate but gives better throughput — the memory overhead is the price of parallelism.

## Related

Identified during work on #237.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(sort): reduce peak memory usage to be closer to samtools #238

Problem

Root Causes Identified

1. Unbounded `BTreeMap` reorder buffer in `io_writer_loop`

2. `RecordBuffer` allocation not counted against `--max-memory`

3. `mimalloc` arena retention

How samtools stays low

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

perf(sort): reduce peak memory usage to be closer to samtools #238

Description

Problem

Root Causes Identified

1. Unbounded BTreeMap reorder buffer in io_writer_loop

2. RecordBuffer allocation not counted against --max-memory

3. mimalloc arena retention

How samtools stays low

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. Unbounded `BTreeMap` reorder buffer in `io_writer_loop`

2. `RecordBuffer` allocation not counted against `--max-memory`

3. `mimalloc` arena retention