Problem
fgumi sort uses significantly more peak memory than samtools for equivalent workloads, despite both tools using spill-to-disk external merge sort with the same memory limit (--max-memory 800M).
Benchmark data (60 M record BAM, --max-memory 800M, coordinate sort):
| Threads |
fgumi peak RSS |
samtools peak RSS |
| t1 |
17.5 GB |
1.1 GB |
| t2 |
16.2 GB |
1.8 GB |
| t4 |
15.0 GB |
3.7 GB |
| t8 |
8.0 GB |
7.2 GB |
The same pattern holds for all four sort orders. fgumi respects the spill threshold correctly (output is verified correct) but retains far more memory than the limit implies.
Root Causes Identified
1. Unbounded BTreeMap reorder buffer in io_writer_loop
src/lib/sort/bgzf_io.rs — io_writer_loop uses a BTreeMap<u64, Vec<u8>> to reorder out-of-order compressed blocks before writing. There is no bound on how many blocks can accumulate in this map. If compression workers produce blocks faster than the writer can flush them, the map grows without limit, each entry holding a full BGZF block (~64 KB). At high thread counts with many in-flight compress jobs this can hold hundreds of blocks simultaneously.
Fix: cap the reorder buffer depth (e.g. num_workers * 2 outstanding blocks), or use a fixed-size ring buffer that back-pressures the compress queue when the reorder buffer is full.
2. RecordBuffer allocation not counted against --max-memory
The sort memory limit governs TemplateRecordBuffer / RawRecordBuffer (the in-memory sort buffer before spilling). However, several other large allocations are outside this accounting:
- Per-worker decompressed block queues (
ArrayQueue of Vec<u8> blocks)
- Compressed block queues and in-flight compress job data
- The I/O writer's reorder
BTreeMap
- The
PooledBamWriter / PooledChunkWriter staging buffers
At t4 with --max-memory 800M, these untracked allocations can add several GB on top of the nominal limit.
Fix: include pool queue capacities and writer buffer sizes in the effective memory budget, or expose them as tunable parameters with sensible defaults.
3. mimalloc arena retention
fgumi uses mimalloc as its global allocator. mimalloc retains freed memory in per-thread arenas rather than returning it to the OS immediately. After a large spill phase (which allocates and frees many large Vec<u8> buffers), the RSS reported by ps remains elevated even though the logical heap is smaller. This inflates peak RSS measurements.
This is inherent to mimalloc's design and provides a real throughput benefit (fewer mmap/munmap calls). However it makes RSS comparisons with samtools (which uses the system allocator) misleading. The actual logical memory in use is lower than RSS suggests.
Mitigation: document the RSS inflation in --help and the README. Optionally add a --allocator system escape hatch for memory-constrained environments.
How samtools stays low
samtools pre-allocates a single contiguous buffer at startup and packs all bam1_t records into it. Its memory usage is strictly sizeof(bam1_t) + l_data per record, with no fragmentation. fgumi's streaming architecture (individual Vec<u8> per record, pool queues, async compression) is harder to pre-allocate but gives better throughput — the memory overhead is the price of parallelism.
Related
Identified during work on #237.
Problem
fgumi sortuses significantly more peak memory than samtools for equivalent workloads, despite both tools using spill-to-disk external merge sort with the same memory limit (--max-memory 800M).Benchmark data (60 M record BAM,
--max-memory 800M, coordinate sort):The same pattern holds for all four sort orders. fgumi respects the spill threshold correctly (output is verified correct) but retains far more memory than the limit implies.
Root Causes Identified
1. Unbounded
BTreeMapreorder buffer inio_writer_loopsrc/lib/sort/bgzf_io.rs—io_writer_loopuses aBTreeMap<u64, Vec<u8>>to reorder out-of-order compressed blocks before writing. There is no bound on how many blocks can accumulate in this map. If compression workers produce blocks faster than the writer can flush them, the map grows without limit, each entry holding a full BGZF block (~64 KB). At high thread counts with many in-flight compress jobs this can hold hundreds of blocks simultaneously.Fix: cap the reorder buffer depth (e.g.
num_workers * 2outstanding blocks), or use a fixed-size ring buffer that back-pressures the compress queue when the reorder buffer is full.2.
RecordBufferallocation not counted against--max-memoryThe sort memory limit governs
TemplateRecordBuffer/RawRecordBuffer(the in-memory sort buffer before spilling). However, several other large allocations are outside this accounting:ArrayQueueofVec<u8>blocks)BTreeMapPooledBamWriter/PooledChunkWriterstaging buffersAt t4 with
--max-memory 800M, these untracked allocations can add several GB on top of the nominal limit.Fix: include pool queue capacities and writer buffer sizes in the effective memory budget, or expose them as tunable parameters with sensible defaults.
3.
mimallocarena retentionfgumi uses
mimallocas its global allocator. mimalloc retains freed memory in per-thread arenas rather than returning it to the OS immediately. After a large spill phase (which allocates and frees many largeVec<u8>buffers), the RSS reported bypsremains elevated even though the logical heap is smaller. This inflates peak RSS measurements.This is inherent to mimalloc's design and provides a real throughput benefit (fewer
mmap/munmapcalls). However it makes RSS comparisons with samtools (which uses the system allocator) misleading. The actual logical memory in use is lower than RSS suggests.Mitigation: document the RSS inflation in
--helpand the README. Optionally add a--allocator systemescape hatch for memory-constrained environments.How samtools stays low
samtools pre-allocates a single contiguous buffer at startup and packs all
bam1_trecords into it. Its memory usage is strictlysizeof(bam1_t) + l_dataper record, with no fragmentation. fgumi's streaming architecture (individualVec<u8>per record, pool queues, async compression) is harder to pre-allocate but gives better throughput — the memory overhead is the price of parallelism.Related
Identified during work on #237.