perf(sort): reduce peak memory usage to be closer to samtools #238

@nh13

Problem

fgumi sort uses significantly more peak memory than samtools for equivalent workloads, despite both tools using spill-to-disk external merge sort with the same memory limit (--max-memory 800M).

Benchmark data (60 M record BAM, --max-memory 800M, coordinate sort):

Threads   fgumi peak RSS   samtools peak RSS
t1        17.5 GB          1.1 GB
t2        16.2 GB          1.8 GB
t4        15.0 GB          3.7 GB
t8         8.0 GB          7.2 GB

The same pattern holds for all four sort orders. fgumi respects the spill threshold correctly (output is verified correct) but retains far more memory than the limit implies.

Root Causes Identified

1. Unbounded BTreeMap reorder buffer in io_writer_loop

In src/lib/sort/bgzf_io.rs, io_writer_loop uses a BTreeMap<u64, Vec<u8>> to reorder out-of-order compressed blocks before writing. Nothing bounds the number of blocks that can accumulate in this map: if compression workers produce blocks faster than the writer can flush them, the map grows without limit, with each entry holding a full BGZF block (~64 KB). At high thread counts with many in-flight compress jobs, it can hold hundreds of blocks simultaneously.

Fix: cap the reorder buffer depth (e.g. num_workers * 2 outstanding blocks), or use a fixed-size ring buffer that back-pressures the compress queue when the reorder buffer is full.
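
A minimal sketch of the capped reorder buffer, assuming each compressed block carries a unique, gap-free sequence number. BoundedReorder, push, and pop_next are hypothetical names for illustration, not existing fgumi APIs. The cap is expressed as a sequence window rather than a raw entry count, so the block the writer is waiting for is always admissible and workers holding later blocks cannot deadlock it:

```rust
use std::collections::BTreeMap;
use std::sync::{Condvar, Mutex};

/// Hypothetical bounded replacement for an unbounded reorder map.
pub struct BoundedReorder {
    state: Mutex<State>,
    changed: Condvar,
    window: u64, // max distance a block may run ahead of the writer
}

struct State {
    next_seq: u64,                   // sequence number the writer needs next
    pending: BTreeMap<u64, Vec<u8>>, // out-of-order blocks awaiting their turn
}

impl BoundedReorder {
    pub fn new(window: u64) -> Self {
        assert!(window > 0, "window must admit at least the next block");
        Self {
            state: Mutex::new(State { next_seq: 0, pending: BTreeMap::new() }),
            changed: Condvar::new(),
            window,
        }
    }

    /// Called by compression workers. Blocks (back-pressure) while `seq`
    /// lies outside [next_seq, next_seq + window).
    pub fn push(&self, seq: u64, block: Vec<u8>) {
        let mut st = self.state.lock().unwrap();
        while seq >= st.next_seq + self.window {
            st = self.changed.wait(st).unwrap();
        }
        st.pending.insert(seq, block);
        self.changed.notify_all();
    }

    /// Called by the writer. Blocks until the next in-order block arrives.
    pub fn pop_next(&self) -> Vec<u8> {
        let mut st = self.state.lock().unwrap();
        loop {
            let next = st.next_seq;
            if let Some(block) = st.pending.remove(&next) {
                st.next_seq += 1;
                self.changed.notify_all(); // the window moved: wake blocked workers
                return block;
            }
            st = self.changed.wait(st).unwrap();
        }
    }

    /// Number of blocks currently held; never exceeds `window`.
    pub fn pending_len(&self) -> usize {
        self.state.lock().unwrap().pending.len()
    }
}
```

With a window of num_workers * 2 and ~64 KB blocks, the reorder buffer would hold at most window * 64 KB regardless of how far compression runs ahead of the writer.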

2. Allocations outside the record buffers not counted against --max-memory

The sort memory limit governs TemplateRecordBuffer / RawRecordBuffer (the in-memory sort buffer before spilling). However, several other large allocations are outside this accounting:

  • Per-worker decompressed block queues (ArrayQueue of Vec<u8> blocks)
  • Compressed block queues and in-flight compress job data
  • The I/O writer's reorder BTreeMap
  • The PooledBamWriter / PooledChunkWriter staging buffers

At t4 with --max-memory 800M, these untracked allocations can add several GB on top of the nominal limit.

Fix: include pool queue capacities and writer buffer sizes in the effective memory budget, or expose them as tunable parameters with sensible defaults.
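
One way to fold these allocations into the budget is to subtract their worst-case footprint from --max-memory before sizing the sort buffer. The sketch below is illustrative only: PipelineBuffers, its fields, and sort_buffer_budget are hypothetical, and it assumes every queue slot holds at most one block of block_bytes.

```rust
/// Hypothetical description of the pipeline's fixed-capacity buffers.
#[derive(Debug, Clone, Copy)]
pub struct PipelineBuffers {
    pub workers: usize,
    pub decompressed_queue_slots: usize, // per worker
    pub compressed_queue_slots: usize,   // shared, includes in-flight compress jobs
    pub reorder_window: usize,           // max blocks held by the I/O writer
    pub writer_staging_bytes: usize,     // writer staging buffers, in bytes
    pub block_bytes: usize,              // worst-case size of one block
}

impl PipelineBuffers {
    /// Worst-case bytes held outside the sort buffer.
    pub fn overhead_bytes(&self) -> usize {
        let queued_blocks = self.workers * self.decompressed_queue_slots
            + self.compressed_queue_slots
            + self.reorder_window;
        queued_blocks * self.block_bytes + self.writer_staging_bytes
    }
}

/// Bytes left for the in-memory sort buffer once pipeline overhead is
/// charged against the user's limit. Errors if nothing would be left.
pub fn sort_buffer_budget(max_memory: usize, buffers: &PipelineBuffers) -> Result<usize, String> {
    let overhead = buffers.overhead_bytes();
    max_memory
        .checked_sub(overhead)
        .filter(|&budget| budget > 0)
        .ok_or_else(|| {
            format!("max memory {max_memory} B does not cover pipeline overhead {overhead} B")
        })
}
```

Failing fast when the limit cannot cover the fixed overhead seems preferable to silently exceeding it, but the thresholds and defaults are a design decision for the maintainers.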

3. mimalloc arena retention

fgumi uses mimalloc as its global allocator. mimalloc retains freed memory in per-thread arenas rather than returning it to the OS immediately. After a large spill phase (which allocates and frees many large Vec<u8> buffers), the RSS reported by ps remains elevated even though the logical heap is smaller. This inflates peak RSS measurements.

This is inherent to mimalloc's design and provides a real throughput benefit (fewer mmap/munmap calls). However, it makes RSS comparisons with samtools (which uses the system allocator) misleading: the logical memory actually in use is lower than RSS suggests.

Mitigation: document the RSS inflation in --help and the README. Optionally add a --allocator system escape hatch for memory-constrained environments.
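
To quantify the gap between logical heap size and RSS, a counting wrapper around the global allocator can report live and peak requested bytes for comparison against the RSS shown by ps. This is a diagnostic sketch, not part of fgumi: CountingAlloc, live_bytes, and peak_bytes are hypothetical names, and it wraps the standard library's System allocator so that it is self-contained. In fgumi, the wrapper would presumably delegate to the mimalloc allocator instead.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical diagnostic allocator: forwards to `System` and tracks
/// how many bytes the program has requested and not yet freed.
pub struct CountingAlloc;

static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(live, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
    }
    // `alloc_zeroed` and `realloc` use the trait's default implementations,
    // which route through `alloc`/`dealloc` above and are therefore counted.
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Bytes currently requested from the allocator (logical heap size).
pub fn live_bytes() -> usize {
    LIVE.load(Ordering::Relaxed)
}

/// Highest value `live_bytes` has reached so far.
pub fn peak_bytes() -> usize {
    PEAK.load(Ordering::Relaxed)
}
```

Logging peak_bytes() at exit next to the measured peak RSS would show how much of the difference from samtools comes from allocator retention and how much from live allocations.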

How samtools stays low

samtools pre-allocates a single contiguous buffer at startup and packs all bam1_t records into it. Its memory usage is strictly sizeof(bam1_t) + l_data per record, with no fragmentation. fgumi's streaming architecture (individual Vec<u8> per record, pool queues, async compression) is harder to pre-allocate but gives better throughput — the memory overhead is the price of parallelism.
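
For comparison, the contiguous-buffer layout described above can be sketched in Rust as a single preallocated byte arena plus an index of (offset, length) spans. This is not samtools' code and not an existing fgumi type: RecordArena and its methods are hypothetical, and the records are opaque byte slices rather than real BAM records.

```rust
/// Hypothetical arena that packs all records into one allocation.
pub struct RecordArena {
    data: Vec<u8>,              // single contiguous buffer holding every record
    spans: Vec<(usize, usize)>, // (offset, len) of each record within `data`
    limit: usize,               // byte budget for `data`
}

impl RecordArena {
    /// Preallocates `limit` bytes up front; `data` never grows beyond that.
    pub fn with_capacity(limit: usize) -> Self {
        Self { data: Vec::with_capacity(limit), spans: Vec::new(), limit }
    }

    /// Appends a record. Returns false without storing it when it would not
    /// fit, which is the caller's signal to sort, spill, and clear.
    pub fn try_push(&mut self, record: &[u8]) -> bool {
        if self.data.len() + record.len() > self.limit {
            return false;
        }
        let offset = self.data.len();
        self.data.extend_from_slice(record);
        self.spans.push((offset, record.len()));
        true
    }

    /// Sorts the index only; record bytes are never moved or reallocated.
    pub fn sort_by_key<K: Ord>(&mut self, key: impl Fn(&[u8]) -> K) {
        let data = &self.data;
        self.spans.sort_by_key(|&(off, len)| key(&data[off..off + len]));
    }

    /// Iterates over the records in index order.
    pub fn iter(&self) -> impl Iterator<Item = &[u8]> + '_ {
        self.spans.iter().map(move |&(off, len)| &self.data[off..off + len])
    }

    /// Resets the arena for the next chunk while keeping its allocations.
    pub fn clear(&mut self) {
        self.data.clear();
        self.spans.clear();
    }
}
```

The span index is extra memory on top of the byte budget (16 bytes per record on a 64-bit target), so a real implementation would need to charge it against the limit as well.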

Related

Identified during work on #237.
