
Optimize FastxZipped.__next__ to use a single pass over records #281

@msto

Summary

FastxZipped.__next__ currently makes multiple passes over records on every call:

  1. all(record is None for record in records) — check if all exhausted
  2. all_not_none(records) — check if any are None (truncation)
  3. [record.name for record in records] — extract names
  4. [self._name_minus_ordinal(name) for name in ...] — strip ordinals
  5. set(record_names) — check uniqueness

Because this runs once for every record, the repeated passes may become a performance concern for large FASTX files with many read groups.

These could be consolidated into a single loop that:

  • Detects None records (truncation or exhaustion)
  • Extracts and validates names in one pass
  • Short-circuits on name mismatch
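A minimal sketch of what such a single loop might look like. It is not the library's code: `Record`, `name_minus_ordinal`, and `check_records` are hypothetical stand-ins, and the ordinal convention (a trailing `/1`, `/2`, ...) and the error messages are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a FASTX record; only `name` matters here."""

    name: str


def name_minus_ordinal(name: str) -> str:
    """Strip a trailing "/N" ordinal (assumed convention), e.g. "r1/2" -> "r1"."""
    return name[:-2] if len(name) >= 2 and name[-2] == "/" else name


def check_records(records: Sequence[Optional[Record]]) -> Optional[Tuple[Record, ...]]:
    """Validate one row of zipped records in a single pass.

    Returns None when every input is exhausted, otherwise the validated tuple.
    Raises ValueError on a name mismatch or when only some inputs are exhausted.
    """
    expected: Optional[str] = None  # base name of the first non-None record
    seen_none = False
    validated: List[Record] = []

    for record in records:
        if record is None:
            seen_none = True
            continue
        base = name_minus_ordinal(record.name)
        if expected is None:
            expected = base
        elif base != expected:
            # Short-circuit: no need to look at the remaining records.
            raise ValueError(f"Record names do not match: {expected!r} vs {base!r}")
        validated.append(record)

    if expected is None:
        return None  # every input returned None: normal exhaustion
    if seen_none:
        raise ValueError(f"One or more inputs is truncated at record {expected!r}")
    return tuple(validated)
```

One design point to settle: because the loop short-circuits on a mismatch, a row that is both truncated and mismatched may raise the mismatch error here, whereas the multi-pass order listed above checks truncation first. If the truncation error must take precedence, the mismatch would need to be recorded and raised only after the loop, which gives up the short-circuit.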

Context

Raised in #259 (review comment by @nh13): #259 (comment)

Notes

The number of elements in records is bounded by the number of FASTX files being zipped (typically 2–4), so the constant factor may be negligible in practice. Worth benchmarking before optimizing.
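One way to run that benchmark is a small `timeit` comparison on the happy path (all records present, names matching). The sketch below is illustrative only: `Record`, `strip_ordinal`, `multi_pass`, `single_pass`, and `benchmark` are hypothetical names, and `multi_pass` merely mirrors the steps listed in the summary rather than reproducing the actual method.

```python
import timeit
from collections import namedtuple

# Hypothetical stand-in for a FASTX record; only `name` is used.
Record = namedtuple("Record", ["name"])


def strip_ordinal(name):
    """Strip a trailing "/N" ordinal (assumed convention)."""
    return name[:-2] if len(name) >= 2 and name[-2] == "/" else name


def multi_pass(records):
    """Mirror of the multi-pass checks described in the summary."""
    if all(r is None for r in records):
        return None
    if any(r is None for r in records):
        raise ValueError("truncated")
    names = [strip_ordinal(r.name) for r in records]
    if len(set(names)) != 1:
        raise ValueError("name mismatch")
    return records


def single_pass(records):
    """Candidate single-loop replacement."""
    expected = None
    seen_none = False
    for r in records:
        if r is None:
            seen_none = True
            continue
        base = strip_ordinal(r.name)
        if expected is None:
            expected = base
        elif base != expected:
            raise ValueError("name mismatch")
    if expected is None:
        return None
    if seen_none:
        raise ValueError("truncated")
    return records


def benchmark(n_files=2, number=100_000):
    """Time both variants on one row of matching records; returns seconds per variant."""
    records = tuple(Record(f"read1/{i + 1}") for i in range(n_files))
    return {
        fn.__name__: timeit.timeit(lambda: fn(records), number=number)
        for fn in (multi_pass, single_pass)
    }


if __name__ == "__main__":
    for n_files in (2, 4):
        print(n_files, benchmark(n_files))
```

This isolates the validation logic only. In the real iterator, the cost of reading and parsing each record would be included in every call and may well dominate, so an end-to-end timing on representative files would be the more meaningful measurement.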
