@IvaTutis IvaTutis commented Dec 10, 2025

A FASTA reader that lets the user read sequences and either stream slices of them or get those slices as strings.

Currently measured at about 2.8 s to process a 2 GB FASTA file containing a single entry with one large sequence. That figure covers reading, parsing, indexing, and reporting malformed input.
Getting small slices of the sequence is almost instantaneous. Streaming the entire sequence takes longer, but around 3 s is to be expected.
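
To illustrate why small slices can be cheap once an index exists, here is a minimal sketch of the offset arithmetic. It assumes every sequence line except the last has the same width, which this PR does not state, and every name in it (`FixedWidthSliceSketch`, `byteOffset`, `slice`) is hypothetical rather than taken from the PR. It works on an in-memory byte array to stay short; a real reader would seek in the file instead.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; none of these names come from the PR. */
final class FixedWidthSliceSketch {

    /**
     * Byte offset of the 0-based base {@code pos}.
     * Assumption: every line holds basesPerLine bases and occupies
     * bytesPerLine bytes (bases plus line terminator).
     */
    static long byteOffset(long sequenceStart, int basesPerLine, int bytesPerLine, long pos) {
        long line = pos / basesPerLine;    // full lines before the target base
        long column = pos % basesPerLine;  // position inside its line
        return sequenceStart + line * bytesPerLine + column;
    }

    /** Returns bases [from, to), skipping the line terminators in between. */
    static String slice(byte[] file, long sequenceStart, int basesPerLine,
                        int bytesPerLine, long from, long to) {
        StringBuilder out = new StringBuilder((int) (to - from));
        long pos = from;
        while (pos < to) {
            long offset = byteOffset(sequenceStart, basesPerLine, bytesPerLine, pos);
            // Copy to the end of the current line or the end of the slice, whichever comes first.
            long run = Math.min(basesPerLine - pos % basesPerLine, to - pos);
            out.append(new String(file, (int) offset, (int) run, StandardCharsets.US_ASCII));
            pos += run;
        }
        return out.toString();
    }
}
```

With this layout, a slice costs one offset computation plus a read proportional to the slice length, independent of the total sequence size.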

The reader is used through the interface in FastaFileService. Usage examples are provided in FastaFileServiceIntegrationTest ("integration" because I am effectively testing FastaFileService together with the underlying SequentialFastaFileReader, the SequenceIndexBuilders, and the rest).

Requesting feedback on:

suggestions for a better interface in FastaFileService
overall structure
suggestions for optimizations
any bugs you spot

The spec for the file I am reading is provided here.

I am currently aware of the following potential bugs and open items:

need to handle leading/trailing Ns that span multiple lines
need to check whether the JSON in the header can be followed directly by the sequence, with no newline character in between
buffer sizes should probably move to a constants file or appsettings
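
For the first item, a minimal sketch of counting a leading run of Ns that continues across line breaks. It uses a plain list of strings as a stand-in for the sequence lines, since the PR's LineEntry type is not shown here, and the class and method names are hypothetical.

```java
import java.util.List;

/** Hypothetical sketch; the PR's own line representation (LineEntry) is not used here. */
final class LeadingNsSketch {

    static boolean isN(char c) {
        return c == 'N' || c == 'n';
    }

    /** Counts the unbroken run of N/n at the start of the sequence, continuing across lines. */
    static long countLeadingNs(List<String> sequenceLines) {
        long count = 0;
        for (String line : sequenceLines) {
            int i = 0;
            while (i < line.length() && isN(line.charAt(i))) {
                i++;
            }
            count += i;
            if (i < line.length()) {
                break; // reached a non-N base, so the run ends on this line
            }
            // The whole line was N (or empty): keep going on the next line.
        }
        return count;
    }
}
```

The same loop, run from the last line towards the first and scanning each line right to left, would give the trailing count.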

}

/** (4) count 'N'/'n' at the tail of the last sequence line only. */
private long countTrailingNs(LineEntry line) throws IOException {

Should we read from the end of the line to get the trailing N's?
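
One way this could look, as a rough sketch only: read fixed-size chunks backwards from the end of the sequence region and stop at the first byte that is neither N/n nor a line terminator. The signature below is hypothetical and differs from the `countTrailingNs(LineEntry)` in the diff. It also skips CR/LF, so unlike the "last sequence line only" behaviour described in the Javadoc, the run may span several lines.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical sketch of a backwards scan; not the PR's implementation. */
final class TrailingNsSketch {

    /**
     * Counts N/n bytes immediately before endExclusive, scanning backwards
     * and never reading below startInclusive. CR and LF are skipped.
     */
    static long countTrailingNs(RandomAccessFile file, long startInclusive,
                                long endExclusive, int chunkSize) throws IOException {
        byte[] chunk = new byte[chunkSize];
        long count = 0;
        long pos = endExclusive;
        while (pos > startInclusive) {
            int len = (int) Math.min(chunkSize, pos - startInclusive);
            pos -= len;
            file.seek(pos);
            file.readFully(chunk, 0, len);
            for (int i = len - 1; i >= 0; i--) {
                byte b = chunk[i];
                if (b == '\n' || b == '\r') {
                    continue; // line terminators do not break the run
                }
                if (b == 'N' || b == 'n') {
                    count++;
                    continue;
                }
                return count; // first real base ends the run
            }
        }
        return count; // the whole region was N or line terminators
    }
}
```

The cost is proportional to the length of the trailing run rather than to the length of the line or sequence.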

@Rajkumar-D Rajkumar-D left a comment

I have left some initial comments on this PR.


3 participants