
VDBManagerPathType() returns kptNotFound under stress #25

@jgans

I have been benchmarking the loading of SRA records on AWS, using the VDB API to stream sequence data (no quality or other info). Similar to the fasterq-dump strategy, I am attempting to read each SRA record in parallel, but using the Message Passing Interface (MPI) instead of just threads. Each MPI rank opens and reads a non-overlapping slice of an SRA record.
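For readers who want a concrete picture of the slicing, here is a minimal sketch of one way to split a contiguous row range evenly across ranks. It is not the code used for the benchmark: `row_slice` and `slice_for_rank` are hypothetical names, and the sketch assumes the rank and rank count come from `MPI_Comm_rank` and `MPI_Comm_size`, and that the first row ID and row count have already been queried from the opened table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a contiguous block of rows owned by one rank. */
typedef struct {
    int64_t  first;  /* first row ID in this rank's slice */
    uint64_t count;  /* number of rows in this rank's slice */
} row_slice;

/*
 * Split total_rows rows, starting at first_row, into nranks contiguous,
 * non-overlapping slices. The first (total_rows % nranks) ranks each get
 * one extra row, so slice sizes differ by at most one.
 */
row_slice slice_for_rank(int64_t first_row, uint64_t total_rows,
                         int rank, int nranks)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);

    uint64_t base  = total_rows / (uint64_t)nranks;
    uint64_t extra = total_rows % (uint64_t)nranks;
    uint64_t r     = (uint64_t)rank;

    /* Rows owned by lower-numbered ranks: r full shares plus the extras. */
    uint64_t offset = r * base + (r < extra ? r : extra);

    row_slice s;
    s.first = first_row + (int64_t)offset;
    s.count = base + (r < extra ? 1u : 0u);
    return s;
}
```

With 10 rows starting at row 1 and 3 ranks, this assigns rows 1-4, 5-7, and 8-10, so every row is read exactly once.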

When the number of parallel MPI ranks grows beyond about 32, I've noticed that VDBManagerPathType() starts returning kptNotFound for about 10% of the MPI processes. I've been able to work around this by retrying the call to VDBManagerPathType() after waiting 5 seconds. Is there a good way to read an SRA record in parallel, ideally using hundreds of independent but concurrent processes? I am interested in extracting reads from an SRA file as fast as AWS will allow.
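The retry workaround can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact code in use: it swaps the fixed 5-second wait for a capped exponential backoff, and it keeps the VDB call behind a hypothetical callback (`probe_fn`) so the sketch compiles without the VDB headers. `PROBE_NOT_FOUND` stands in for kptNotFound, and `probe_with_retry` and `flaky_probe` are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Stand-in for the "not found" result (the role kptNotFound plays in VDB). */
enum { PROBE_NOT_FOUND = 0 };

/* Hypothetical callback: performs one path-type query, e.g. by wrapping a
 * single call to VDBManagerPathType(), and returns the resulting type. */
typedef int (*probe_fn)(void *ctx);

/* Sleep for the given number of milliseconds (POSIX nanosleep). */
static void sleep_ms(long ms)
{
    struct timespec ts;
    ts.tv_sec  = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/*
 * Call probe() up to max_attempts times, doubling the wait between attempts
 * (capped at max_delay_ms) while it keeps reporting "not found".
 * Returns the last probe result; *attempts_used (if non-NULL) receives the
 * number of calls made.
 */
int probe_with_retry(probe_fn probe, void *ctx, int max_attempts,
                     long base_delay_ms, long max_delay_ms,
                     int *attempts_used)
{
    assert(probe != NULL && max_attempts > 0);

    long delay  = base_delay_ms;
    int  result = PROBE_NOT_FOUND;

    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        result = probe(ctx);
        if (attempts_used != NULL)
            *attempts_used = attempt;
        if (result != PROBE_NOT_FOUND)
            break;                        /* path resolved, stop retrying */
        if (attempt < max_attempts) {
            sleep_ms(delay);
            delay = (delay * 2 > max_delay_ms) ? max_delay_ms : delay * 2;
        }
    }
    return result;
}

/* Demo stand-in for a flaky lookup: reports "not found" until the counter
 * passed through ctx reaches zero, then returns a non-zero "found" value. */
static int flaky_probe(void *ctx)
{
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        --*remaining_failures;
        return PROBE_NOT_FOUND;
    }
    return 1;
}
```

In a real run the callback would wrap the VDBManagerPathType() call for the accession in question, and each rank would treat a final "not found" as a hard error. Whether backoff is preferable to a fixed wait here is itself an assumption; it only spreads the retries out, it does not address why the lookup fails.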

I was assuming that the data is stored in an S3 bucket and that parallel access would be okay. I'm not exactly sure where the data is being stored, since the srapath command returns:
https://locate.ncbi.nlm.nih.gov/sdlr/sdlr.fcgi?jwt=<long string of characters removed>.
