-
Notifications
You must be signed in to change notification settings - Fork 42
Description
I have been bench marking the loading of SRA records using the VDB API to stream sequence data (no quality or other info) on AWS. Similar to the fasterq-dump strategy, I am attempting to read each SRA record in parallel, but using the Message Passing Interface (MPI) instead of just threads. Each MPI rank opens and reads a non-overlapping slice of an SRA record.
For a number of parallel MPI ranks gets larger than about 32, I've noticed that VDBManagerPathType() starts returning kptNotFound for about 10% of the MPI processes. I've been able to work around this by retrying the call to VDBManagerPathType() after waiting 5 seconds. Is there a good way to read an SRA record in parallel, ideally using 100's of independent, but concurrent, processes? I am interested in extracting reads from an SRA file as fast as AWS will allow.
I was assuming that the data is stored in an S3 bucket and that parallel access would be okay. I'm not exactly sure where the data is being stored, since the srapath command returns:
https://locate.ncbi.nlm.nih.gov/sdlr/sdlr.fcgi?jwt=<long string of characters removed>.