Best practices for batch processing SRA accessions on an HPC cluster

Dear NCBI SRA Support Team,

I am writing to ask for your guidance on the recommended best practices for a large-scale batch analysis.

I am working on a university HPC cluster and need to extract four specific genomic regions (using sam-dump --aligned-region) from approximately 1,000 different SRR accessions (e.g., SRR1127217) for a research project.

I want to ensure we do this in the most efficient and respectful way possible. I am considering two potential workflows:

1. Direct Remote Query: Running sam-dump --aligned-region in a parallel SBATCH array, which would make ~4,000 separate remote queries to your servers.

2. Prefetch First: Running prefetch on all 1,000 accessions first to download the .sra files locally, and then running our sam-dump --aligned-region script on the local files.
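
To make the prefetch-first option concrete, here is a minimal sketch of what I have in mind for a single accession. It only prints the commands rather than running them. The region strings and the SRA_DIR location are hypothetical placeholders, not our real targets, and I am assuming prefetch's default layout of <output-directory>/<accession>/<accession>.sra; please correct me if that assumption is wrong.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands for the "prefetch first" workflow.
# SRA_DIR and REGIONS are hypothetical placeholders, not our real values.
set -euo pipefail

SRA_DIR="${SRA_DIR:-./sra}"
REGIONS=("chr1:1000-2000" "chr2:3000-4000")   # placeholder regions

# Print one prefetch command, then one sam-dump command per region,
# pointing sam-dump at the local .sra file instead of the remote accession.
plan_accession() {
    local acc="$1"
    local region
    echo "prefetch --output-directory ${SRA_DIR} ${acc}"
    for region in "${REGIONS[@]}"; do
        # Assumes prefetch wrote ${SRA_DIR}/${acc}/${acc}.sra
        echo "sam-dump --aligned-region ${region} ${SRA_DIR}/${acc}/${acc}.sra"
    done
}

plan_accession SRR1127217
```

In the real run, each SBATCH array task would handle one accession from our list, so the only remote traffic would be one prefetch download per accession.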

Could you please confirm which of these is the correct and recommended workflow? We want to follow the proper procedure to avoid causing unnecessary load on your servers and to prevent our cluster's IP from being rate-limited or blocked.

Thank you for your time.

Best,
Tushar

Best practices for batch processing SRA accessions on an HPC cluster #1113
