Best practices for batch processing SRA accessions on an HPC cluster #1113

@Tushar-bioinfo

Dear NCBI SRA Support Team,

I am writing to ask for your guidance on best practices for a large-scale batch analysis.

I am working on a university HPC cluster and need to extract 4 specific genomic regions (using sam-dump --aligned-region) from approximately 1,000 different SRR accessions (e.g., SRR1127217) for a research project.

I want to ensure we do this in the most efficient and respectful way possible. I am considering two potential workflows:

1. Direct Remote Query: Running sam-dump --aligned-region in a parallel SBATCH array, which would make ~4,000 separate remote queries to your servers.

2. Prefetch First: Running prefetch on all 1,000 accessions first to download the .sra files locally, and then running our sam-dump --aligned-region script on the local files.
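
To make the two options concrete, here is a minimal Python sketch that only builds the command lines for each workflow and runs nothing. The region strings, the function names, and the `sra` output directory are hypothetical placeholders, not our real values. The sketch also assumes that prefetch writes each download to `<dir>/<accession>/<accession>.sra`, which may differ between toolkit versions.

```python
from pathlib import Path

# Hypothetical placeholder regions; the real four regions are not listed here.
REGIONS = ["chr1:1000-2000", "chr2:3000-4000", "chr3:5000-6000", "chr4:7000-8000"]


def remote_commands(accessions, regions):
    """Workflow 1: one remote sam-dump call per (accession, region) pair."""
    return [
        ["sam-dump", "--aligned-region", region, acc]
        for acc in accessions
        for region in regions
    ]


def prefetch_commands(accessions, regions, sra_dir):
    """Workflow 2: one prefetch per accession, then sam-dump on the local file."""
    commands = []
    for acc in accessions:
        # Assumed layout: <sra_dir>/<accession>/<accession>.sra
        local = Path(sra_dir) / acc / f"{acc}.sra"
        commands.append(["prefetch", "--output-directory", str(sra_dir), acc])
        commands.extend(
            ["sam-dump", "--aligned-region", region, str(local)]
            for region in regions
        )
    return commands


if __name__ == "__main__":
    # Print the commands for the example accession from this post.
    for cmd in prefetch_commands(["SRR1127217"], REGIONS, "sra"):
        print(" ".join(cmd))
```

With 1,000 accessions and four regions, `remote_commands` yields 4,000 commands that each contact the remote servers, while `prefetch_commands` yields 1,000 downloads followed by 4,000 purely local `sam-dump` calls.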

Could you please confirm which of these is the correct and recommended workflow? We want to follow the proper procedure to avoid causing unnecessary load on your servers and to prevent our cluster's IP from being rate-limited or blocked.

Thank you for your time.

Best,
Tushar
