Download UniProt databases

Gemy George Kaithakottil edited this page Jun 13, 2023 · 2 revisions

The download_from_uniprot script provided by eifunannot downloads the UniProt databases. It can also exclude a taxon ID (--exclude_taxon_id) and filter the database to remove incomplete sequences and add manually curated splice variants (--filter_database). Run the script on a workstation with internet access.

Example commands:

  • Download all 'flowering plants' (3398), exclude 'bread wheat' (4565), and filter the database [Recommended]
$ download_from_uniprot Magnoliopsida_NOT_Triticum_aestivum 3398 fasta --exclude_taxon_id 4565 --filter_database

# Output files are written to:
# - Uniprot_SwissProt_Magnoliopsida_NOT_Triticum_aestivum_3398_37537_22_05_23_0932.filtered.fasta
# - Uniprot_TrEMBL_Magnoliopsida_NOT_Triticum_aestivum_3398_13024003_22_05_23_0932.filtered.fasta
  • Download all 'Viridiplantae' (33090)
$ download_from_uniprot Viridiplantae 33090 fasta
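
After a download finishes, it can be useful to check how many sequences a FASTA output file contains. The sketch below counts header lines (lines starting with '>'). Note that count_fasta_records is a hypothetical helper name, not part of eifunannot, and the sample file holds made-up records for illustration only.

```shell
# Hypothetical helper (not part of eifunannot): count the records in a FASTA
# file by counting its header lines, i.e. lines that start with '>'.
count_fasta_records() {
    # grep -c prints 0 but exits with status 1 when nothing matches, so
    # '|| true' keeps the function usable in scripts that run under 'set -e'.
    grep -c '^>' "$1" || true
}

# Small illustrative FASTA file with two made-up records.
sample=$(mktemp)
printf '>seq1\nMKV\n>seq2\nMAA\nLLT\n' > "$sample"
count_fasta_records "$sample"   # prints 2
rm -f "$sample"
```

In practice, pass the path of a downloaded file, such as one of the .filtered.fasta files listed above, instead of the temporary sample.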

Usage:

$ download_from_uniprot -h
usage: download_from_uniprot [-h] [--exclude_taxon_id EXCLUDE_TAXON_ID]
                             [--filter_database] [--size SIZE] [--progress]
                             taxon_name taxon_id
                             {fasta,xlsx,xml,dat,txt,gff,list}

Script to download data from Swiss-Prot and Trembl databases in different formats

positional arguments:
  taxon_name            Give any name you like. Join the taxon name with underscore(s) if giving multiple words. Do NOT use quotes.
  taxon_id              UniProt Taxon ID is the most importand field. Get it from UniProt first and use it here NOT from NCBI taxonomy ID
  {fasta,xlsx,xml,dat,txt,gff,list}
                        Choose one of the output file format (default: fasta)

optional arguments:
  -h, --help            show this help message and exit
  --exclude_taxon_id EXCLUDE_TAXON_ID
                        Provide UniProt Taxon ID you want to exclude. Get it from UniProt first and use it here NOT from NCBI taxonomy ID
  --filter_database     Filter the database to remove incomplete sequences. If enabled the format of download will be in 'dat' format and filtered output will be in 'fasta' format (default: False)
  --size SIZE           Always use size 500 as this will provide fast performance (default: 500)
  --progress            Report progress for downloads. Only available for fasta downloads (default: False)

where,
[format]        - Choose one of the [format] below:
                  fasta: fasta
                  xlsx: Excel
                  xml: XML
                  dat|txt: Text
                  gff: GFF
                  list: List

    Note:

    50557 is the taxon for class Insecta
    33090 is the taxon for class Viridiplantae

    Example commands:
    download_from_uniprot Plasmodium_falciparum 5833 dat
    download_from_uniprot Plasmodium_falciparum 5833 txt
    download_from_uniprot Alveolata 33630 fasta

    # Download all 'flowering plants' (3398) but exclude 'bread wheat' (4565), plus filter database:
    download_from_uniprot Magnoliopsida_NOT_Triticum_aestivum 3398 fasta --exclude_taxon_id 4565 --filter_database

Contact:Gemy George Kaithakottil([email protected])
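
To fetch several taxa in one session, the command lines can be assembled in a small loop. The sketch below is a dry run that only prints the commands rather than running them. build_download_cmd is a hypothetical helper, not part of eifunannot, and it uses only the arguments shown in the help text above. Spaces in the name are replaced with underscores, as the taxon_name help requires.

```shell
# Hypothetical helper (not part of eifunannot): print a download_from_uniprot
# command for a taxon name, a UniProt taxon ID and an optional taxon ID to
# exclude. The body runs in a subshell so its variables do not leak out.
build_download_cmd() (
    name=$(printf '%s' "$1" | tr ' ' '_')   # join multi-word names with underscores
    taxon_id=$2
    exclude_id=${3:-}
    cmd="download_from_uniprot $name $taxon_id fasta"
    if [ -n "$exclude_id" ]; then
        cmd="$cmd --exclude_taxon_id $exclude_id"
    fi
    printf '%s\n' "$cmd"
)

# Dry run: one 'name|taxon_id|exclude_id' entry per line (the last field is
# optional). The IDs are the ones used in the examples on this page.
while IFS='|' read -r name taxon_id exclude_id; do
    build_download_cmd "$name" "$taxon_id" "$exclude_id"
done <<'EOF'
Plasmodium falciparum|5833
Magnoliopsida NOT Triticum aestivum|3398|4565
EOF
```

Printing the commands first makes it easy to review the taxon IDs before any download starts; once they look right, the printed lines can be run as they are.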