Download UniProt databases
Gemy George Kaithakottil edited this page Jun 13, 2023 · 2 revisions
To download the UniProt databases and use additional functions such as excluding a taxon ID (--exclude_taxon_id) or filtering the database to remove incomplete sequences and add manually curated splice variants (--filter_database), use the download_from_uniprot script provided by eifunannot. The script must be run on a workstation with internet access.
Example commands:
- Download all 'flowering plants' (3398) but exclude 'bread wheat' (4565), plus filter database [Recommended]
$ download_from_uniprot Magnoliopsida_NOT_Triticum_aestivum 3398 fasta --exclude_taxon_id 4565 --filter_database
# Output files are written to:
# - Uniprot_SwissProt_Magnoliopsida_NOT_Triticum_aestivum_3398_37537_22_05_23_0932.filtered.fasta
# - Uniprot_TrEMBL_Magnoliopsida_NOT_Triticum_aestivum_3398_13024003_22_05_23_0932.filtered.fasta
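A quick sanity check after a download is to count the FASTA records in each output file. The snippet below is a minimal sketch, not part of eifunannot; the function name count_fasta_records is hypothetical, and it only assumes that every sequence starts with a header line beginning with '>'.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with eifunannot):
# print "<record count><TAB><file name>" for each FASTA file given.
count_fasta_records() {
    local f n
    for f in "$@"; do
        # grep -c prints 0 but exits non-zero when nothing matches;
        # '|| true' keeps the loop going in that case.
        n=$(grep -c '^>' "$f" || true)
        printf '%s\t%s\n' "$n" "$f"
    done
}
```

For example, `count_fasta_records Uniprot_*.filtered.fasta` would print one line per filtered output file in the current directory.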
- Download all 'Viridiplantae' (33090)
$ download_from_uniprot Viridiplantae 33090 fasta
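When several taxa need to be downloaded, it can help to print the commands first and inspect them before anything is run. The following is a sketch under stated assumptions: the function name print_download_commands and the "name:taxon_id" pair format are illustrative and not part of eifunannot. Only the positional arguments and the --filter_database flag shown in the examples above are used.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print (do not run) one download_from_uniprot
# command per "name:taxon_id" pair passed as an argument.
print_download_commands() {
    local pair name id
    for pair in "$@"; do
        name=${pair%%:*}   # text before the first ':'
        id=${pair#*:}      # text after the first ':'
        printf 'download_from_uniprot %s %s fasta --filter_database\n' "$name" "$id"
    done
}

# Taxon IDs taken from the examples on this page.
print_download_commands Viridiplantae:33090 Alveolata:33630
```

Once the printed commands look right, they can be copied into a terminal or piped to bash.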
$ download_from_uniprot -h
usage: download_from_uniprot [-h] [--exclude_taxon_id EXCLUDE_TAXON_ID]
                             [--filter_database] [--size SIZE] [--progress]
                             taxon_name taxon_id
                             {fasta,xlsx,xml,dat,txt,gff,list}
Script to download data from Swiss-Prot and Trembl databases in different formats
positional arguments:
taxon_name Give any name you like. Join the taxon name with underscore(s) if giving multiple words. Do NOT use quotes.
taxon_id UniProt Taxon ID is the most important field. Get it from UniProt first and use it here NOT from NCBI taxonomy ID
{fasta,xlsx,xml,dat,txt,gff,list}
Choose one of the output file formats (default: fasta)
optional arguments:
-h, --help show this help message and exit
--exclude_taxon_id EXCLUDE_TAXON_ID
Provide UniProt Taxon ID you want to exclude. Get it from UniProt first and use it here NOT from NCBI taxonomy ID
--filter_database Filter the database to remove incomplete sequences. If enabled the format of download will be in 'dat' format and filtered output will be in 'fasta' format (default: False)
--size SIZE Always use size 500 as this will provide fast performance (default: 500)
--progress Report progress for downloads. Only available for fasta downloads (default: False)
where,
[format] - Choose one of the [format] below:
fasta: fasta
xlsx: Excel
xml: XML
dat|txt: Text
gff: GFF
list: List
Note:
50557 is the taxon ID for class Insecta
33090 is the taxon ID for kingdom Viridiplantae
Example commands:
download_from_uniprot Plasmodium_falciparum 5833 dat
download_from_uniprot Plasmodium_falciparum 5833 txt
download_from_uniprot Alveolata 33630 fasta
# Download all 'flowering plants' (3398) but exclude 'bread wheat' (4565), plus filter database:
download_from_uniprot Magnoliopsida_NOT_Triticum_aestivum 3398 fasta --exclude_taxon_id 4565 --filter_database
Contact: Gemy George Kaithakottil ([email protected])
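The help text asks for an underscore-joined taxon_name and a numeric taxon_id. A small pre-check along these lines can catch typos before a long download starts. This is a minimal sketch: both function names are hypothetical and not part of eifunannot, and the check cannot confirm that an ID actually exists in UniProt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a free-text taxon name into an underscore-joined label.
make_taxon_label() {
    # Join all arguments with spaces, replace spaces with '_',
    # and squeeze repeated '_' into a single one.
    printf '%s\n' "$*" | tr -s ' ' '_'
}

# Hypothetical helper: succeed only if the argument is a non-empty string of digits.
is_numeric_taxon_id() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

For example, `make_taxon_label Triticum aestivum` prints Triticum_aestivum, and `is_numeric_taxon_id 4565 && echo ok` prints ok.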