Genome download: which files to use? #53

@jjkoehorst

I am trying to figure out which settings and files to use to have the most complete and correct representation of a genome.

In the code I found the following types of output files:

REPLICON = 'assembled-molecule'
UNLOCALISED = 'unlocalised-scaffold'
UNPLACED = 'unplaced-scaffold'
PATCH = 'patch'

For example, I downloaded the genome GCA_000003215.1 with:

enaBrowserTools/python3/enaDataGet -f embl --wgs --extract-wgs --expanded GCA_000003215.1

This generates the following files:

-rw-r--r-- 1 root root 1746946 Oct 20 06:45 ABFD02.dat.gz
-rw-r--r-- 1 root root 5168 Oct 20 06:45 GCA_000003215.1.xml
-rw-r--r-- 1 root root 1242 Oct 20 06:45 GCA_000003215.1_sequence_report.txt
-rw-r--r-- 1 root root 5533183 Oct 20 06:45 assembled-molecule.dat
-rw-r--r-- 1 root root 0 Oct 20 06:45 wgs_scaffolds.dat

In this case I assume that assembled-molecule.dat is the most complete genome file?
It contains one chromosome with gaps of unknown size, while the gzipped file (ABFD02.dat.gz) contains the 31 contigs as separate entries.

Or would it be wiser to always use the gzipped file?
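
To make the comparison concrete, here is a minimal sketch for counting the entries in each downloaded file. It assumes only that the files are EMBL flat files, in which every entry begins with an `ID` line; the helper names `open_text` and `count_embl_records` are my own and are not part of enaBrowserTools.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading (hypothetical helper)."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="ascii", errors="replace")
    return open(path, "rt", encoding="ascii", errors="replace")


def count_embl_records(path):
    """Count EMBL flat-file entries; each entry starts with an 'ID' line."""
    with open_text(path) as handle:
        return sum(1 for line in handle if line.startswith("ID   "))
```

Calling `count_embl_records("ABFD02.dat.gz")` and `count_embl_records("assembled-molecule.dat")` should then show how many separate sequences each file holds. An empty file such as wgs_scaffolds.dat gives 0.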
