Description
I am trying to figure out which settings and files to use to have the most complete and correct representation of a genome.
In the code I found the following types of output files:
REPLICON = 'assembled-molecule'
UNLOCALISED = 'unlocalised-scaffold'
UNPLACED = 'unplaced-scaffold'
PATCH = 'patch'
I downloaded a genome, for example GCA_000003215.1, with:
enaBrowserTools/python3/enaDataGet -f embl --wgs --extract-wgs --expanded GCA_000003215.1
It generates the following files:
-rw-r--r-- 1 root root 1746946 Oct 20 06:45 ABFD02.dat.gz
-rw-r--r-- 1 root root 5168 Oct 20 06:45 GCA_000003215.1.xml
-rw-r--r-- 1 root root 1242 Oct 20 06:45 GCA_000003215.1_sequence_report.txt
-rw-r--r-- 1 root root 5533183 Oct 20 06:45 assembled-molecule.dat
-rw-r--r-- 1 root root 0 Oct 20 06:45 wgs_scaffolds.dat
In this case, is it safe to assume that assembled-molecule.dat is the most complete genome file?
It contains one chromosome with gaps of unknown size, while the gzipped file (ABFD02.dat.gz) contains the 31 contigs separately.
Or would it be wiser to always use the gzipped file?
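To compare the two files myself, I put together a rough sketch that counts the records in an EMBL flat file and sums the lengths declared on the `ID` lines. It is not part of enaBrowserTools, and the helper names (`open_text`, `summarize_embl`) are made up for illustration. It assumes every record starts with an `ID` line that ends in `<length> BP.`, and it reads `.gz` files directly. The chromosome total presumably includes whatever padding stands in for the gaps, so I would not expect the two sums to match exactly.

```python
import gzip
import re
import sys

# Assumption: the ID line of each EMBL record ends with "<length> BP."
ID_LENGTH = re.compile(r"(\d+)\s+BP\.\s*$")


def open_text(path):
    """Open a plain or gzip-compressed flat file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def summarize_embl(path):
    """Return (number of records, total declared length in bp)."""
    records = 0
    total_bp = 0
    with open_text(path) as handle:
        for line in handle:
            # Every EMBL record begins with an "ID" line.
            if line.startswith("ID   "):
                records += 1
                match = ID_LENGTH.search(line)
                if match:
                    total_bp += int(match.group(1))
    return records, total_bp


if __name__ == "__main__":
    # Summarize each file named on the command line.
    for path in sys.argv[1:]:
        count, length = summarize_embl(path)
        print(f"{path}: {count} record(s), {length} bp declared")
```

Saved as, say, `summarize_embl.py` (a hypothetical file name), it could be run as `python3 summarize_embl.py assembled-molecule.dat ABFD02.dat.gz`.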