Description
The questions below were raised by Pascal.
Hi,
As already discussed, there is little reason to work with the OSD raw reads (other than for technical or QC questions), so within the scope of the OSD analysis paper we should all mostly be using the workable reads (and assemblies, when available). At this point, though, I'm a little confused as to where the main workable reads may be hiding :)
Can I do a quick sanity check with you on the location of the OSD raw and workable read data, taking ERR770976 as an example?
The raw reads (after demultiplexing + adapter clipping, as defined on GitHub) are here:
https://www.ebi.ac.uk/ena/data/view/ERR770976
with two nice-looking paired-end (PE) FASTQ files:
ftp://ftp.sra.ebi.ac.uk/vol1/ERA413/ERA413491/fastq/OSD73_R2_shotgun_raw.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/ERA413/ERA413491/fastq/OSD73_R1_shotgun_raw.fastq.gz
containing 1,149,852 × 2 reads.
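For reference, read counts like this one can be verified directly from the compressed files, since a FASTQ record always spans exactly four lines. A minimal sketch (the file name here is a toy example, not the actual ENA path):

```python
import gzip

def count_fastq_reads(path):
    """Count reads in a (gzipped) FASTQ file: one record = 4 lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return sum(1 for _ in fh) // 4

# Demo on a tiny synthetic file; for the real data you would download
# the .fastq.gz files from the ENA FTP server first.
with gzip.open("toy.fastq.gz", "wt") as fh:
    for i in range(3):
        fh.write(f"@read{i}\nACGT\n+\nIIII\n")

print(count_fastq_reads("toy.fastq.gz"))  # → 3
```

Run against both R1 and R2 files, the two counts should match (here, 1,149,852 each).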
A certain flavor of workable reads (after quality trimming + length filtering, as defined on GitHub) is here:
https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667589/runs/ERR770976/results/versions/2.0
which provides a link to the raw reads above, and a link to "Processed nucleotide reads (FASTA)":
https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667589/runs/ERR770976/results/sequences/versions/2.0/export?contentType=text&exportValue=processedReads
which downloads a file whose name suggests merged reads:
ERR770976_MERGED_FASTQ_nt_reads.fasta
It contains 975,769 reads, which is not incompatible with the raw PE read count above.
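The FASTA count can be sanity-checked the same way, by counting header lines; a minimal sketch (again with an illustrative toy file, not the actual download):

```python
def count_fasta_reads(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as fh:
        return sum(1 for line in fh if line.startswith(">"))

# Demo on a tiny synthetic FASTA file.
with open("toy.fasta", "w") as fh:
    fh.write(">merged_read1\nACGTACGT\n>merged_read2\nTTGGCCAA\n")

print(count_fasta_reads("toy.fasta"))  # → 2
```

Counting headers rather than dividing line counts is the safe choice here, because FASTA sequences may be wrapped over several lines.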
There are a number of questions:
1/ Is this file, ERR770976_MERGED_FASTQ_nt_reads.fasta, the merged workable-reads file?
2/ If so, how come this file has an extra "ambiguous base filtering" step which is not documented on GitHub (as seen in the "Quality Control" tab of https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667589/runs/ERR770976/results/versions/2.0)?
3/ Where are the workable un-merged reads, which are described on GitHub as part of the "shotgun data the output files per sample"?
4/ Why is the "initial" (I guess "raw") number of reads (1,241,922) in the EMG "Quality Control" tab different from the raw read count on the ENA page above (1,149,852)? Since the EMG analysis appears to be based on more reads than the supposedly official raw dataset, is it possible that EMG started from the "pre-raw" dataset, i.e. the files obtained directly from the sequencing company?
So it seems that the data on the EMG portal are not, in fact, the workable dataset described in the GitHub documentation, but an EMG-specific dataset based on a different raw dataset and a different read pre-processing?
I seem to remember we had agreed that we should all be working with the same datasets; otherwise, how are we to compare and merge our results? Besides, the materials & methods section would end up with one read pre-processing method per OSD analysis partner.
Thanks ever so much for your help (and patience, as this might have been said before; but trust me, I really did my best to search before writing this mail, so it might in fact just be an issue with the documentation, or with my poor understanding).
Cheers,
Pascal