diff --git a/jgi_data_results/README.md b/jgi_data_results/README.md
index dae8322..59bfa41 100644
--- a/jgi_data_results/README.md
+++ b/jgi_data_results/README.md
@@ -68,10 +68,10 @@ Results show six categories of taxonomic prediction accuracy:
 |----------|-------|-----------------|
 | **EXACT MATCH** | 15 | IMG and GTDB predictions match at species level |
 | **MATCH - Phylum level** | 45 | Predictions align at phylum level (most common case) |
-| **PARTIAL - Same genus** | 14 | Same genus but different species prediction |
+| **PARTIAL - Same genus** | 16 | Same genus but different species prediction |
 | **UNCLASSIFIED** | 13 | GTDB assigned "Unclassified Bacteria" (insufficient confidence) |
 | **MISMATCH - Different genus** | 6 | IMG and GTDB predictions differ at genus level |
-| **MISSING GTDB DATA** | 7 | GTDB data unavailable for this IMG genome |
+| **MISSING GTDB DATA** | 5 | GTDB data unavailable for this IMG genome |
 
 ### `img_llm_annotations.tsv` (Supplementary File)
 
@@ -83,11 +83,11 @@ Contains the same 100 samples without GTDB annotations, showing only the origina
 
 2. **Species-Level Exact Matches**: Only 15% show exact species-level matches, reflecting both taxonomic annotation methodology differences and potential genuinely different organism identifications.
 
-3. **Genus-Level Partial Matches**: 14% of samples remain at the same genus but with different species predictions, suggesting fine-grained taxonomic differences.
+3. **Genus-Level Partial Matches**: 16% of samples remain at the same genus but with different species predictions, suggesting fine-grained taxonomic differences.
 
 4. **Unclassified Cases**: 13% received "Unclassified Bacteria" from GTDB, often indicating novel organisms or sequences with limited reference data.
 
-5. **Missing Data**: 7% of IMG genomes lack GTDB coverage, highlighting coverage gaps in the GTDB database.
+5. **Missing Data**: 5% of IMG genomes lack GTDB coverage, highlighting coverage gaps in the GTDB database.
 
 6. **Mismatches**: 6% show true genus-level mismatches, potentially indicating annotation errors or novel taxonomy.
 
diff --git a/jgi_data_results/img_llm_annotations_with_gtdb.tsv b/jgi_data_results/img_llm_annotations_with_gtdb.tsv
index 3ec06ec..c4afc45 100644
--- a/jgi_data_results/img_llm_annotations_with_gtdb.tsv
+++ b/jgi_data_results/img_llm_annotations_with_gtdb.tsv
@@ -92,10 +92,10 @@ JGI sequencing project id IMG genome id File id File name Original upa Generated
 1352409 2947692693 6226423ae99caa81935a9684 52655.2.412973.CAAGTGCA-CAAGTGCA.fastq.gz 239038/186/1 240032 https://narrative.kbase.us/narrative/240032 Actinobacteria bacterium 20805-2 Streptomyces althioticus MATCH - Phylum level (Matches at phylum level)
 1214714 2816332240 5c3a97ec46d1e66b9ba89490 12804.1.287316.AATACGCG-CGCGTATT.fastq.gz 239038/187/1 240034 https://narrative.kbase.us/narrative/240034 Streptomyces albus J1074 VWB-mCherry-12 Streptomyces albidoflavus PARTIAL - Same genus
 1340370 2944903374 61e8ce15b0df7a8c0db81a10 52640.1.404944.ACGATGAC-ACGATGAC.fastq.gz 239038/188/1 240028 https://narrative.kbase.us/narrative/240028 Actinomycetota bacterium 44427 Streptomyces sp900105755 MATCH - Phylum level (Matches at phylum level)
-1248680 8130845799 67ae064768dd0c5de8e12ada 53090.2.581637.CCACTCGAGC-AGGACTCTTC.fastq.gz 239038/194/1 240223 https://narrative.kbase.us/narrative/240223 Streptosporangium nanhuense DSM 46674 Not found in log MISSING GTDB DATA
+1248680 8130845799 67ae064768dd0c5de8e12ada 53090.2.581637.CCACTCGAGC-AGGACTCTTC.fastq.gz 239038/194/1 240223 https://narrative.kbase.us/narrative/240223 Streptosporangium nanhuense DSM 46674 Not found in log PARTIAL - Same genus
 1053055 2596583657 545d5e010d87855284890b40 8465.8.102013.GACGAC.fastq.gz 239038/195/1 240224 https://narrative.kbase.us/narrative/240224 Xanthobacter autotrophicus DSM 432 Xanthobacter autotrophicus_A EXACT MATCH
 1030857 2574179732 5329352d49607a1be00599a1 7779.3.83550.AAGCGA.fastq.gz 239038/198/1 240175 https://narrative.kbase.us/narrative/240175 Vibrio porteresiae DSM 19223 Vibrio porteresiae EXACT MATCH
-1186048 8130828393 67d0972fd72b25923552b524 53102.8.586047.TCGTTCGTAA-CTCGAACCGG.fastq.gz 239038/199/1 240226 https://narrative.kbase.us/narrative/240226 Streptosporangium shengliense DSM 45881 (version 2) Not found in log MISSING GTDB DATA
+1186048 8130828393 67d0972fd72b25923552b524 53102.8.586047.TCGTTCGTAA-CTCGAACCGG.fastq.gz 239038/199/1 240226 https://narrative.kbase.us/narrative/240226 Streptosporangium shengliense DSM 45881 (version 2) Not found in log PARTIAL - Same genus
 1248546 8130794579 67d0973ed72b25923552b58d 53102.8.586047.ACGGGTGAGC-GTTTCCGTAC.fastq.gz 239038/202/1 240174 https://narrative.kbase.us/narrative/240174 Streptosporangium longisporum DSM 43180 (version 2) Unclassified Bacteria UNCLASSIFIED
 1347351 2990676009 6267f6a1f21e5a14d08d81ce 52684.1.418857.GGTATAGG-GGTATAGG.fastq.gz 239038/203/1 240171 https://narrative.kbase.us/narrative/240171 Bradyrhizobium elkanii USDA 319 Unclassified Bacteria UNCLASSIFIED
 1352244 2953085618 622f645fe99caa81935b477e 52664.1.415157.TTGGACGT-TTGGACGT.fastq.gz 239038/2/1 239695 https://narrative.kbase.us/narrative/239695 Brevundimonas sp. 003966 Brevundimonas EXACT MATCH