Use KEGG 118 HMM libraries in getKEGGModelForOrganism (#642)

edkerk · web-flow · commit aa5e2c670f2f · 2026-06-16T11:32:14.000+02:00
* Use KEGG 118 HMM libraries in getKEGGModelForOrganism

Point getKEGGModelForOrganism at the kegg118 pre-trained HMM sets published in
the raven-toolbox v0.3.0 release. The recognised dataDir suffixes become
euk90_kegg118 / prok90_kegg118, and the auto-download URL fetches
kegg118_<domain>.hmm.gz from the v0.3.0 release (previously kegg116 from v0.1.0).

Only the two kegg118 HMM sets (eukaryotes, prokaryotes) are supported; earlier
KEGG releases are no longer offered for download. Update the tutorial5 example to
use euk90_kegg118 accordingly.

* Name HMM sets by published asset name, drop parallel domain array

getKEGGModelForOrganism recognised dataDir suffixes euk90_kegg118 / prok90_kegg118
but built the download URL from a second, index-aligned array (eukaryotes /
prokaryotes), and the published asset is named differently again
(kegg118_eukaryotes). Three names for one artefact.

Standardise on the published asset name: dataDir is now kegg118_eukaryotes /
kegg118_prokaryotes, so the local directory, the local .hmm library, and the
downloaded asset all share one name. The hmmDomains/hmmIndex parallel array is
removed -- the matched dataDir suffix doubles as the download filename. Docstring
and the tutorial5 example updated accordingly.
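
A minimal sketch of the resulting naming scheme, for illustration only; it is
not the code in getKEGGModelForOrganism. The dataDir value is a hypothetical
example path, and isMatch / assetUrl are names introduced here. Each suffix is
tested separately so that the lookup yields one logical per option.

```matlab
% Sketch only: dataDir below is a hypothetical example path.
hmmOptions = {'kegg118_eukaryotes','kegg118_prokaryotes'};
dataDir = fullfile('data','kegg118_prokaryotes');

% Test each suffix on its own, giving one logical per entry in hmmOptions.
isMatch = cellfun(@(opt) endsWith(dataDir,opt), hmmOptions);
hmmName = hmmOptions{isMatch};

% The matched name is shared by the local library and the release asset.
libraryFile = [dataDir '.hmm'];
assetUrl = ['https://github.com/SysBioChalmers/raven-toolbox/releases/download/v0.3.0/' hmmName '.hmm.gz'];
fprintf('%s\n%s\n', libraryFile, assetUrl);
```

With this layout, a directory ending in kegg118_prokaryotes pairs with
kegg118_prokaryotes.hmm locally and kegg118_prokaryotes.hmm.gz in the release.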
diff --git a/reconstruction/kegg/getKEGGModelForOrganism.m b/reconstruction/kegg/getKEGGModelForOrganism.m
@@ -26,9 +26,9 @@
 %     keepUndefinedStoich, keepIncomplete and keepGeneral.
 % dataDir : char
 %     directory for which to retrieve the input data, styled as
-%     prok90_kegg116 or euk90_kegg116, indicating whether the HMMs were
-%     trained on pro- or eukaryotic sequences (first set of digits is the
-%     sequence similarity threshold, second set is the KEGG version). The
+%     kegg118_prokaryotes or kegg118_eukaryotes, indicating the KEGG version
+%     and whether the HMMs were trained on pro- or eukaryotic sequences. The
+%     directory name matches the published HMM library it is paired with. The
 %     prebuilt concatenated KO HMM library (dataDir.hmm) is downloaded here
 %     from the corresponding RAVEN release if not already present. May also
 %     contain a dataDir\keggdb sub-folder with a local KEGG FTP dump, used to
@@ -97,7 +97,7 @@
 %      maxPhylDist controls which organisms' annotations are considered.
 %   2. From protein homology (fastaFile supplied). The query proteome is
 %      searched, in a single hmmsearch, against a prebuilt KEGG-version- and
-%      domain-specific concatenated KO HMM library (e.g. kegg116_eukaryotes),
+%      domain-specific concatenated KO HMM library (e.g. kegg118_eukaryotes),
 %      downloaded from the corresponding RAVEN release if not already present
 %      in dataDir. Hits are filtered by cutOff and the minScoreRatioKO /
 %      minScoreRatioG ratios into a KO-gene matrix, from which the model is
@@ -200,8 +200,7 @@
 %gzip-compressed flatfile, queried in one hmmsearch); if it is not already
 %present it is downloaded from the corresponding RAVEN release.
 if ~isempty(dataDir)
-    hmmOptions={'euk90_kegg116','prok90_kegg116'};
-    hmmDomains={'eukaryotes','prokaryotes'}; %Aligned with hmmOptions
+    hmmOptions={'kegg118_eukaryotes','kegg118_prokaryotes'};
     if ~endsWith(dataDir,hmmOptions) %Check if dataDir ends with any of the hmmOptions.
         %If not, then check whether the required keggdb folder exists anyway.
         if ~isfile(fullfile(dataDir,'keggdb','genes.pep'))
@@ -210,8 +209,10 @@
     else
         %dataDir points to a RAVEN-provided set. Use the concatenated KO HMM
         %library (one gzip-compressed flatfile, queried in a single
-        %hmmsearch), downloading and extracting it if necessary.
-        hmmIndex=find(endsWith(dataDir,hmmOptions),1);
+        %hmmsearch), downloading and extracting it if necessary. The dataDir
+        %name matches the published HMM library asset, so it doubles as the
+        %download filename.
+        hmmName=hmmOptions{cellfun(@(opt) endsWith(dataDir,opt),hmmOptions)};
         libraryFile=[dataDir '.hmm'];
         if isfile(libraryFile)
             fprintf(['NOTE: Found <strong>' libraryFile '</strong> HMM library, it will therefore be used during reconstruction\n']);
@@ -223,7 +224,7 @@
         else
             fprintf('Downloading the HMM library file... ');
             try
-                websave([libraryFile '.gz'],['https://github.com/SysBioChalmers/raven-toolbox/releases/download/v0.1.0/kegg116_' hmmDomains{hmmIndex} '.hmm.gz']);
+                websave([libraryFile '.gz'],['https://github.com/SysBioChalmers/raven-toolbox/releases/download/v0.3.0/' hmmName '.hmm.gz']);
             catch ME
                 if strcmp(ME.identifier,'MATLAB:webservices:HTTP404StatusCodeError')
                     error('Failed to download the HMM library file, the server returned a 404 error, try again later. If the problem persists please report it on the RAVEN GitHub Issues page: https://github.com/SysBioChalmers/RAVEN/issues')
diff --git a/tutorial/tutorial5.m b/tutorial/tutorial5.m
@@ -9,7 +9,7 @@
 % 
 % Start by downloading trained Hidden Markov Models for eukaryotes. This can
 % be done automatically or manually from the RAVEN Wiki in its GitHub
-% repository. In this tutorial, the archive "euk90_kegg105" is picked for
+% repository. In this tutorial, the archive "kegg118_eukaryotes" is picked for
 % the automatic download. See the documentation in the RAVEN Wiki for more
 % information regarding preparation of such archive.
 % 
@@ -19,7 +19,7 @@
 % are for. This process takes up to 20-35 minutes in macOS, Unix systems and
 % 40-55 minutes in Windows, depending on your hardware and the size of
 % target organism proteome
-model=getKEGGModelForOrganism('sce','sce.fa','euk90_kegg105','output',false,false,false,false,10^-30,0.8,0.3,-1);
+model=getKEGGModelForOrganism('sce','sce.fa','kegg118_eukaryotes','output',false,false,false,false,10^-30,0.8,0.3,-1);
 
 % The resulting model should contain around 1589 reactions, 1600
 % metabolites and 836 genes. Small variations are possible since it is an