Commit aa5e2c6

Use KEGG 118 HMM libraries in getKEGGModelForOrganism (#642)
* Use KEGG 118 HMM libraries in getKEGGModelForOrganism

  Point getKEGGModelForOrganism at the kegg118 pre-trained HMM sets published in the raven-toolbox v0.3.0 release. The recognised dataDir suffixes become euk90_kegg118 / prok90_kegg118, and the auto-download URL fetches kegg118_<domain>.hmm.gz from the v0.3.0 release (previously kegg116 from v0.1.0).

  Only the two kegg118 HMM sets (eukaryotes, prokaryotes) are supported; earlier KEGG releases are no longer offered for download.

  Update the tutorial5 example to use euk90_kegg118 accordingly.

* Name HMM sets by published asset name, drop parallel domain array

  getKEGGModelForOrganism recognised dataDir suffixes euk90_kegg118 / prok90_kegg118 but built the download URL from a second, index-aligned array (eukaryotes / prokaryotes), and the published asset is named differently again (kegg118_eukaryotes). Three names for one artefact.

  Standardise on the published asset name: dataDir is now kegg118_eukaryotes / kegg118_prokaryotes, so the local directory, the local .hmm library, and the downloaded asset all share one name. The hmmDomains/hmmIndex parallel array is removed -- the matched dataDir suffix doubles as the download filename.

  Docstring and the tutorial5 example updated accordingly.
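The "one name for one artefact" convention described in the commit message can be sketched as follows. This is only an illustration, not the function's actual code: the parent folder `hmms` and the variable names `releaseBase` and `assetUrl` are assumptions made for the example, while the release URL and the `.hmm` / `.hmm.gz` suffixes are taken from the diff.

```matlab
% Illustrative sketch: one set name drives the directory, the local
% library file and the download URL.
hmmName = 'kegg118_eukaryotes';

% Hypothetical parent folder; only the final folder name matters.
dataDir = fullfile('hmms', hmmName);

% Local concatenated KO HMM library sits next to the directory.
libraryFile = [dataDir '.hmm'];

% Release location as it appears in the diff (v0.3.0 of raven-toolbox).
releaseBase = 'https://github.com/SysBioChalmers/raven-toolbox/releases/download/v0.3.0/';
assetUrl = [releaseBase hmmName '.hmm.gz'];

fprintf('directory: %s\nlibrary:   %s\nasset:     %s\n', dataDir, libraryFile, assetUrl);
```

Switching `hmmName` to `kegg118_prokaryotes` changes all three derived values together, which is the point of dropping the separate domain array.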
1 parent a0cf12d commit aa5e2c6

2 files changed

Lines changed: 12 additions & 11 deletions

reconstruction/kegg/getKEGGModelForOrganism.m

Lines changed: 10 additions & 9 deletions
@@ -26,9 +26,9 @@
 % keepUndefinedStoich, keepIncomplete and keepGeneral.
 % dataDir : char
 % directory for which to retrieve the input data, styled as
-% prok90_kegg116 or euk90_kegg116, indicating whether the HMMs were
-% trained on pro- or eukaryotic sequences (first set of digits is the
-% sequence similarity threshold, second set is the KEGG version). The
+% kegg118_prokaryotes or kegg118_eukaryotes, indicating the KEGG version
+% and whether the HMMs were trained on pro- or eukaryotic sequences. The
+% directory name matches the published HMM library it is paired with. The
 % prebuilt concatenated KO HMM library (dataDir.hmm) is downloaded here
 % from the corresponding RAVEN release if not already present. May also
 % contain a dataDir\keggdb sub-folder with a local KEGG FTP dump, used to
@@ -97,7 +97,7 @@
 % maxPhylDist controls which organisms' annotations are considered.
 % 2. From protein homology (fastaFile supplied). The query proteome is
 % searched, in a single hmmsearch, against a prebuilt KEGG-version- and
-% domain-specific concatenated KO HMM library (e.g. kegg116_eukaryotes),
+% domain-specific concatenated KO HMM library (e.g. kegg118_eukaryotes),
 % downloaded from the corresponding RAVEN release if not already present
 % in dataDir. Hits are filtered by cutOff and the minScoreRatioKO /
 % minScoreRatioG ratios into a KO-gene matrix, from which the model is
@@ -200,8 +200,7 @@
 %gzip-compressed flatfile, queried in one hmmsearch); if it is not already
 %present it is downloaded from the corresponding RAVEN release.
 if ~isempty(dataDir)
-hmmOptions={'euk90_kegg116','prok90_kegg116'};
-hmmDomains={'eukaryotes','prokaryotes'}; %Aligned with hmmOptions
+hmmOptions={'kegg118_eukaryotes','kegg118_prokaryotes'};
 if ~endsWith(dataDir,hmmOptions) %Check if dataDir ends with any of the hmmOptions.
 %If not, then check whether the required keggdb folder exists anyway.
 if ~isfile(fullfile(dataDir,'keggdb','genes.pep'))
@@ -210,8 +209,10 @@
 else
 %dataDir points to a RAVEN-provided set. Use the concatenated KO HMM
 %library (one gzip-compressed flatfile, queried in a single
-%hmmsearch), downloading and extracting it if necessary.
-hmmIndex=find(endsWith(dataDir,hmmOptions),1);
+%hmmsearch), downloading and extracting it if necessary. The dataDir
+%name matches the published HMM library asset, so it doubles as the
+%download filename.
+hmmName=hmmOptions{endsWith(dataDir,hmmOptions)};
 libraryFile=[dataDir '.hmm'];
 if isfile(libraryFile)
 fprintf(['NOTE: Found <strong>' libraryFile '</strong> HMM library, it will therefore be used during reconstruction\n']);
@@ -223,7 +224,7 @@
 else
 fprintf('Downloading the HMM library file... ');
 try
-websave([libraryFile '.gz'],['https://github.com/SysBioChalmers/raven-toolbox/releases/download/v0.1.0/kegg116_' hmmDomains{hmmIndex} '.hmm.gz']);
+websave([libraryFile '.gz'],['https://github.com/SysBioChalmers/raven-toolbox/releases/download/v0.3.0/' hmmName '.hmm.gz']);
 catch ME
 if strcmp(ME.identifier,'MATLAB:webservices:HTTP404StatusCodeError')
 error('Failed to download the HMM library file, the server returned a 404 error, try again later. If the problem persists please report it on the RAVEN GitHub Issues page: https://github.com/SysBioChalmers/RAVEN/issues')
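
The suffix lookup in the diff above (`hmmName=hmmOptions{endsWith(dataDir,hmmOptions)}`) relies on `endsWith` producing one logical value per option. As far as the MATLAB documentation describes it, `endsWith(str,pat)` returns a result sized like `str`, so a char `dataDir` combined with a cell array of patterns would yield a single true/false ("ends with any of them") rather than a per-option mask. Assuming that reading is correct, a more explicit lookup could test each option separately. The following is a minimal sketch under that assumption; the folder `hmms` and the variable `isMatch` are hypothetical and not part of the commit.

```matlab
% Sketch only: resolve which known HMM set name dataDir ends with.
hmmOptions = {'kegg118_eukaryotes', 'kegg118_prokaryotes'};
dataDir = fullfile('hmms', 'kegg118_prokaryotes');  % hypothetical location

% Test each option on its own so the mask has one entry per option.
isMatch = cellfun(@(opt) endsWith(dataDir, opt), hmmOptions);

if any(isMatch)
    % Take the first matching option as the published asset name.
    hmmName = hmmOptions{find(isMatch, 1)};
else
    % Not a RAVEN-provided set; the caller would fall back to keggdb.
    hmmName = '';
end

fprintf('Resolved HMM set: %s\n', hmmName);
```

With the per-option mask, a prokaryotic `dataDir` resolves to `kegg118_prokaryotes` rather than depending on how a scalar logical indexes into the cell array.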

tutorial/tutorial5.m

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@
 %
 % Start by downloading trained Hidden Markov Models for eukaryotes. This can
 % be done automatically or manually from the RAVEN Wiki in its GitHub
-% repository. In this tutorial, the archive "euk90_kegg105" is picked for
+% repository. In this tutorial, the archive "kegg118_eukaryotes" is picked for
 % the automatic download. See the documentation in the RAVEN Wiki for more
 % information regarding preparation of such archive.
 %
@@ -19,7 +19,7 @@
 % are for. This process takes up to 20-35 minutes in macOS, Unix systems and
 % 40-55 minutes in Windows, depending on your hardware and the size of
 % target organism proteome
-model=getKEGGModelForOrganism('sce','sce.fa','euk90_kegg105','output',false,false,false,false,10^-30,0.8,0.3,-1);
+model=getKEGGModelForOrganism('sce','sce.fa','kegg118_eukaryotes','output',false,false,false,false,10^-30,0.8,0.3,-1);
 
 % The resulting model should contain around 1589 reactions, 1600
 % metabolites and 836 genes. Small variations are possible since it is an
