Input & Output Files Format
KaMRaT index is always the first operation in a KaMRaT workflow. It takes a feature count matrix (as described in the General Description page) and generates an index from it, made of three output files:
- idx-meta.bin
- idx-mat.bin
- idx-pos.bin
The idx-meta.bin file contains metadata of the analysis, including:
- Line 1: sample number, k-mer length (0 if in general feature mode), and strandedness (only if k-mer length is not 0; "T" if stranded RNA-seq data, "F" if not). The items are separated by a \t character;
- Line 2: the header line of the k-mer count table;
- Line 3 (optional): the normalisation factors, in the same order as indicated by the header line (if normalisation is applied).
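The three-line layout above can be read with a few lines of Python. This is only a minimal sketch: it assumes the file content is readable as text, the function name `parse_meta` is hypothetical, and lines 2 and 3 are kept as raw strings because their internal separator is not specified here.

```python
def parse_meta(text):
    """Parse the content of idx-meta.bin following the layout described above."""
    lines = text.splitlines()
    first = lines[0].split("\t")
    smp_num = int(first[0])
    kmer_len = int(first[1])
    # Strandedness is only present when the k-mer length is not 0.
    stranded = None
    if kmer_len != 0:
        stranded = first[2] == "T"
    return {
        "smp_num": smp_num,
        "kmer_len": kmer_len,
        "stranded": stranded,
        "header": lines[1],
        # Line 3 is optional; it is kept as raw text in this sketch.
        "norm_factors": lines[2] if len(lines) > 2 else None,
    }
```

For example, with made-up content, `parse_meta("3\t31\tT\nfeature\tS1\tS2\tS3\n")` reports 3 samples, a k-mer length of 31, stranded data, and no normalisation factors.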
The idx-mat.bin file stores the count matrix in binary form. For each feature (k-mer or general feature), it contains the binarised count vector (smp_num float elements), followed by the feature string (k-mer sequence or general feature name) and a newline character.
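To make the record layout concrete, the sketch below writes and reads one record with Python's `struct` module. It assumes 32-bit little-endian floats and ASCII feature strings, which this page does not state; `encode_record` and `decode_record` are hypothetical helper names, not part of KaMRaT.

```python
import struct

def encode_record(counts, feature):
    """Pack one feature: count vector, then feature string, then a newline."""
    vector = struct.pack("<%df" % len(counts), *counts)  # assumed 4-byte floats
    return vector + feature.encode("ascii") + b"\n"

def decode_record(buf, offset, smp_num):
    """Read the record starting at byte `offset`.

    Returns (counts, feature, offset of the next record).
    """
    counts = struct.unpack_from("<%df" % smp_num, buf, offset)
    start = offset + 4 * smp_num
    # Look for the newline only after the fixed-width count block,
    # because the binary floats may themselves contain the byte 0x0A.
    end = buf.index(b"\n", start)
    return list(counts), buf[start:end].decode("ascii"), end + 1
```

Because the count vector has a fixed width of smp_num elements, a reader that knows the starting byte of a record can decode it directly without scanning the records before it.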
The idx-pos.bin file records the feature count vectors' index position, depending on the indexing mode:
- For k-mer features, pairs of binarised (k-mer code, index position) (uint64_t and size_t) are continuously written feature by feature;
- For general features, only the binarised index position (size_t) is continuously written feature by feature.
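The two modes differ only in the width of each entry, which the following sketch illustrates. It assumes that size_t is 8 bytes and that values are little-endian, as on a typical 64-bit Linux build; `read_positions` is a hypothetical helper name.

```python
import struct

# Assumed layouts: uint64_t and size_t are both 8-byte little-endian integers.
KMER_ENTRY = struct.Struct("<QQ")    # (k-mer code, index position)
GENERAL_ENTRY = struct.Struct("<Q")  # index position only

def read_positions(buf, kmer_mode):
    """Decode the full content of idx-pos.bin for the given indexing mode."""
    entry = KMER_ENTRY if kmer_mode else GENERAL_ENTRY
    out = []
    for fields in entry.iter_unpack(buf):
        # k-mer mode yields (code, position) pairs, general mode bare positions.
        out.append(fields if kmer_mode else fields[0])
    return out
```

Whether the index is in k-mer mode can be decided from the k-mer length stored in idx-meta.bin (0 in general feature mode).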
The operational modules filter and mask can generate three different types of output.
If the modules are used for obtaining final results (-outfmt tab or simply by default), the result is a feature count matrix with $p$ features (rows) and $N$ samples (columns), where:
- $p$ is the reduced feature number, $p < P$;
- $N$ is the sample number;
- the first row and the first column are the matrix header and the feature strings (k-mers or general feature names), respectively.
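A table of this shape can be loaded as in the sketch below. It assumes the columns are separated by the \t character and that counts parse as numbers; the function name `read_count_matrix` and the sample names in the usage note are made up for illustration.

```python
import csv

def read_count_matrix(handle):
    """Read a count table: first row is the header, first column the feature strings."""
    reader = csv.reader(handle, delimiter="\t")  # tab separation is an assumption
    header = next(reader)
    samples = header[1:]          # the N sample names
    features, counts = [], []
    for row in reader:
        if not row:
            continue              # skip blank lines
        features.append(row[0])   # k-mer or general feature name
        counts.append([float(x) for x in row[1:]])
    return samples, features, counts
```

Opened on a file holding `feature`, `S1`, `S2` in the header row and two feature rows, the function returns two sample names, two feature strings, and a 2 x 2 list of counts.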
If the modules are used for obtaining a FASTA file (-outfmt fa), the result is a FASTA file containing the selected sequences. Sequence names are simply serial IDs starting from 0.
If the modules are used for generating intermediate results (-outfmt bin), the result is a binarised multi-row file. Each row has 4 items, separated by the \t character:
- k-mer sequence or general feature name
- a placeholder set to 0
- a placeholder set to 1
- binarised position in the idx-mat.bin file (size_t)
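Since the last item is binary, a row cannot safely be split on \t or on newlines: the position bytes may contain those values. The sketch below reads a row field by field instead. It rests on several assumptions not confirmed by this page: the first three items are plain text, size_t is 8 little-endian bytes, and each row ends with a single terminator byte. `decode_row` is a hypothetical helper name.

```python
import struct

def decode_row(buf, offset=0):
    """Decode one intermediate row; returns (text fields, position, next offset)."""
    fields = []
    for _ in range(3):
        # The three leading items are assumed to be text, each ended by a tab.
        tab = buf.index(b"\t", offset)
        fields.append(buf[offset:tab].decode("ascii"))
        offset = tab + 1
    # The fourth item is read as a fixed-width integer, never searched for.
    (position,) = struct.unpack_from("<Q", buf, offset)
    return fields, position, offset + 8 + 1  # +1 skips the assumed row terminator
```

A position of 9, for instance, is encoded with the same byte as the \t character, which is why the fixed-width read matters.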
The operational module query only takes the results of KaMRaT index, and generates a k-mer/contig count matrix with $M$ rows and $N$ columns, where:
- $M$ is the k-mer/contig number in the query;
- $N$ is the sample number;
- the first row and the first column are the matrix header and the feature strings (k-mers or general feature names), respectively.
The operational modules merge and score can take two types of input and can generate three types of output.
If used as a secondary module following filter/mask/merge/score operations, the intermediate result of the primary module must be provided (-with argument) in addition to the results of KaMRaT index. This is a multi-row binary file, with each row containing:
- k-mer sequence or general feature name
- feature's score value or representative level value
- number of merged k-mers
- binarised position in the idx-mat.bin file (size_t)
If used directly following the KaMRaT index module (by default), the only input is the KaMRaT index result folder.
If the modules are used for obtaining final results (-outfmt tab or simply by default), the output is a feature count matrix with $p$ features (rows) and $N$ samples (columns), where:
- $p$ is the reduced feature number, $p < P$;
- $N$ is the sample number;
- the first row and the first column are the matrix header and the feature strings (k-mers or general feature names), respectively.
If the modules are used for obtaining a FASTA file (-outfmt fa), the result is a FASTA file containing the selected sequences. Sequence names are simply serial IDs starting from 0.
If the modules are used for generating intermediate results (-outfmt bin), the result is a binarised multi-row file. Each row has 4 items, separated by the \t character:
- k-mer sequence or general feature name
- feature's score value or representative level value
- number of merged k-mers
- binarised position in the idx-mat.bin file (size_t)