
Input & Output Files Format

Haoliang Xue edited this page Mar 11, 2025 · 1 revision

1. KaMRaT index

KaMRaT index is always the first operation in a KaMRaT workflow. It takes a feature count matrix (as described in the General Description page) as input and generates an index from it, consisting of three files:

  • idx-meta.bin
  • idx-mat.bin
  • idx-pos.bin

The idx-meta.bin file contains metadata of the analysis, including:

  • Line 1: the sample number, the k-mer length (0 in general feature mode), and the strandedness (present only if the k-mer length is not 0; "T" for stranded RNA-seq data, "F" otherwise). The items are separated by a \t character;
  • Line 2: the header line of the feature count table;
  • Line 3 (optional): the normalisation factors, in the same order as the header line (present only if normalisation is applied).
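
The layout above can be read with a few lines of Python. This is a minimal sketch rather than KaMRaT's own reader: it assumes the file is plain text despite its .bin extension, that the header columns are tab-separated, and that the normalisation factors are whitespace-separated. The names `IndexMeta` and `parse_idx_meta` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class IndexMeta(NamedTuple):
    sample_number: int
    kmer_length: int                      # 0 in general feature mode
    stranded: Optional[bool]              # None when kmer_length == 0
    header: List[str]
    norm_factors: Optional[List[float]]   # None when no normalisation was applied


def parse_idx_meta(text: str) -> IndexMeta:
    """Parse the content of idx-meta.bin (assumed to be plain text)."""
    lines = text.rstrip("\n").split("\n")

    # Line 1: sample number, k-mer length, and (k-mer mode only) strandedness.
    first = lines[0].split("\t")
    sample_number = int(first[0])
    kmer_length = int(first[1])
    stranded = (first[2] == "T") if kmer_length != 0 else None

    # Line 2: header of the count table (assumed tab-separated).
    header = lines[1].split("\t")

    # Line 3 (optional): normalisation factors (assumed whitespace-separated).
    norm_factors = None
    if len(lines) > 2 and lines[2].strip():
        norm_factors = [float(x) for x in lines[2].split()]

    return IndexMeta(sample_number, kmer_length, stranded, header, norm_factors)
```

For a real index, the text would come from something like `open("idx-meta.bin").read()`, where the path depends on the chosen output folder.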

The idx-mat.bin file stores the count matrix in binary form. For each feature (whether a k-mer or a general feature), it contains the binarised count vector (smp_num float elements), followed by the feature string (k-mer sequence or general feature name) and a newline character.
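
To make the record layout concrete, here is a minimal sketch of reading one record from a byte buffer. It assumes that each count is a 4-byte little-endian float and that the feature string is ASCII; the actual byte order and float width depend on the machine that built the index. The function name `read_mat_record` is hypothetical.

```python
import struct
from typing import List, Tuple


def read_mat_record(buf: bytes, pos: int, smp_num: int) -> Tuple[List[float], str, int]:
    """Read one record starting at byte offset `pos`.

    Returns the count vector, the feature string, and the offset of the next record.
    Assumes 4-byte little-endian floats (an assumption, not stated by the format).
    """
    # Count vector: smp_num consecutive floats.
    counts = list(struct.unpack_from(f"<{smp_num}f", buf, pos))

    # Feature string: everything up to the next newline after the count vector.
    name_start = pos + 4 * smp_num
    name_end = buf.index(b"\n", name_start)
    feature = buf[name_start:name_end].decode("ascii")

    return counts, feature, name_end + 1
```

Because the count vector has a fixed width, the search for the newline starts only after the floats, so a byte equal to `\n` inside the binary counts does not end the record early.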

The idx-pos.bin file records the index position of each feature's count vector. Its content depends on the indexing mode:

  • For k-mer features, binarised (k-mer code, index position) pairs (uint64_t and size_t) are written consecutively, feature by feature;
  • For general features, only the binarised index position (size_t) is written consecutively, feature by feature.
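
Both layouts can be decoded with Python's `struct` module. The sketch below assumes that size_t is 8 bytes wide and that values are little-endian, which holds on typical 64-bit Linux machines but is not guaranteed by the format. The function names are hypothetical.

```python
import struct
from typing import List, Tuple


def read_kmer_positions(buf: bytes) -> List[Tuple[int, int]]:
    """Decode consecutive (k-mer code, index position) pairs.

    Assumes uint64_t + 8-byte size_t, little-endian.
    """
    return list(struct.iter_unpack("<QQ", buf))


def read_general_positions(buf: bytes) -> List[int]:
    """Decode consecutive index positions (8-byte size_t, little-endian assumed)."""
    return [pos for (pos,) in struct.iter_unpack("<Q", buf)]
```

`struct.iter_unpack` raises an error if the buffer length is not a multiple of the record size, which serves as a quick sanity check that the chosen mode matches the file.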

2. KaMRaT filter/mask

The operational modules filter and mask can generate three different types of output.

If the modules are used for obtaining final results (-outfmt tab or simply by default), the result is a feature count matrix of $(p + 1)$ rows $\times$ $(N + 1)$ columns, where:

  • $p$ is the reduced feature number, $p < P$;
  • $N$ is the sample number;
  • the first row and the first column are the matrix header and the feature strings (k-mers or general feature names), respectively.
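
As an illustration of this $(p + 1) \times (N + 1)$ layout, the following sketch splits such a table into sample names, feature strings, and a $p \times N$ list of counts. It assumes the columns are tab-separated; the function name `read_count_matrix` is hypothetical.

```python
import csv
import io
from typing import List, Tuple


def read_count_matrix(text: str) -> Tuple[List[str], List[str], List[List[float]]]:
    """Split a tabular count matrix into samples, features, and counts.

    Assumes tab-separated columns, with the header in the first row
    and the feature strings in the first column.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]

    samples = header[1:]                                   # N sample names
    features = [row[0] for row in body]                    # p feature strings
    counts = [[float(x) for x in row[1:]] for row in body]  # p x N counts
    return samples, features, counts
```

With the result in hand, `len(features)` gives $p$ and `len(samples)` gives $N$.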

If the modules are used for obtaining a FASTA file (-outfmt fa), the result is a FASTA file containing the selected sequences. Sequence names are serial IDs starting from 0.

If the modules are used for generating intermediate results (-outfmt bin), the result is a binarised multi-row file. Each row has 4 items, separated by the \t character:

  • k-mer sequence or general feature name
  • a placeholder with value 0
  • a placeholder with value 1
  • binarised position in the idx-mat.bin file (size_t)

3. KaMRaT query

The operational module query takes only the results of KaMRaT index as input, and generates a k-mer/contig count matrix of $(M + 1)$ rows $\times$ $(N + 1)$ columns, where:

  • $M$ is the k-mer/contig number in query;
  • $N$ is the sample number;
  • the first row and the first column are the matrix header and the feature strings (k-mers or general feature names), respectively.

4. KaMRaT merge/score

The operational modules merge and score can take two types of input and can generate three types of output.

4.1 Input files

If used as a secondary module following filter/mask/merge/score operations, the intermediate result of the primary module must be provided (via the -with argument) in addition to the results of KaMRaT index. This is a multi-row binary file, with each row containing:

  • k-mer sequence or general feature name
  • feature's score value or representative level value
  • number of merged k-mers
  • binarised position in the idx-mat.bin file (size_t)

If used directly following the KaMRaT index module (by default), the only input is the KaMRaT index result folder.

4.2 Output files

If the modules are used for obtaining final results (-outfmt tab or simply by default), the result is a feature count matrix of $(p + 1)$ rows $\times$ $(N + 1)$ columns, where:

  • $p$ is the reduced feature number, $p < P$;
  • $N$ is the sample number;
  • the first row and the first column are the matrix header and the feature strings (k-mers or general feature names), respectively.

If the modules are used for obtaining a FASTA file (-outfmt fa), the result is a FASTA file containing the selected sequences. Sequence names are serial IDs starting from 0.

If the modules are used for generating intermediate results (-outfmt bin), the result is a binarised multi-row file. Each row has 4 items, separated by the \t character:

  • k-mer sequence or general feature name
  • feature's score value or representative level value
  • number of merged k-mers
  • binarised position in the idx-mat.bin file (size_t)
