85 changes: 85 additions & 0 deletions doc/kraken_conversion.md
@@ -0,0 +1,85 @@
# Kraken → Tesseract model conversion (best-effort)

## Script

- `src/training/kraken_to_tesseract.py`

## Prerequisites

- Python 3.8+
- `torch`
- `numpy`
- `kraken` (optional fallback for CoreML-style `.mlmodel` files)

Example installation:

```bash
python3 -m pip install torch numpy
# optional fallback loader for non-torch .mlmodel payloads
python3 -m pip install kraken
```

## Usage

Convert a Kraken `.mlmodel` file into a conversion bundle:

```bash
python3 src/training/kraken_to_tesseract.py \
--input /path/to/model.mlmodel \
--output_prefix /path/to/output/eng
```

If the Kraken VGSL spec contains unsupported layers, the script exits and reports
which layers it could not map. To emit a partial bundle anyway, pass
`--allow_unsupported`:

```bash
python3 src/training/kraken_to_tesseract.py \
--input /path/to/model.mlmodel \
--output_prefix /path/to/output/eng \
--allow_unsupported
```

## Generated files

For `--output_prefix /path/to/output/eng`:

- `/path/to/output/eng.network_spec`: mapped Tesseract VGSL spec
- `/path/to/output/eng.weights.npz`: extracted kraken tensor weights (NumPy)
- `/path/to/output/eng.conversion.json`: mapping summary and unsupported layer report
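
A quick way to sanity-check a bundle is to load the three files and look at what they contain. The sketch below is not part of the converter: `inspect_bundle` is a hypothetical helper, and it assumes that `.network_spec` is plain UTF-8 text and that `.conversion.json` holds a single JSON document. It makes no assumption about the key names inside the JSON report.

```python
import json
from pathlib import Path

import numpy as np


def inspect_bundle(output_prefix):
    """Summarize the three files written for a given --output_prefix.

    Hypothetical helper for illustration; the file suffixes follow the
    list above, everything else is an assumption.
    """
    prefix = str(output_prefix)

    # Assumed to be a plain-text VGSL string.
    spec = Path(prefix + ".network_spec").read_text(encoding="utf-8").strip()

    # NumPy archive: map each stored array name to its shape.
    with np.load(prefix + ".weights.npz") as weights:
        shapes = {name: weights[name].shape for name in weights.files}

    # Mapping summary; returned as-is because its schema is not documented here.
    with open(prefix + ".conversion.json", encoding="utf-8") as handle:
        report = json.load(handle)

    return spec, shapes, report
```

Calling `inspect_bundle("/path/to/output/eng")` returns the spec string, a name-to-shape dictionary for the weights, and the parsed report, which is usually enough to confirm that the tensors line up with the layers in the spec.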

## VGSL feature mapping

The following layer families are mapped directly where possible; which of them appear depends on the input model:

- Input / shape (`1...`)
- Convolution (`C...`, including `Cr/Ct/Cf` forms accepted by Tesseract)
- Maxpool (`M...`, including `Mp...`)
- LSTM family (`L...` supported by Tesseract parser)
- Fully connected / output (`F...`, `O...`)
- Other core Tesseract parser families (`S`, `P`, `A1`, `R`, `T`)

Kraken extensions that have no direct Tesseract equivalent are reported as unsupported:

- `Do` (Dropout): dropped by default; reported as unsupported only when `--keep_dropout` is passed
- `Bn` (BatchNorm)
- `Gr` (GroupNorm)
- `A<act>` standalone activation layers (except `A1` reduction)
- `Lpa` attention-based LSTM variants
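
The two lists above amount to a prefix-based classification of VGSL tokens. The sketch below illustrates that rule; it is not the converter's actual code. The names `classify_token` and `classify_spec` are hypothetical, the spec is assumed to be a flat, whitespace-separated string in square brackets (nested series or parallel groups are not handled), and any token starting with `A1` is assumed to be the reduction form.

```python
# Illustrative only: mirrors the lists above, not the script's real logic.
UNSUPPORTED_PREFIXES = ("Bn", "Gr", "Lpa")
MAPPED_PREFIXES = ("C", "M", "L", "F", "O", "S", "P", "R", "T")


def classify_token(token, keep_dropout=False):
    """Return 'mapped', 'dropped', or 'unsupported' for one VGSL token."""
    if token.startswith("Do"):
        # Dropout is silently dropped unless the caller asks to keep it.
        return "unsupported" if keep_dropout else "dropped"
    if token.startswith(UNSUPPORTED_PREFIXES):
        # Checked before the generic 'L' prefix so that 'Lpa' is caught.
        return "unsupported"
    if token.startswith("A"):
        # Assumption: only the 'A1' form has a Tesseract equivalent.
        return "mapped" if token.startswith("A1") else "unsupported"
    if token[0].isdigit() or token.startswith(MAPPED_PREFIXES):
        return "mapped"
    return "unsupported"


def classify_spec(spec, keep_dropout=False):
    """Classify every token of a flat VGSL spec such as '[1,48,0,1 Cr3,3,32]'."""
    tokens = spec.strip().strip("[]").split()
    return [(tok, classify_token(tok, keep_dropout)) for tok in tokens]
```

Checking the unsupported prefixes before the generic single-letter families matters, because `Lpa` would otherwise be accepted as an ordinary `L...` layer.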

## Format differences and limitations

Kraken models in the wild may be torch-serialized payloads or CoreML-style
`.mlmodel` files. The converter first tries `torch.load` and falls back to
Kraken's own model loader (when `kraken` is installed) for compatibility.
Tesseract runtime models use Tesseract's C++ binary network serialization inside
the `.lstm` component in `.traineddata`.

> ⚠️ `torch.load` uses Python pickle under the hood. Only run the converter on
> trusted `.mlmodel` files, and prefer models with known provenance
> (for example verified checksums/signatures).
> On PyTorch versions that support `weights_only=True`, the converter tries that
> first and falls back to `weights_only=False` for compatibility when needed.
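
One way to act on the checksum advice is to compare a model file's SHA-256 digest with a value published by its author before handing the file to the converter. This is a minimal sketch using only the standard library; `sha256_of_file` and `verify_checksum` are hypothetical helper names, and the expected digest is assumed to come from a source you already trust.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Return True if the file's digest matches the published hex digest."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

A matching checksum only shows that the file is the one its publisher released; it does not make an untrusted publisher's pickle payload safe to load.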

This script currently exports a conversion bundle (`.network_spec` + `.weights.npz`
+ `.conversion.json`) and reports unsupported VGSL features. It does **not** yet
emit a final Tesseract `.lstm`/`.traineddata` binary directly.