Add best-effort Kraken→Tesseract converter script with VGSL mapping and unsupported-feature reporting #4556
Copilot wants to merge 7 commits into
Not up to standards ⛔🔴 Issues
| Category | Results |
|---|---|
| Documentation | 2 minor |
| ErrorProne | 1 high |
| Security | 1 medium |
| CodeStyle | 2 minor |
| Complexity | 1 medium |
🟢 Metrics 68 complexity · 0 duplication
7e80266 to a870857
stweil left a comment
@copilot, the script currently fails:
src/training/kraken_to_tesseract.py --input ~/.local/share/ocrd-resources/ocrd-kraken-recognize/en_best.mlmodel --output_prefix en
Warning: torch.load uses Python pickle and must only be used with trusted model files.
Traceback (most recent call last):
File "/tesseract/src/training/kraken_to_tesseract.py", line 278, in <module>
raise SystemExit(main())
~~~~^^
File "/tesseract/src/training/kraken_to_tesseract.py", line 216, in main
model_obj = _torch_load_model(args.input)
File "/tesseract/src/training/kraken_to_tesseract.py", line 169, in _torch_load_model
return torch.load(path, **kwargs)
~~~~~~~~~~^^^^^^^^^^^^^^^^
File "/venv3.14/lib/python3.14/site-packages/torch/serialization.py", line 1553, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error:
Unsupported operand 8
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
And it only supports the conversion from kraken to Tesseract, but not vice versa.
Addressed in
The script remains intentionally one-way (Kraken → Tesseract) in this PR’s scope; reverse conversion is not implemented yet.
@copilot, it's still failing:
... Fixed in The converter now handles this case by falling back to Kraken’s loader when both
@copilot, new error:
... Fixed in commit d3ce5f0. Removed all
@copilot, the code still does not work:
Fixed in the latest commit. The root cause was Python's implicit exception chaining ( Two changes fix this:
When kraken is not installed, the output is now:
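The failures in this thread all concern the order in which loaders are tried and how the final error is reported. The following is a minimal sketch of that pattern, not the PR's actual code: `load_with_fallback`, `ModelLoadError`, and the stand-in loaders are hypothetical names, and a real script would pass wrappers around `torch.load` and Kraken's loader instead. Raising the final error after the loop, outside any `except` block, is one way to keep Python from attaching an earlier exception as implicit context.

```python
from typing import Any, Callable, Sequence


class ModelLoadError(RuntimeError):
    """Hypothetical error type: no configured loader could read the file."""


def load_with_fallback(
    path: str,
    loaders: Sequence[tuple[str, Callable[[str], Any]]],
) -> tuple[str, Any]:
    """Try each (name, loader) pair in order and return the first success."""
    failures = []
    for name, loader in loaders:
        try:
            return name, loader(path)
        except Exception as exc:  # broad on purpose: loaders fail in different ways
            failures.append(f"{name}: {type(exc).__name__}: {exc}")
    # We are outside every `except` block here, so this exception carries no
    # implicit __context__ and the user sees one short message, not a chain.
    summary = "; ".join(failures) or "no loaders configured"
    raise ModelLoadError(f"could not load {path!r} ({summary})")


if __name__ == "__main__":
    # Stand-in loaders for illustration only (assumptions, not real APIs).
    def strict_loader(path: str) -> Any:
        raise ValueError("weights-only load failed")

    def permissive_loader(path: str) -> Any:
        return {"source": path}

    print(load_with_fallback("en_best.mlmodel", [
        ("strict", strict_loader),
        ("permissive", permissive_loader),
    ]))
```

Collecting each loader's failure message keeps the diagnostic information that chaining would otherwise have provided, while the user-facing error stays a single line.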
Recent issue discussion asked for a Python path to convert Kraken recognition models into Tesseract artifacts and explicitly called out VGSL compatibility gaps (`Do`, `Bn`, `Gr`, `A<act>`, `Lpa`). This PR adds an initial converter that handles supported mappings and fails/flags clearly for unsupported Kraken extensions.

Converter entrypoint (`src/training/kraken_to_tesseract.py`)
- … `.mlmodel`, extracts VGSL and weights.
- … `torch.load` first (with `weights_only` handling where available), and falls back to Kraken's loader for `.mlmodel` files that are not readable as torch payloads.
- `*.network_spec` (mapped VGSL)
- `*.weights.npz` (tensor weights)
- `*.conversion.json` (mapping summary + unsupported layers)

VGSL compatibility behavior
- … `1`, `C`, `M`, `L`, `F`, `O`, `S`, `P`, `A<n>`, `R`, `T`).
- `Do*` dropped by default (or surfaced as unsupported with `--keep_dropout`)
- `Bn*`, `Gr*`, `A<act>`, `Lpa*` reported as unsupported
- … `--allow_unsupported` to emit partial output for incremental conversion workflows.

Operational hardening
- … `torch.load` pickle semantics), with safer option handling where available.

Documentation (`doc/kraken_conversion.md`)
- … `torch`, `numpy`, optional `kraken` for fallback loading), emitted files, mapping rules, and current limitations.
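The mapping rules in the description can be sketched as a small classifier over the layer tokens of a VGSL spec. This is an illustration under stated assumptions, not the converter's code: `classify_layer` and `map_spec` are hypothetical names, `A<n>` is interpreted as `A` followed by digits, any token starting with a digit is treated as the input spec, and only flat, whitespace-separated specs are handled (no nested parallel groups).

```python
import re

# Single-letter prefixes the PR description lists as mappable.
SUPPORTED_PREFIXES = ("C", "M", "L", "F", "O", "S", "P", "R", "T")
# Kraken extensions the description lists as unsupported.
UNSUPPORTED_PREFIXES = ("Bn", "Gr", "Lpa")


def classify_layer(token: str, keep_dropout: bool = False) -> str:
    """Return 'supported', 'dropped', or 'unsupported' for one layer token."""
    if token[0].isdigit():  # assumption: leading digit means the input spec
        return "supported"
    if token.startswith("Do"):
        return "unsupported" if keep_dropout else "dropped"
    if token.startswith(UNSUPPORTED_PREFIXES):  # checked before plain "L"
        return "unsupported"
    if token.startswith("A"):
        # assumption: A<n> is "A" plus digits; anything else is A<act>
        return "supported" if re.fullmatch(r"A\d+", token) else "unsupported"
    if token.startswith(SUPPORTED_PREFIXES):
        return "supported"
    return "unsupported"


def map_spec(spec: str, keep_dropout: bool = False,
             allow_unsupported: bool = False) -> tuple[str, dict]:
    """Map a flat VGSL spec and report what was kept, dropped, or unsupported."""
    report = {"kept": [], "dropped": [], "unsupported": []}
    for token in spec.strip().strip("[]").split():
        status = classify_layer(token, keep_dropout)
        report["kept" if status == "supported" else status].append(token)
    if report["unsupported"] and not allow_unsupported:
        raise ValueError(
            "unsupported VGSL layers: " + ", ".join(report["unsupported"]))
    return "[" + " ".join(report["kept"]) + "]", report


if __name__ == "__main__":
    # Illustrative spec only; not taken from a real model file.
    print(map_spec("[1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Lbx200 Do O1c103]"))
```

Checking the `Lpa` prefix before the generic `L` prefix matters, since otherwise an `Lpa*` layer would be misclassified as a supported LSTM layer.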
This pull request was created from Copilot chat.