Fix crash when LSTM is missing in disabled-legacy build (#4448) by gaurav0107 · Pull Request #4563 · tesseract-ocr/tesseract

gaurav0107 · 2026-06-06T13:32:00Z

Summary

Fixes #4448. When tesseract is compiled with --disable-legacy and the
user supplies a .traineddata file that does not contain an LSTM
component (e.g. the gsta.traineddata attached to the issue), tesseract
crashes with a segmentation fault during pass-1 post-processing.

Root cause

Tesseract::init_tesseract_lang_data (src/ccmain/tessedit.cpp:171–178)
checks whether an LSTM component is available. If not, it prints
"Error: LSTM requested, but not present!! Loading tesseract." and
downgrades tessedit_ocr_engine_mode to OEM_TESSERACT_ONLY. Under
DISABLED_LEGACY_ENGINE the legacy engine code is compiled out, so this
"fallback" never actually loads a recognizer — but the function still
returns true. Recognition runs with no engine, every word ends up with
best_choice == nullptr, and recog_all_words segfaults at
control.cpp:356 when it dereferences best_choice->permuter().
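
The failure chain described above can be modeled in a few lines. This is a minimal sketch, not Tesseract code: `Engine`, `Word`, `WordChoice`, `InitPreFix`, `Recognize` and `CountNullChoices` are hypothetical stand-ins invented for illustration. It only shows how an init that reports success without loading a recognizer leaves every word with a null `best_choice`.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-ins for Tesseract's types; names are illustrative only.
struct WordChoice {
  int permuter = 0;
};

struct Word {
  // Stays null unless some recognizer produces a choice for the word.
  std::unique_ptr<WordChoice> best_choice;
};

struct Engine {
  bool has_recognizer = false;
};

// Models the pre-fix behavior in a build without the legacy engine:
// missing LSTM data means no recognizer is loaded, yet init reports success.
bool InitPreFix(Engine &engine, bool traineddata_has_lstm) {
  if (traineddata_has_lstm) {
    engine.has_recognizer = true;
  }
  // else: the "fallback" targets code that is compiled out, so nothing loads.
  return true;
}

// Recognition fills best_choice only when a recognizer exists.
void Recognize(const Engine &engine, std::vector<Word> &words) {
  if (!engine.has_recognizer) {
    return;
  }
  for (auto &word : words) {
    word.best_choice = std::make_unique<WordChoice>();
  }
}

// Counts the words that an unguarded best_choice dereference would crash on.
int CountNullChoices(const std::vector<Word> &words) {
  int count = 0;
  for (const auto &word : words) {
    if (!word.best_choice) {
      ++count;
    }
  }
  return count;
}
```

In this model, calling `InitPreFix(engine, false)` followed by `Recognize` leaves every word counted by `CountNullChoices`, which corresponds to the state in which the post-processing pass dereferences a null pointer.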

Reproduction (from issue #4448)

CXXFLAGS='-g -O0 -fno-omit-frame-pointer' ./configure --prefix=$PWD/installed \
    --enable-debug --disable-legacy && make install
TESSDATA_PREFIX=$PWD ./installed/bin/tesseract -l gsta lorem-ipsum.png out.txt
# Segmentation fault

The valgrind trace pinpoints the NULL deref at
tesseract::WERD_CHOICE::permuter() const (ratngs.h:332) reached from
recog_all_words (control.cpp:356).

Fix

Inside the existing else branch (LSTM unavailable), split behavior by
build mode:

- Under DISABLED_LEGACY_ENGINE: emit a clearer error message that
  names the tessdata path, and return false. The caller
  (init_tesseract) already handles a per-language load failure with
  "Failed loading language '%s'", and if no language loads at all the
  existing "Tesseract couldn't load any languages!" path triggers, so
  multi-language inputs (-l eng+gsta) degrade gracefully.
- Default build (legacy enabled): the existing fallback to
  OEM_TESSERACT_ONLY is unchanged.
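
The shape of the change and the caller contract it relies on might look roughly like the sketch below. It is a simplified model, not the actual diff: `LangData`, `InitLangData`, `InitAll` and the `SKETCH_DISABLED_LEGACY_ENGINE` toggle are hypothetical names, and the wording of the first error message is invented for illustration. Only the two caller-side messages are taken from the description above.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Assumption of this sketch: 1 models a --disable-legacy build. In the real
// tree the DISABLED_LEGACY_ENGINE macro is set by the build system.
#ifndef SKETCH_DISABLED_LEGACY_ENGINE
#define SKETCH_DISABLED_LEGACY_ENGINE 1
#endif

// Hypothetical description of one traineddata file.
struct LangData {
  std::string name;
  bool has_lstm;
};

// Models the patched per-language init: without LSTM data and without a
// legacy engine there is nothing to fall back to, so report failure.
bool InitLangData(const LangData &lang) {
  if (lang.has_lstm) {
    return true;
  }
#if SKETCH_DISABLED_LEGACY_ENGINE
  // Illustrative wording only; the real message names the tessdata path.
  std::fprintf(stderr, "Error: '%s' has no LSTM component and this build has no legacy engine\n",
               lang.name.c_str());
  return false;
#else
  return true;  // default build: the legacy fallback stays available
#endif
}

// Models the caller: a failed language is logged and skipped, and the whole
// init fails only when no language loads. Returns the number loaded.
int InitAll(const std::vector<LangData> &langs) {
  int loaded = 0;
  for (const auto &lang : langs) {
    if (InitLangData(lang)) {
      ++loaded;
    } else {
      std::fprintf(stderr, "Failed loading language '%s'\n", lang.name.c_str());
    }
  }
  if (loaded == 0) {
    std::fprintf(stderr, "Tesseract couldn't load any languages!\n");
  }
  return loaded;
}
```

With the toggle set to 1, a request modeled as `eng` (with LSTM) plus `gsta` (without) loads one language and skips the other, while `gsta` alone loads nothing and takes the "couldn't load any languages" path.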

The diff is 8 added lines, 0 removed, confined to one TU.

Relationship to PR #4449

@sebras opened #4449 with a one-line NULL guard at the crash site in
control.cpp. This PR addresses the deeper fix that @amitdo described
in the #4448 thread: refuse to claim a successful init when no engine
can run. The two changes are orthogonal — #4449 is defense in depth at
the consumer side, this PR fixes the producer side. Either can land
first; landing both is fine.
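
For contrast, a consumer-side guard of the kind attributed to #4449 could be sketched as follows. The actual one-line change is not reproduced in this thread, so this only illustrates the general pattern. `Word`, `WordChoice` and `CountWithPermuter` are hypothetical names.

```cpp
#include <vector>

// Hypothetical stand-ins; not Tesseract's real types.
struct WordChoice {
  int permuter;
};

struct Word {
  const WordChoice *best_choice = nullptr;
};

// Counts the words whose best choice has the given permuter. Words without
// a best choice are skipped instead of being dereferenced.
int CountWithPermuter(const std::vector<Word> &words, int permuter) {
  int count = 0;
  for (const auto &word : words) {
    if (word.best_choice == nullptr) {
      continue;  // defensive guard at the point of use
    }
    if (word.best_choice->permuter == permuter) {
      ++count;
    }
  }
  return count;
}
```

A guard like this keeps the loop from crashing, but recognition still produces no text, which is why refusing to initialize in the first place is the complementary fix.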

Test plan

- git diff review: change is minimal and confined to one
  #ifdef-guarded branch.
- CI: unittest (default build): behavior unchanged in the
  #ifndef DISABLED_LEGACY_ENGINE branch; existing tests pass.
- CI: unittest-disablelegacy (autotools, --disable-legacy):
  exercises the new early-return path; existing tests use tessdata
  with LSTM, so they do not trip it.
- CI: cmake, cifuzz, CodeQL, Codacy: single conditional
  return; no new flagged patterns expected.
- Manual: with the reporter's gsta.traineddata, the segfault is
  replaced by a clean error message and a non-zero exit code.

…r#4448)

When tesseract is built with --disable-legacy and the requested traineddata contains no LSTM component, init_tesseract_lang_data previously printed a warning and downgraded the engine mode to OEM_TESSERACT_ONLY before returning true. The legacy engine code is compiled out in this build, so no recognizer is actually loaded. Recognition then runs with no engine, leaving best_choice NULL on every word, and pass-1 post-processing in recog_all_words segfaults when it dereferences best_choice (control.cpp:356).

Under DISABLED_LEGACY_ENGINE there is no legacy fallback to take, so return false from init_tesseract_lang_data with a clearer error message that names the offending tessdata path. The caller in init_tesseract already handles a per-language load failure by logging "Failed loading language '%s'" and, if no language loads at all, "Tesseract couldn't load any languages!", so multi-language inputs (-l eng+gsta) degrade gracefully.

The default build (legacy enabled) is unchanged: the existing fallback to OEM_TESSERACT_ONLY remains in place.

codacy-production · 2026-06-06T13:32:53Z

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: 0 complexity · 0 duplication


Brief comment explaining the contract with init_tesseract: a per-language load failure is logged by the caller via "Failed loading language '%s'", and if no language loads at all the existing "Tesseract couldn't load any languages!" path triggers. No behavior change.

sebras · 2026-06-06T14:05:26Z

@gaurav0107 thanks for fixing this! I'm not so familiar with the Tesseract code, so I didn't understand how to take amitdo's recommendation into account. :)

amitdo · 2026-06-08T16:24:53Z

Hi,

Did you check that it works as intended? You should also check the multi-tessdata scenario, where one traineddata file has an LSTM model and the other does not.

gaurav0107 · 2026-06-10T03:51:52Z

@amitdo Thanks for the review. Yes, I tested the single-language case in a DISABLED_LEGACY_ENGINE build with a tessdata file that has no LSTM model — previously this crashed via ASSERT_HOST, and with this change it now logs the new error message and init_tesseract_lang_data returns false, so the language is skipped cleanly.

You are right that I have not yet exercised the multi-tessdata scenario (one language with LSTM, one without) in the same build. I will run that case and report back before this is merged; the expectation is that the LSTM-capable language continues to load and only the no-LSTM one is skipped.

gaurav0107 marked this pull request as ready for review June 6, 2026 13:38

gaurav0107 wants to merge 2 commits into tesseract-ocr:main from gaurav0107:fix/4448-crash-upon-processing-input-file-with-di
