Fix crash when LSTM is missing in disabled-legacy build (#4448) #4563

Open
gaurav0107 wants to merge 2 commits into tesseract-ocr:main from gaurav0107:fix/4448-crash-upon-processing-input-file-with-di

Conversation

@gaurav0107

Summary

Fixes #4448. When tesseract is compiled with --disable-legacy and the
user supplies a .traineddata file that does not contain an LSTM
component (e.g. the gsta.traineddata attached to the issue), tesseract
crashes with a segmentation fault during pass-1 post-processing.

Root cause

Tesseract::init_tesseract_lang_data (src/ccmain/tessedit.cpp:171–178)
checks whether an LSTM component is available. If not, it prints
"Error: LSTM requested, but not present!! Loading tesseract." and
downgrades tessedit_ocr_engine_mode to OEM_TESSERACT_ONLY. Under
DISABLED_LEGACY_ENGINE the legacy engine code is compiled out, so this
"fallback" never actually loads a recognizer — but the function still
returns true. Recognition runs with no engine, every word ends up with
best_choice == nullptr, and recog_all_words segfaults at
control.cpp:356 when it dereferences best_choice->permuter().
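
The failure mode can be modeled in isolation. The sketch below is not Tesseract code: `ToyChoice`, `ToyWord`, `ToyEngine`, and the `toy_*` functions are hypothetical stand-ins. It assumes only what the description above states, namely that init reports success although no recognizer exists, so every word keeps a null `best_choice`.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-ins; none of these names exist in Tesseract.
struct ToyChoice {
  int permuter() const { return 0; }
};

struct ToyWord {
  std::unique_ptr<ToyChoice> best_choice;  // stays null if no engine ran
};

struct ToyEngine {
  bool has_lstm = false;    // traineddata contains an LSTM component
  bool has_legacy = false;  // false models a --disable-legacy build
};

// Models the buggy init: it reports success even when no recognizer is usable.
bool toy_init_buggy(const ToyEngine &) { return true; }

// Models recognition: only a usable recognizer fills in best_choice.
void toy_recognize(const ToyEngine &engine, std::vector<ToyWord> &words) {
  if (!engine.has_lstm && !engine.has_legacy) {
    return;  // no engine available, so nothing is recognized
  }
  for (auto &word : words) {
    assert(word.best_choice == nullptr);  // each word is recognized once
    word.best_choice = std::make_unique<ToyChoice>();
  }
}

// Counts the words on which best_choice->permuter() would dereference null.
int toy_count_null_choices(const std::vector<ToyWord> &words) {
  int count = 0;
  for (const auto &word : words) {
    if (!word.best_choice) {
      ++count;
    }
  }
  return count;
}
```

With `has_lstm` and `has_legacy` both false, `toy_init_buggy` still returns true and every word is left with a null `best_choice`, which is the state the post-processing step then trips over.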

Reproduction (from issue #4448)

CXXFLAGS='-g -O0 -fno-omit-frame-pointer' ./configure --prefix=$PWD/installed \
    --enable-debug --disable-legacy && make install
TESSDATA_PREFIX=$PWD ./installed/bin/tesseract -l gsta lorem-ipsum.png out.txt
# Segmentation fault

The valgrind trace pinpoints the NULL deref at
tesseract::WERD_CHOICE::permuter() const (ratngs.h:332) reached from
recog_all_words (control.cpp:356).

Fix

Inside the existing else branch (LSTM unavailable), split behavior by
build mode:

  • Under DISABLED_LEGACY_ENGINE: emit a clearer error message that
    names the tessdata path, and return false. The caller
    (init_tesseract) already handles a per-language load failure with
    "Failed loading language '%s'", and if no language loads at all the
    existing "Tesseract couldn't load any languages!" path triggers, so
    multi-language inputs (-l eng+gsta) degrade gracefully.
  • Default build (legacy enabled): the existing fallback to
    OEM_TESSERACT_ONLY is unchanged.

The diff is 8 added lines, 0 removed, confined to one TU.
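
The control flow described above can be sketched as a small standalone model. This is not the actual diff: `ToyLang`, `ToyMode`, `toy_init_lang_data`, and `toy_init_all` are hypothetical names, the build mode is modeled as a runtime flag rather than an `#ifdef` so both paths can be exercised, and the wording of the new error message is an assumption. Only the two caller messages are quoted from the description.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model; these names are illustrative, not Tesseract APIs.
enum class ToyMode { kLstm, kLegacyOnly };

struct ToyLang {
  std::string name;
  bool has_lstm;
};

// Models the per-language init decision.
// legacy_compiled_in == false stands in for a DISABLED_LEGACY_ENGINE build.
bool toy_init_lang_data(const ToyLang &lang, bool legacy_compiled_in,
                        ToyMode &mode, std::vector<std::string> &log) {
  assert(!lang.name.empty());
  if (lang.has_lstm) {
    mode = ToyMode::kLstm;
    return true;
  }
  if (!legacy_compiled_in) {
    // Assumed wording; the PR only says the message names the tessdata path.
    log.push_back("Error: no LSTM component in '" + lang.name +
                  "' and the legacy engine is disabled");
    return false;  // refuse to report success when no engine can run
  }
  mode = ToyMode::kLegacyOnly;  // existing fallback, unchanged
  return true;
}

// Models the caller: a failed language is logged and skipped.
// Returns the number of languages that loaded.
int toy_init_all(const std::vector<ToyLang> &langs, bool legacy_compiled_in,
                 std::vector<std::string> &log) {
  int loaded = 0;
  for (const ToyLang &lang : langs) {
    ToyMode mode = ToyMode::kLstm;
    if (toy_init_lang_data(lang, legacy_compiled_in, mode, log)) {
      ++loaded;
    } else {
      log.push_back("Failed loading language '" + lang.name + "'");
    }
  }
  if (loaded == 0) {
    log.push_back("Tesseract couldn't load any languages!");
  }
  return loaded;
}
```

In this model, `eng+gsta` with the legacy engine compiled out loads one language and logs two lines for `gsta`, while `gsta` alone loads nothing and additionally reaches the "couldn't load any languages" path.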

Relationship to PR #4449

@sebras opened #4449 with a one-line NULL guard at the crash site in
control.cpp. This PR addresses the deeper fix that @amitdo described
in the #4448 thread: refuse to claim a successful init when no engine
can run. The two changes are orthogonal — #4449 is defense in depth at
the consumer side, this PR fixes the producer side. Either can land
first; landing both is fine.
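
To illustrate the consumer-side half of that split, here is a minimal sketch of a NULL guard at a dereference site. It is not the change in #4449, whose exact form is not shown here; `ToyChoice`, `ToyWord`, `toy_permuter_unchecked`, and `toy_permuter_or` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical stand-ins; not Tesseract types.
struct ToyChoice {
  int value = 0;
  int permuter() const { return value; }
};

struct ToyWord {
  const ToyChoice *best_choice = nullptr;
};

// Unguarded access: valid only if the producer guarantees a non-null choice.
int toy_permuter_unchecked(const ToyWord &word) {
  assert(word.best_choice != nullptr);
  return word.best_choice->permuter();
}

// Guarded access: tolerates a missing choice by returning a fallback value.
int toy_permuter_or(const ToyWord &word, int fallback) {
  return word.best_choice ? word.best_choice->permuter() : fallback;
}
```

The guard keeps the consumer from crashing, but it does not make recognition produce results; that is why the producer-side refusal to initialize is still needed.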

Test plan

  • git diff review — change is minimal and confined to one
    #ifdef-guarded branch.
  • CI: unittest (default build) — behavior unchanged in the
    #ifndef DISABLED_LEGACY_ENGINE branch; existing tests pass.
  • CI: unittest-disablelegacy (autotools, --disable-legacy) —
    exercises the new early-return path; existing tests use tessdata
    with LSTM, so they do not trip it.
  • CI: cmake, cifuzz, CodeQL, Codacy — single conditional
    return; no new flagged patterns expected.
  • Manual: with the reporter's gsta.traineddata, the segfault is
    replaced by a clean error message and a non-zero exit code.

…r#4448)

When tesseract is built with --disable-legacy and the requested
traineddata contains no LSTM component, init_tesseract_lang_data
previously printed a warning and downgraded the engine mode to
OEM_TESSERACT_ONLY before returning true. The legacy engine code is
compiled out in this build, so no recognizer is actually loaded.
Recognition then runs with no engine, leaving best_choice NULL on
every word, and pass-1 post-processing in recog_all_words segfaults
when it dereferences best_choice (control.cpp:356).

Under DISABLED_LEGACY_ENGINE there is no legacy fallback to take, so
return false from init_tesseract_lang_data with a clearer error
message that names the offending tessdata path. The caller in
init_tesseract already handles a per-language load failure by logging
"Failed loading language '%s'" and, if no language loads at all,
"Tesseract couldn't load any languages!", so multi-language inputs
(-l eng+gsta) degrade gracefully.

The default build (legacy enabled) is unchanged: the existing fallback
to OEM_TESSERACT_ONLY remains in place.
@codacy-production

codacy-production Bot commented Jun 6, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Metrics 0 complexity · 0 duplication

Metric Results
Complexity 0
Duplication 0


@gaurav0107 gaurav0107 marked this pull request as ready for review June 6, 2026 13:38
Brief comment explaining the contract with init_tesseract: a per-language
load failure is logged by the caller via "Failed loading language '%s'",
and if no language loads at all the existing
"Tesseract couldn't load any languages!" path triggers. No behavior
change.
@sebras

sebras commented Jun 6, 2026

@gaurav0107 thanks for fixing this! I'm not very familiar with the Tesseract code, so I didn't understand how to take amitdo's recommendation into account. :)

@amitdo

amitdo commented Jun 8, 2026

Collaborator

Hi,

Did you check that it works as intended? You should also check the multi-tessdata scenario, where one traineddata has LSTM and the other does not.

@gaurav0107

Author

@amitdo Thanks for the review. Yes, I tested the single-language case in a DISABLED_LEGACY_ENGINE build with a tessdata file that has no LSTM model. Previously this crashed with a segmentation fault (the NULL best_choice dereference in recog_all_words); with this change it logs the new error message and init_tesseract_lang_data returns false, so the language is skipped cleanly.

You are right that I have not yet exercised the multi-tessdata scenario (one language with LSTM, one without) in the same build. I will run that case and report back before this is merged; the expectation is that the LSTM-capable language continues to load and only the no-LSTM one is skipped.

Development

Successfully merging this pull request may close these issues.

Crash upon processing input file with disabled legacy engine and custom traineddata

3 participants