Fix crash when LSTM is missing in disabled-legacy build (#4448)#4563
Fix crash when LSTM is missing in disabled-legacy build (#4448)#4563gaurav0107 wants to merge 2 commits into
Conversation
…r#4448) When tesseract is built with --disable-legacy and the requested traineddata contains no LSTM component, init_tesseract_lang_data previously printed a warning and downgraded the engine mode to OEM_TESSERACT_ONLY before returning true. The legacy engine code is compiled out in this build, so no recognizer is actually loaded. Recognition then runs with no engine, leaving best_choice NULL on every word, and pass-1 post-processing in recog_all_words segfaults when it dereferences best_choice (control.cpp:356). Under DISABLED_LEGACY_ENGINE there is no legacy fallback to take, so return false from init_tesseract_lang_data with a clearer error message that names the offending tessdata path. The caller in init_tesseract already handles a per-language load failure by logging "Failed loading language '%s'" and, if no language loads at all, "Tesseract couldn't load any languages!", so multi-language inputs (-l eng+gsta) degrade gracefully. The default build (legacy enabled) is unchanged: the existing fallback to OEM_TESSERACT_ONLY remains in place.
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
| Duplication | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Brief comment explaining the contract with init_tesseract: a per-language load failure is logged by the caller via "Failed loading language '%s'", and if no language loads at all the existing "Tesseract couldn't load any languages!" path triggers. No behavior change.
|
@gaurav0107 thanks for fixing this! I'm not so familiar with Tesseract code so I didn't understand how to take amido's recommendation info account. :) |
|
Hi, Did you check it works as intended? You also should check the multi tessdata scenario, where one of them has lstm and the other one does not. |
|
@amitdo Thanks for the review. Yes, I tested the single-language case in a You are right that I have not yet exercised the multi-tessdata scenario (one language with LSTM, one without) in the same build. I will run that case and report back before this is merged; the expectation is that the LSTM-capable language continues to load and only the no-LSTM one is skipped. |
Summary
Fixes #4448. When tesseract is compiled with
--disable-legacyand theuser supplies a
.traineddatafile that does not contain an LSTMcomponent (e.g. the
gsta.traineddataattached to the issue), tesseractcrashes with a segmentation fault during pass-1 post-processing.
Root cause
Tesseract::init_tesseract_lang_data(src/ccmain/tessedit.cpp:171–178)checks whether an LSTM component is available. If not, it prints
"Error: LSTM requested, but not present!! Loading tesseract."anddowngrades
tessedit_ocr_engine_modetoOEM_TESSERACT_ONLY. UnderDISABLED_LEGACY_ENGINEthe legacy engine code is compiled out, so this"fallback" never actually loads a recognizer — but the function still
returns
true. Recognition runs with no engine, every word ends up withbest_choice == nullptr, andrecog_all_wordssegfaults atcontrol.cpp:356when it dereferencesbest_choice->permuter().Reproduction (from issue #4448)
The valgrind trace pinpoints the NULL deref at
tesseract::WERD_CHOICE::permuter() const (ratngs.h:332)reached fromrecog_all_words (control.cpp:356).Fix
Inside the existing
elsebranch (LSTM unavailable), split behavior bybuild mode:
DISABLED_LEGACY_ENGINE: emit a clearer error message thatnames the tessdata path, and
return false. The caller(
init_tesseract) already handles a per-language load failure with"Failed loading language '%s'", and if no language loads at all theexisting
"Tesseract couldn't load any languages!"path triggers, somulti-language inputs (
-l eng+gsta) degrade gracefully.OEM_TESSERACT_ONLYis unchanged.The diff is 8 added lines, 0 removed, confined to one TU.
Relationship to PR #4449
@sebras opened #4449 with a one-line NULL guard at the crash site in
control.cpp. This PR addresses the deeper fix that @amitdo describedin the #4448 thread: refuse to claim a successful init when no engine
can run. The two changes are orthogonal — #4449 is defense in depth at
the consumer side, this PR fixes the producer side. Either can land
first; landing both is fine.
Test plan
git diffreview — change is minimal and confined to one#ifdef-guarded branch.unittest(default build) — behavior unchanged in the#ifndef DISABLED_LEGACY_ENGINEbranch; existing tests pass.unittest-disablelegacy(autotools,--disable-legacy) —exercises the new early-return path; existing tests use tessdata
with LSTM, so they do not trip it.
cmake,cifuzz,CodeQL,Codacy— single conditionalreturn; no new flagged patterns expected.
gsta.traineddata, the segfault isreplaced by a clean error message and a non-zero exit code.