Skip to content

Modernize datadir #4372

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions src/api/baseapi.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -346,7 +346,7 @@ int TessBaseAPI::Init(const char *data, int data_size, const char *language, Ocr
// Update datapath and language requested for the last valid initialization.
datapath_ = std::move(datapath);
if (datapath_.empty() && !tesseract_->datadir.empty()) {
datapath_ = tesseract_->datadir;
datapath_ = tesseract_->datadir.string();
}

language_ = language;
Expand Down Expand Up @@ -395,7 +395,7 @@ void TessBaseAPI::GetLoadedLanguagesAsVector(std::vector<std::string> *langs) co
void TessBaseAPI::GetAvailableLanguagesAsVector(std::vector<std::string> *langs) const {
langs->clear();
if (tesseract_ != nullptr) {
addAvailableLanguages(tesseract_->datadir, langs);
addAvailableLanguages(tesseract_->datadir.string(), langs);
std::sort(langs->begin(), langs->end());
}
}
Expand Down Expand Up @@ -857,7 +857,7 @@ const char *TessBaseAPI::GetInputName() {
}

const char *TessBaseAPI::GetDatapath() {
return tesseract_->datadir.c_str();
return datapath_.c_str();
}

int TessBaseAPI::GetSourceYResolution() {
Expand Down
2 changes: 1 addition & 1 deletion src/ccmain/paramsd.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -298,7 +298,7 @@ ParamsEditor::ParamsEditor(tesseract::Tesseract *tess, ScrollView *sv) {
SVMenuNode *svMenuRoot = BuildListOfAllLeaves(tess);

std::string paramfile;
paramfile = tess->datadir;
paramfile = tess->datadir.string();
paramfile += VARDIR; // parameters dir
paramfile += "edited"; // actual name

Expand Down
57 changes: 27 additions & 30 deletions src/ccmain/tessedit.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@
#include "params.h"
#include "stopper.h"
#include "tesseractclass.h"
#include "tesserrstream.h" // for tesserr
#include "tessvars.h"
#include "tprintf.h"
#ifndef DISABLED_LEGACY_ENGINE
Expand All @@ -43,24 +44,25 @@ namespace tesseract {
// Read a "config" file containing a set of variable, value pairs.
// Searches the standard places: tessdata/configs, tessdata/tessconfigs
// and also accepts a relative or absolute path name.
void Tesseract::read_config_file(const char *filename, SetParamConstraint constraint) {
std::string path = datadir;
path += "configs/";
path += filename;
FILE *fp;
if ((fp = fopen(path.c_str(), "rb")) != nullptr) {
fclose(fp);
} else {
path = datadir;
path += "tessconfigs/";
path += filename;
if ((fp = fopen(path.c_str(), "rb")) != nullptr) {
fclose(fp);
} else {
path = filename;
}
}
ParamUtils::ReadParamsFile(path.c_str(), constraint, this->params());
void Tesseract::read_config_file(const char *filename,
SetParamConstraint constraint) {
// Construct potential config file paths
std::vector<std::filesystem::path> config_paths = {
datadir / "configs" / filename,
datadir / "tessconfigs" / filename,
std::filesystem::path(filename)};

// Use the first existing file or fallback to the last (filename)
auto config_file = std::find_if(config_paths.begin(), config_paths.end(),
[](const std::filesystem::path &path) {
std::error_code ec;
return std::filesystem::exists(path, ec);
});
const std::filesystem::path &selected_path =
(config_file != config_paths.end()) ? *config_file : config_paths.back();

ParamUtils::ReadParamsFile(selected_path.string().c_str(), constraint,
this->params());
}

// Returns false if a unicharset file for the specified language was not found
Expand All @@ -81,17 +83,14 @@ bool Tesseract::init_tesseract_lang_data(const std::string &arg0,
bool set_only_non_debug_params, TessdataManager *mgr) {
// Set the language data path prefix
lang = !language.empty() ? language : "eng";
language_data_path_prefix = datadir;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

language_data_path_prefix is a class variable which is no longer set in the new code. This causes the CI failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed 2a296fa

language_data_path_prefix += lang;
language_data_path_prefix += ".";
language_data_path_prefix = datadir.string();
std::filesystem::path tessdata_path = datadir / (lang + "." + kTrainedDataSuffix);

// Initialize TessdataManager.
std::string tessdata_path = language_data_path_prefix + kTrainedDataSuffix;
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.c_str())) {
tprintf("Error opening data file %s\n", tessdata_path.c_str());
tprintf(
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.string().c_str())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.string().c_str())) {
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.c_str())) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change would broke msvc build:
error C2664: 'bool tesseract::TessdataManager::Init(const char *)': cannot convert argument 1 from 'const std: :filesystem::path::value_type *' to 'const char *'

tesserr << "Error opening data file " << tessdata_path.string() << '\n' <<
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tesserr << "Error opening data file " << tessdata_path.string() << '\n' <<
tesserr << "Error opening data file " << tessdata_path << '\n' <<

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the difference in path formatting between using tesseract.exe a b -l c with tessdata_path.string() versus tessdata_path:

  • tessdata_path.string() produces a native OS path (e.g., on Windows):
Error opening data file X:\Projects\tesseract\c.traineddata
  • tessdata_path produces an escaped path:
Error opening data file "X:\\Projects\\tesseract\\c.traineddata"

I believe it’s more user-friendly and intuitive to display the native OS path. Is there a specific reason we should use the escaped version instead? If not, I recommend sticking with the native path for error messages.

"Please make sure the TESSDATA_PREFIX environment variable is set"
" to your \"tessdata\" directory.\n");
" to your \"tessdata\" directory.\n";
return false;
}
#ifdef DISABLED_LEGACY_ENGINE
Expand Down Expand Up @@ -184,10 +183,8 @@ bool Tesseract::init_tesseract_lang_data(const std::string &arg0,
}
#ifndef DISABLED_LEGACY_ENGINE
else if (!mgr->GetComponent(TESSDATA_UNICHARSET, &fp) || !unicharset.load_from_file(&fp, false)) {
tprintf(
"Error: Tesseract (legacy) engine requested, but components are "
"not present in %s!!\n",
tessdata_path.c_str());
tesserr << "Error: Tesseract (legacy) engine requested, but components are "
"not present in " << tessdata_path.string() << "!!\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"not present in " << tessdata_path.string() << "!!\n";
"not present in " << tessdata_path << "!!\n";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above: I recommend sticking with the native path for error messages

return false;
}
#endif // ndef DISABLED_LEGACY_ENGINE
Expand Down
113 changes: 58 additions & 55 deletions src/ccutil/ccutil.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -11,11 +11,10 @@
// limitations under the License.

#include "ccutil.h"
#include "tesserrstream.h" // for tesserr
#include "tprintf.h" // for tprintf

#include <cstdlib>
#include <cstring> // for std::strrchrA
#include <filesystem> // for std::filesystem

namespace tesseract {

Expand All @@ -33,68 +32,72 @@ CCUtil::CCUtil()
CCUtil::~CCUtil() = default;

/**
* @brief CCUtil::main_setup - set location of tessdata and name of image
* @brief Finds the path to the tessdata directory.
*
* @param argv0 - paths to the directory with language files and config files.
* An actual value of argv0 is used if not nullptr, otherwise TESSDATA_PREFIX is
* used if not nullptr, next try to use compiled in -DTESSDATA_PREFIX. If
* previous is not successful - use current directory.
* @param basename - name of image
* This function determines the location of the tessdata directory based on the
* following order of precedence:
* 1. If `argv0` is provided, use it.
* 2. If `TESSDATA_PREFIX` environment variable is set and the path exists, use
* it.
* 3. On Windows, check for a "tessdata" directory in the executable's directory
* and use it.
* 4. If `TESSDATA_PREFIX` is defined at compile time, use it.
* 5. Otherwise, use the current working directory.
*
* @param argv0 argument to be considered as the data directory path.
* @return The path to the tessdata directory or current directory.
*/
void CCUtil::main_setup(const std::string &argv0, const std::string &basename) {
imagebasename = basename; /**< name of image */

const char *tessdata_prefix = getenv("TESSDATA_PREFIX");

// Ignore TESSDATA_PREFIX if there is no matching filesystem entry.
if (tessdata_prefix != nullptr && !std::filesystem::exists(tessdata_prefix)) {
tprintf("Warning: TESSDATA_PREFIX %s does not exist, ignore it\n", tessdata_prefix);
tessdata_prefix = nullptr;
}

static std::filesystem::path find_data_path(const std::string &argv0) {
// If argv0 is set, always use it even if it is not a valid directory
if (!argv0.empty()) {
/* Use tessdata prefix from the command line. */
datadir = argv0;
} else if (tessdata_prefix) {
/* Use tessdata prefix from the environment. */
datadir = tessdata_prefix;
#if defined(_WIN32)
} else if (datadir.empty() || !std::filesystem::exists(datadir)) {
/* Look for tessdata in directory of executable. */
char path[_MAX_PATH];
DWORD length = GetModuleFileName(nullptr, path, sizeof(path));
if (length > 0 && length < sizeof(path)) {
char *separator = std::strrchr(path, '\\');
if (separator != nullptr) {
*separator = '\0';
std::string subdir = path;
subdir += "/tessdata";
if (std::filesystem::exists(subdir)) {
datadir = subdir;
}
}
std::filesystem::path path(argv0);
if (!std::filesystem::is_directory(path)) {
tesserr << "Warning (tessdata): '" << argv0 << "' is not a valid directory.\n";
}
#endif /* _WIN32 */
return path;
}

// datadir may still be empty:
if (datadir.empty()) {
#if defined(TESSDATA_PREFIX)
// Use tessdata prefix which was compiled in.
datadir = TESSDATA_PREFIX "/tessdata/";
// Note that some software (for example conda) patches TESSDATA_PREFIX
// in the binary, so it might be shorter. Recalculate its length.
datadir.resize(std::strlen(datadir.c_str()));
#else
datadir = "./";
#endif /* TESSDATA_PREFIX */
// Check environment variable if argv0 is not specified
if (const char *tessdata_prefix = std::getenv("TESSDATA_PREFIX")) {
std::filesystem::path path(tessdata_prefix);
if (std::filesystem::exists(path)) {
return path;
} else {
tprintf("Warning: TESSDATA_PREFIX %s does not exist, ignoring.\n",
tessdata_prefix);
}
}

// check for missing directory separator
const char lastchar = datadir.back();
if (lastchar != '/' && lastchar != '\\') {
datadir += '/';
#ifdef _WIN32
// Windows-specific: check for 'tessdata' not existing in the executable
// directory
wchar_t path[MAX_PATH];
if (DWORD length = GetModuleFileNameW(nullptr, path, MAX_PATH);
length > 0 && length < MAX_PATH) {
std::filesystem::path exe_path(path);
auto tessdata_subdir = exe_path.parent_path() / "tessdata";
if (std::filesystem::exists(tessdata_subdir)) {
return tessdata_subdir;
}
}
#endif

// Fallback to compile-time or current directory
#ifdef TESSDATA_PREFIX
return std::filesystem::path(TESSDATA_PREFIX) / "tessdata";
#else
return std::filesystem::current_path();
#endif
}


/**
* @brief CCUtil::main_setup - set location of tessdata and name of image
*
* @param argv0 - paths to the directory with language files and config files.
*/
void CCUtil::main_setup(const std::string &argv0, const std::string &basename) {
imagebasename = basename; /**< name of image */
datadir = find_data_path(argv0);
Copy link
Member

@stweil stweil May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old datadir was always terminated with a '/' which is missing for the new datadir. This is an API change, but it might also be it is not the reason for the failing CI tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
datadir = find_data_path(argv0);
datadir = find_data_path(argv0) / "";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, datadir was a string, and appending a / was used as a workaround to handle cases where users omitted the directory separator.
However, with datadir now being a std::filesystem::path, this manual string manipulation is no longer necessary. The path class automatically normalizes path separators and handles path operations (like concatenation) correctly for the underlying OS.
As a result, we can safely remove the workaround and rely on the standard library to manage path separators and concatenation.
Or do I miss something?

}
} // namespace tesseract
7 changes: 4 additions & 3 deletions src/ccutil/ccutil.h
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@
#ifndef TESSERACT_CCUTIL_CCUTIL_H_
#define TESSERACT_CCUTIL_CCUTIL_H_

#include <filesystem> // for std::filesystem

#ifndef _WIN32
# include <pthread.h>
# include <semaphore.h>
Expand Down Expand Up @@ -53,9 +55,8 @@ class TESS_API CCUtil {
ParamsVectors *params() {
return &params_;
}

std::string datadir; // dir for data files
std::string imagebasename; // name of image
std::filesystem::path datadir; // dir for data files
std::string imagebasename; // name of image
std::string lang;
std::string language_data_path_prefix;
UNICHARSET unicharset;
Expand Down
Loading