feat: add "auto" language for TesseractOcr #759

Merged

Contributor

@pavel-denisov-fraunhofer commented Jan 16, 2025

Add a language-agnostic OCR option to the TesseractOcr module. It is invoked when the language option is set to ['auto']. For more context, see the discussion in #640.
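A minimal sketch of how the ['auto'] setting could be dispatched, for illustration only. The helper name `resolve_languages` and the `detect_script` callback are hypothetical stand-ins for the script detection step, and the `script/` model naming is an assumption; none of this is taken from the PR's actual code.

```python
from typing import Callable, List

AUTO = "auto"  # sentinel value named in the PR description


def resolve_languages(lang: List[str], detect_script: Callable[[], str]) -> List[str]:
    """Return the language list to hand to the OCR engine.

    Hypothetical helper: when lang is exactly ["auto"], ask detect_script()
    for the page's script and select a script model instead of the
    fixed languages. Any other value is passed through unchanged.
    """
    if lang == [AUTO]:
        return [f"script/{detect_script()}"]
    return list(lang)


# Usage: the lambda stands in for a real script detector.
auto_choice = resolve_languages(["auto"], lambda: "Latin")
fixed_choice = resolve_languages(["eng", "deu"], lambda: "Latin")
```

With a fixed language list the detector is never called, so existing configurations would behave as before under this sketch.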

Please let me know what you think.

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

mergify bot commented Jan 16, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
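The title rule above can be checked locally with a short script. This is a sketch that assumes the `~=` operator means a regular-expression search against the title; the function name `is_conventional` is made up for the example.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


# Usage: check this PR's title and a title without a type prefix.
this_pr_ok = is_conventional('feat: add "auto" language for TesseractOcr')
plain_title_ok = is_conventional("Add auto language for TesseractOcr")
```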

@nikos-livathinos self-requested a review January 17, 2025 13:39
@nikos-livathinos
Collaborator

We can merge this PR and implement the optimized version in a follow-up PR.

@pavel-denisov-fraunhofer
Contributor Author

> We can merge this PR and implement the optimized version in a follow-up PR.

Sorry for the delay! I was going to check it in the next few days, but I can make a follow-up PR too.

The problem with CI is that the script OCR models are not installed: https://github.com/DS4SD/docling/actions/runs/12806245234/job/35994648426?pr=759#step:8:155

@pavel-denisov-fraunhofer
Contributor Author

The Ubuntu package tesseract-ocr-script-latn puts the model in the data directory itself (without a script subdirectory): https://packages.ubuntu.com/noble/all/tesseract-ocr-script-latn/filelist

I'm going to add a check for whether the script/ prefix is needed when loading the script model.
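One way such a check could look, as a rough sketch rather than the code that was merged: probe the Tesseract data directory for the model file under both layouts. The helper name `script_model_name` and its arguments are hypothetical, and the `.traineddata` file extension is assumed.

```python
import tempfile
from pathlib import Path


def script_model_name(tessdata_dir: Path, script: str) -> str:
    """Return the name to load a script model under (hypothetical helper).

    Prefers "script/<Name>" when that subdirectory layout exists, and falls
    back to the bare name when the .traineddata file sits in the data
    directory itself.
    """
    if (tessdata_dir / "script" / f"{script}.traineddata").is_file():
        return f"script/{script}"
    if (tessdata_dir / f"{script}.traineddata").is_file():
        return script
    raise FileNotFoundError(f"no {script}.traineddata under {tessdata_dir}")


# Usage: simulate the flat layout described above in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "Latin.traineddata").touch()
    flat_layout_name = script_model_name(data_dir, "Latin")
```

Checking the file system once at load time keeps the rest of the OCR code independent of how a given distribution packages the models.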

Collaborator

@nikos-livathinos left a comment

Looks good to me

@nikos-livathinos merged commit 8543c22 into DS4SD:main Jan 23, 2025
7 checks passed
@pavel-denisov-fraunhofer deleted the ocr-auto-language branch January 23, 2025 13:08

2 participants