Merge pull request #1469 from opendatalab/dev
fix(language): enhance language detection and text processing
myhloli authored Jan 9, 2025
2 parents 1b654fc + 0ebbfa5 commit e778264
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions magic_pdf/libs/language.py
@@ -16,11 +16,14 @@ def detect_lang(text: str) -> str:

    if len(text) == 0:
        return ""

    text = text.replace("\n", "")
    try:
        lang_upper = detect_language(text)
    except:
        html_no_ctrl_chars = ''.join([l for l in text if unicodedata.category(l)[0] not in ['C', ]])
        lang_upper = detect_language(html_no_ctrl_chars)

    try:
        lang = lang_upper.lower()
    except:
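
The hunk above strips newlines before the first detection attempt and, if that attempt raises, retries on a copy of the text with control characters removed. A minimal, self-contained sketch of that flow follows. It is not the project's code: `detect_lang_sketch`, `strip_control_chars`, and `toy_detector` are hypothetical names, and the detector is passed in as a parameter because the real `detect_language` import is not visible in this hunk.

```python
import unicodedata
from typing import Callable


def strip_control_chars(text: str) -> str:
    """Drop every character whose Unicode general category starts with 'C'."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")


def detect_lang_sketch(text: str, detector: Callable[[str], str]) -> str:
    """Mirror the hunk's flow with an injected (assumed) detector callable."""
    if len(text) == 0:
        return ""

    # Step added by the commit: remove newlines before the first attempt.
    text = text.replace("\n", "")
    try:
        lang_upper = detector(text)
    except Exception:
        # Fallback already present in the file: retry without control characters.
        lang_upper = detector(strip_control_chars(text))

    return lang_upper.lower()


def toy_detector(text: str) -> str:
    """Hypothetical stand-in: rejects control characters, otherwise returns 'EN'."""
    if any(unicodedata.category(ch)[0] == "C" for ch in text):
        raise ValueError("control character in input")
    return "EN"


if __name__ == "__main__":
    print(detect_lang_sketch("hello\nworld", toy_detector))
    print(detect_lang_sketch("he\x00llo", toy_detector))
```

Note that "\n" itself has category `Cc`, so the fallback path would have removed it anyway; stripping it up front means the first attempt already sees text without line breaks and the fallback is only needed for other control characters.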