Commit afbda95
feat: custom fallback for language detection (#4238)
Closes #4091
Implements custom fallback for language detection so short text is not
forced to English and callers can control or disable detection.
## Changes:
- language_fallback
  Optional callable used when text is short (<5 words) and ASCII. It
  receives the text and returns either a list of ISO 639-3 codes or None
  to leave the language unspecified. If not provided, short text still
  defaults to ["eng"] (backward compatible).
- detect_languages() / apply_lang_metadata()
  New parameter language_fallback; applied in the short-text path only.
- partition() (auto)
  New parameter language_fallback; passed through to all partitioners via
  the metadata decorator.
- partition_md()
  New parameter languages, so callers can pass languages=[""] to disable
  language detection (aligned with other partitioners).
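The short-text behavior described above can be sketched in isolation. This is a minimal illustration and not the library's implementation: `sketch_detect_languages`, `is_short_ascii`, and the `full_detector` argument are hypothetical names introduced here, and the only rules modeled are the ones stated in this commit message (fewer than 5 words and ASCII, default `["eng"]`, fallback may return codes or None).

```python
from typing import Callable, List, Optional

# Hypothetical alias: a fallback receives the text and returns
# ISO 639-3 codes, or None to leave the language unspecified.
LanguageFallback = Callable[[str], Optional[List[str]]]

SHORT_TEXT_WORD_LIMIT = 5  # "short" means fewer than 5 words


def is_short_ascii(text: str) -> bool:
    """True when the text is ASCII-only and has fewer than 5 words."""
    return text.isascii() and len(text.split()) < SHORT_TEXT_WORD_LIMIT


def sketch_detect_languages(
    text: str,
    full_detector: LanguageFallback,
    language_fallback: Optional[LanguageFallback] = None,
) -> Optional[List[str]]:
    """Illustrative stand-in for the described control flow.

    `full_detector` is an assumed placeholder for whatever detection
    runs on longer or non-ASCII text; the fallback never applies there.
    """
    if is_short_ascii(text):
        if language_fallback is None:
            # No fallback supplied: keep the backward-compatible default.
            return ["eng"]
        # The fallback decides: a list of codes, or None for "unspecified".
        return language_fallback(text)
    return full_detector(text)


def stub_full_detector(text: str) -> Optional[List[str]]:
    # Stub used only to make the sketch runnable.
    return ["fra"]


print(sketch_detect_languages("Hello there", stub_full_detector))
print(sketch_detect_languages("Hello there", stub_full_detector, lambda t: None))
print(sketch_detect_languages("Hello there", stub_full_detector, lambda t: ["deu"]))
```

In this sketch the fallback is consulted only on the short-text branch, which mirrors the "applied in the short-text path only" note for detect_languages() / apply_lang_metadata().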
## Usage:
- Return None for short text: partition(..., language_fallback=lambda text: None)
- Custom short-text language: partition(..., language_fallback=my_detector)
- Disable detection: partition_md(..., languages=[""]) or partition(..., languages=[""])

1 parent 4a77a8c commit afbda95
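As an illustration of the "custom short-text language" usage, a `my_detector` callable might look like the sketch below. The commit only names `my_detector`; its body here, including the `_SHORT_TEXT_HINTS` table, is a hypothetical example. The one contract taken from the commit message is the signature: it receives the text and returns a list of ISO 639-3 codes or None.

```python
from typing import List, Optional

# Hypothetical hint table for illustration only: a few common short
# words mapped to ISO 639-3 codes.
_SHORT_TEXT_HINTS = {
    "hola": "spa",
    "gracias": "spa",
    "bonjour": "fra",
    "merci": "fra",
    "danke": "deu",
}


def my_detector(text: str) -> Optional[List[str]]:
    """Return ISO 639-3 codes for recognized words, or None if none match."""
    # Normalize each word: strip trailing punctuation and lowercase it.
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    # Collect the distinct codes in a stable (sorted) order.
    codes = sorted({_SHORT_TEXT_HINTS[w] for w in words if w in _SHORT_TEXT_HINTS})
    # An empty result means "leave the language unspecified".
    return codes or None


print(my_detector("Hola!"))
print(my_detector("Page 3"))
```

Such a callable would then be passed as language_fallback=my_detector, as in the usage list above. Returning None for unrecognized text avoids labeling headers, page numbers, and similar short fragments as English.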
File tree
11 files changed, +1138 -795 lines changed
- test_unstructured_ingest/expected-structured-output/local-single-file
- test_unstructured/partition
- common
- unstructured
- partition
- common
- html