Commit 0841f61

PDFBOX-5747: Surrogate pairs with combining diacritics are incorrectly ordered on text extraction
- Changed TextPosition.insertDiacritic() to preserve surrogate pairs
- Added unit test
- Included example test PDF file attached to PDFBOX-5747
1 parent 0b8bc2d commit 0841f61

4 files changed: 18 additions, 3 deletions

pdfbox/src/main/java/org/apache/pdfbox/text/TextPosition.java

Lines changed: 12 additions & 3 deletions
@@ -759,16 +759,25 @@ private void insertDiacritic(int i, TextPosition diacritic)
         float[] widths2 = new float[widths.length + 1];
         System.arraycopy(widths, 0, widths2, 0, i);
 
+        // First we add a zero-width entry for the diacritic in the widths array
+        widths2[i] = widths[i];
+        widths2[i + 1] = 0;
+        System.arraycopy(widths, i + 1, widths2, i + 2, widths.length - i - 1);
+
         // Unicode combining diacritics always go after the base character, regardless of whether
         // the string is in presentation order or logical order
         sb.append(unicode.charAt(i));
-        widths2[i] = widths[i];
+
+        // If a surrogate starts at the current position, make sure we preserve it
+        if (i < unicode.length() - 1 && Character.isSurrogatePair(unicode.charAt(i), unicode.charAt(i + 1))) {
+            sb.append(unicode.charAt(i + 1));
+            i++;
+        }
+
         sb.append(combineDiacritic(diacritic.getUnicode()));
-        widths2[i + 1] = 0;
 
         // get the rest of the string
         sb.append(unicode.substring(i + 1));
-        System.arraycopy(widths, i + 1, widths2, i + 2, widths.length - i - 1);
 
         unicode = sb.toString();
         widths = widths2;
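
To see the string handling in isolation, here is a minimal standalone sketch of the same idea outside PDFBox. The class `DiacriticInsertSketch` and the helper `insertAfterBase` are hypothetical names made up for illustration, not PDFBox API, and the widths bookkeeping from the patch is left out. U+1D44B is used only as an example of a character that needs a surrogate pair in UTF-16.

```java
final class DiacriticInsertSketch {

    // Hypothetical helper (not part of PDFBox): inserts a combining mark after
    // the base character that starts at UTF-16 index i. If that base character
    // is a surrogate pair, both code units are copied before the mark.
    public static String insertAfterBase(String text, int i, String mark) {
        StringBuilder sb = new StringBuilder(text.length() + mark.length());
        sb.append(text, 0, i);          // everything before the base character
        sb.append(text.charAt(i));      // first (or only) code unit of the base

        // Same check as in the patch: keep the low surrogate next to the high one
        if (i < text.length() - 1
                && Character.isSurrogatePair(text.charAt(i), text.charAt(i + 1))) {
            sb.append(text.charAt(i + 1));
            i++;
        }

        sb.append(mark);                // the combining mark follows the whole base
        sb.append(text.substring(i + 1)); // rest of the string
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D44B lies outside the BMP, so it occupies two chars in a Java String
        String x = new String(Character.toChars(0x1D44B));
        String result = insertAfterBase(x + "y", 0, "\u0302");

        // Prints: U+1D44B U+0302 U+0079
        result.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

Without the surrogate check, the mark would be appended after the high surrogate alone, splitting the pair.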
Binary file not shown.
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+Firefox file:///home/pablo/invchar
+𝑋̂
+1 of 1 18/12/2023, 12:49
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+𝑋̂
+Firefox file:///home/pablo/invchar
+1 of 1 18/12/2023, 12:49
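
The expected-text files hinge on the middle line being a supplementary-plane letter followed by a combining mark. The sketch below lists the code points of such a string in both orders. It assumes the letter is U+1D44B and the mark is U+0302 (combining circumflex), and that the pre-patch code produced the order high surrogate, mark, low surrogate, which is what the removed lines in the diff suggest. `SurrogateInspectSketch` and `describe` are hypothetical names.

```java
final class SurrogateInspectSketch {

    // Lists each code point as U+XXXX, stepping by code point rather than by
    // char, so an intact surrogate pair shows up as one entry and a split pair
    // shows up as two lone surrogates.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.append(String.format("U+%04X ", cp));
            i += Character.charCount(cp);
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int letter = 0x1D44B; // assumed test character
        String intact = new String(Character.toChars(letter)) + "\u0302";
        String split = "" + Character.highSurrogate(letter) + "\u0302"
                + Character.lowSurrogate(letter);

        System.out.println(describe(intact)); // U+1D44B U+0302
        System.out.println(describe(split));  // U+D835 U+0302 U+DC4B
    }
}
```

The second line of output contains two unpaired surrogates, which is not well-formed UTF-16 text and would not match either expected-text file.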
