Fix multibyte character corruption in post summaries #1995

obenland · 2025-07-23T19:56:15Z

Fixes #1948

Proposed changes:

Fix character encoding corruption in generate_post_summary() function by reordering text processing operations.
Replace byte-based string functions with multibyte-safe equivalents throughout the codebase.
Add comprehensive test coverage for Greek text handling.
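
To make the byte-versus-character distinction concrete, here is a minimal sketch using the word from the bug report. It is an illustration only, not code from this PR, and it assumes the mbstring extension is loaded and the source file is saved as UTF-8.

```php
<?php
// Illustration only: the Greek word from the bug report, encoded as UTF-8.
$word = 'σου';

// Byte-based functions count bytes; each of these letters takes two bytes.
$byte_length = strlen( $word );             // 6
$char_length = mb_strlen( $word, 'UTF-8' ); // 3

// Cutting at a byte offset can split a character in half,
// leaving an invalid sequence that is often rendered as "?".
$byte_cut = substr( $word, 0, 3 );

// Cutting at a character offset keeps every character whole.
$char_cut = mb_substr( $word, 0, 2, 'UTF-8' );

$byte_cut_is_valid = mb_check_encoding( $byte_cut, 'UTF-8' ); // false
$char_cut_is_valid = mb_check_encoding( $char_cut, 'UTF-8' ); // true
```

A half-cut character like the one in `$byte_cut` is one plausible way a "σ?" result can appear, though the exact failing call in the plugin is not shown here.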

Other information:

Have you written new tests for your changes, if applicable?

Testing instructions:

Create a post with Greek text content like: Τι μπορεί να σου συμβεί σε μια βόλτα για να αγοράσεις μια βαλίτσα για τα ταξίδια σου; Όλα είναι πιθανά αν έχεις ανοιχτές τις "κεραίες" σου!
Generate a post summary using the ActivityPub functionality
Verify that Greek characters like "σου" are not corrupted to "σ?"
Run the test suite: npm run env-test -- --filter test_generate_post_summary
All tests should pass, including the new Greek text test case

Changelog entry

Automatically create a changelog entry from the details below.

Changelog Entry Details

Significance

Patch

Type

Fixed - for any bug fixes

Message

Fix multibyte character corruption in post summaries, preventing Greek and other non-ASCII text from being garbled during text processing.

Fixes character encoding issues where Greek text like "σου" was being corrupted to "σ?" in generated post summaries. The root cause was the sequence of html_entity_decode() followed by wp_strip_all_tags(), which doesn't handle multibyte characters correctly.

Changes:
- Reorder text processing to use wp_strip_all_tags() before html_entity_decode()
- Replace byte-based string functions with multibyte-safe equivalents (mb_strlen, mb_substr, etc.)
- Add proper UTF-8 encoding parameters throughout the text processing pipeline
- Improve word boundary detection for better truncation points
- Add test case for Greek text to prevent regressions
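
The processing order described in this entry can be sketched roughly as follows. This is a simplified illustration and not the plugin's generate_post_summary(): the function name `summarize_sketch` is hypothetical, PHP's native strip_tags() stands in for wp_strip_all_tags() so the snippet runs outside WordPress, and the mbstring extension is assumed.

```php
<?php
/**
 * Hypothetical sketch: strip tags first, decode entities second,
 * then measure and cut by characters rather than bytes.
 */
function summarize_sketch( string $html, int $length ): string {
	// 1. Remove markup while entities are still encoded.
	$text = strip_tags( $html );

	// 2. Decode entities, naming the target encoding explicitly.
	$text = html_entity_decode( $text, ENT_QUOTES, 'UTF-8' );
	$text = trim( $text );

	// 3. Compare and truncate in characters, not bytes.
	if ( mb_strlen( $text, 'UTF-8' ) > $length ) {
		$text = mb_substr( $text, 0, $length, 'UTF-8' );
	}

	return $text;
}

$full  = summarize_sketch( '<p>&quot;σου&quot;</p>', 100 );
$short = summarize_sketch( '<b>σου σου</b>', 3 );
```

Here `$full` keeps the decoded quotes around the Greek word, and `$short` is cut after three whole characters.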

obenland · 2025-07-23T20:01:40Z

Thanks for the feedback! I've addressed the redundant condition issue.

Regarding the magic numbers (12, 0.4, 0.8): I'm keeping these as inline values rather than constants because:

Contextual values: These are algorithm-specific thresholds for text truncation behavior, not configuration that would be reused elsewhere
Local scope: They only apply to this specific function's word boundary detection logic
WordPress conventions: The codebase generally uses inline numeric values for similar algorithm parameters
Comments provide context: The inline comments clearly explain what each threshold controls

Constants would be more appropriate if these values needed to be configurable or reused across multiple functions.

The shortcode test was expecting truncation at 'dolor' but the improved algorithm correctly includes 'sit' within the 25-character limit, providing better text truncation behavior.

Keep the essential encoding fixes (reorder processing, use multibyte functions) but revert to the simpler wordwrap-based truncation approach. The complex word boundary detection was an improvement but not necessary to fix the original character corruption issue.
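
For readers unfamiliar with the wordwrap-based approach mentioned above, a minimal sketch of the general technique follows. The helper name `truncate_at_word_sketch` and the sample strings are assumptions for illustration; the plugin's actual truncation code is not reproduced here.

```php
<?php
/**
 * Hypothetical helper: keep only the first wrapped line of a single-line text.
 */
function truncate_at_word_sketch( string $text, int $width ): string {
	// wordwrap() breaks at spaces; with cut_long_words left false
	// it never splits a word, so multibyte characters stay whole.
	$wrapped = wordwrap( $text, $width, "\n", false );

	return explode( "\n", $wrapped )[0];
}

$latin = truncate_at_word_sketch( 'Lorem ipsum dolor sit amet, consectetur', 25 );
$greek = truncate_at_word_sketch( 'Τι μπορεί να σου συμβεί σε μια βόλτα', 20 );

$greek_is_valid = mb_check_encoding( $greek, 'UTF-8' );
```

Note that wordwrap() measures width in bytes, so Greek text is cut at roughly half as many characters as Latin text for the same width. Because it only breaks at spaces, the result remains valid UTF-8.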

Copilot

Pull Request Overview

This PR addresses multibyte character corruption in post summaries by replacing byte-based string functions with multibyte-safe equivalents throughout the codebase. The fix ensures that Greek text and other non-ASCII characters are properly handled during text processing operations.

Replaces strlen() with mb_strlen() for accurate character counting
Reorders HTML entity decoding and tag stripping operations for proper encoding handling
Adds UTF-8 encoding parameters to html_entity_decode() calls
Updates regex patterns with Unicode modifiers for multibyte character support
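
The effect of the Unicode modifier on regex patterns can be seen in a small sketch. It is independent of the plugin's actual patterns, which are not shown in this conversation.

```php
<?php
$word = 'σου';

// Without the "u" modifier, "." matches one byte, and this word is six bytes long.
$bytewise = preg_match( '/^.{3}$/', $word );

// With "u", pattern and subject are treated as UTF-8, so "." matches one character.
$unicode = preg_match( '/^.{3}$/u', $word );

// A common idiom for splitting a UTF-8 string into its characters.
$chars = preg_split( '//u', $word, -1, PREG_SPLIT_NO_EMPTY );
```

With these inputs, `$bytewise` is 0, `$unicode` is 1, and `$chars` holds the three Greek letters as separate strings.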

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| includes/functions.php | Core fix in `generate_post_summary()` function replacing byte-based operations with multibyte-safe equivalents |
| includes/transformer/class-post.php | Updates HTML entity decoding in image processing methods to preserve UTF-8 encoding |
| tests/includes/class-test-functions.php | Adds Greek text test case to verify multibyte character handling |
| tests/includes/class-test-shortcodes.php | Updates expected test output to reflect corrected excerpt behavior |
| .github/changelog/1995-from-description | Adds changelog entry documenting the bug fix |

includes/functions.php

Co-authored-by: Matthias Pfefferle <[email protected]>

Copilot AI review requested due to automatic review settings July 23, 2025 19:56

Add changelog

653a167

obenland requested a review from pfefferle July 23, 2025 19:57

obenland self-assigned this Jul 23, 2025

Remove redundant condition check in text truncation logic

bb48aec

obenland added 3 commits July 23, 2025 15:03

Update test expectation for improved word boundary detection

34b547f

The shortcode test was expecting truncation at 'dolor' but the improved algorithm correctly includes 'sit' within the 25-character limit, providing better text truncation behavior.

Remove comment

87a014c

obenland requested a review from Copilot July 23, 2025 20:14

Copilot AI reviewed Jul 23, 2025

includes/functions.php (resolved)

pfefferle reviewed Jul 24, 2025

includes/functions.php (outdated, resolved)

pfefferle previously approved these changes Jul 24, 2025

Update includes/functions.php

46c104b

Co-authored-by: Matthias Pfefferle <[email protected]>

obenland dismissed pfefferle’s stale review via 46c104b July 24, 2025 12:35

obenland requested a review from pfefferle July 24, 2025 12:35

Merge branch 'trunk' into fix/summaries

ee8a1c3

pfefferle previously approved these changes Jul 24, 2025

pfefferle reviewed Jul 24, 2025

includes/functions.php (outdated, resolved)

Update includes/functions.php

5ff0240

Co-authored-by: Matthias Pfefferle <[email protected]>

obenland dismissed pfefferle’s stale review via 5ff0240 July 24, 2025 12:45

obenland requested a review from pfefferle July 24, 2025 12:45

pfefferle approved these changes Jul 24, 2025

Merge branch 'trunk' into fix/summaries

8b67a21

obenland merged commit 70dc73c into trunk Jul 24, 2025
11 checks passed

obenland deleted the fix/summaries branch July 24, 2025 13:16
