Fix multibyte character corruption in post summaries #1995

Merged: 10 commits, Jul 24, 2025

Conversation

obenland
Member

@obenland obenland commented Jul 23, 2025

Fixes #1948

Proposed changes:

  • Fix character encoding corruption in generate_post_summary() function by reordering text processing operations.
  • Replace byte-based string functions with multibyte-safe equivalents throughout the codebase.
  • Add comprehensive test coverage for Greek text handling.

Other information:

  • Have you written new tests for your changes, if applicable?

Testing instructions:

  • Create a post with Greek text content like: Τι μπορεί να σου συμβεί σε μια βόλτα για να αγοράσεις μια βαλίτσα για τα ταξίδια σου; Όλα είναι πιθανά αν έχεις ανοιχτές τις "κεραίες" σου!
  • Generate a post summary using the ActivityPub functionality
  • Verify that Greek characters like "σου" are not corrupted to "σ?"
  • Run the test suite: npm run env-test -- --filter test_generate_post_summary
  • All tests should pass, including the new Greek text test case

Changelog entry

  • Automatically create a changelog entry from the details below.
Changelog Entry Details

Significance

  • Patch

Type

  • Fixed - for any bug fixes

Message

Fix multibyte character corruption in post summaries, preventing Greek and other non-ASCII text from being garbled during text processing.

Fixes character encoding issues where Greek text like "σου" was being
corrupted to "σ?" in generated post summaries. The root cause was the
sequence of html_entity_decode() followed by wp_strip_all_tags() which
doesn't handle multibyte characters correctly.

Changes:
- Reorder text processing to use wp_strip_all_tags() before html_entity_decode()
- Replace byte-based string functions with multibyte-safe equivalents (mb_strlen, mb_substr, etc.)
- Add proper UTF-8 encoding parameters throughout the text processing pipeline
- Improve word boundary detection for better truncation points
- Add test case for Greek text to prevent regressions
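
As a rough illustration of the class of bug described above (not the plugin's actual code), the sketch below shows how a byte-based cut can split a UTF-8 character while a multibyte-aware cut cannot. It assumes the mbstring extension is loaded; the variable names are made up for the example.

```php
<?php
// Illustrative only: byte-based vs. multibyte-safe truncation of UTF-8 text.
$text = 'σου'; // three characters, six bytes in UTF-8

// substr() counts bytes, so cutting at 3 bytes splits the second character.
$byte_cut = substr( $text, 0, 3 );

// mb_substr() counts characters when given an explicit encoding.
$char_cut = mb_substr( $text, 0, 2, 'UTF-8' );

var_dump( mb_check_encoding( $byte_cut, 'UTF-8' ) ); // bool(false): invalid UTF-8
var_dump( mb_check_encoding( $char_cut, 'UTF-8' ) ); // bool(true)
```

A dangling half-character like the one in `$byte_cut` is what typically ends up rendered as "?" or a replacement glyph further down the pipeline.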
@Copilot Copilot AI review requested due to automatic review settings July 23, 2025 19:56
Copilot

This comment was marked as outdated.

@obenland obenland requested a review from pfefferle July 23, 2025 19:57
@obenland obenland self-assigned this Jul 23, 2025
@obenland
Member Author

Thanks for the feedback! I've addressed the redundant condition issue.

Regarding the magic numbers (12, 0.4, 0.8): I'm keeping these as inline values rather than constants because:

  1. Contextual values: These are algorithm-specific thresholds for text truncation behavior, not configuration that would be reused elsewhere
  2. Local scope: They only apply to this specific function's word boundary detection logic
  3. WordPress conventions: The codebase generally uses inline numeric values for similar algorithm parameters
  4. Comments provide context: The inline comments clearly explain what each threshold controls

Constants would be more appropriate if these values needed to be configurable or reused across multiple functions.

obenland added 3 commits July 23, 2025 15:03

The shortcode test was expecting truncation at 'dolor' but the improved
algorithm correctly includes 'sit' within the 25-character limit,
providing better text truncation behavior.

Keep the essential encoding fixes (reorder processing, use multibyte functions)
but revert to the simpler wordwrap-based truncation approach. The complex
word boundary detection was an improvement but not necessary to fix the
original character corruption issue.
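
For readers unfamiliar with the wordwrap-based approach mentioned in the commit message, here is a minimal sketch of the general technique. `truncate_at_word()` is a hypothetical helper, not a function from the plugin, and it assumes the input has already had its line breaks collapsed to spaces.

```php
<?php
/**
 * Hypothetical helper: truncate text at a word boundary using wordwrap().
 * Assumes $text contains no "\n" (whitespace already normalized).
 */
function truncate_at_word( string $text, int $width, string $more = '...' ): string {
	// Nothing to do if the text already fits (length measured in characters).
	if ( mb_strlen( $text, 'UTF-8' ) <= $width ) {
		return $text;
	}

	// wordwrap() only breaks at spaces here (long words are not cut), so it
	// never splits a multibyte character. Note that it measures $width in
	// bytes, so non-ASCII text is cut earlier than the character count suggests,
	// and a first word longer than $width is kept whole.
	$wrapped = wordwrap( $text, $width, "\n" );
	$first   = explode( "\n", $wrapped, 2 )[0];

	return $first . $more;
}

echo truncate_at_word( 'Lorem ipsum dolor sit amet, consectetur', 25 ), "\n";
// Lorem ipsum dolor sit...
```

The trade-off is simplicity over precision: the cut is always valid UTF-8, but the byte-based width means summaries in multibyte scripts may come out shorter than the nominal limit.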
@obenland obenland requested a review from Copilot July 23, 2025 20:14
@Copilot Copilot AI left a comment

Pull Request Overview

This PR addresses multibyte character corruption in post summaries by replacing byte-based string functions with multibyte-safe equivalents throughout the codebase. The fix ensures that Greek text and other non-ASCII characters are properly handled during text processing operations.

  • Replaces strlen() with mb_strlen() for accurate character counting
  • Reorders HTML entity decoding and tag stripping operations for proper encoding handling
  • Adds UTF-8 encoding parameters to html_entity_decode() calls
  • Updates regex patterns with Unicode modifiers for multibyte character support
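
The bullets above can be illustrated with a small standalone sketch. This is not code from the PR; it only shows, under the assumption that mbstring is available, how the byte-based and multibyte-aware variants of each operation differ on a short Greek string.

```php
<?php
// Illustrative only: how each listed change behaves on multibyte input.
$text = 'σου';

// strlen() counts bytes, mb_strlen() counts characters.
$bytes = strlen( $text );             // 6
$chars = mb_strlen( $text, 'UTF-8' ); // 3

// Without the /u modifier, "." matches one byte; with it, one character.
$byte_matches = preg_match_all( '/./', $text );  // 6
$char_matches = preg_match_all( '/./u', $text ); // 3

// Passing flags and an explicit charset makes the decoding target unambiguous
// instead of relying on the default_charset ini setting.
$decoded = html_entity_decode( '&sigma;&omicron;&upsilon;', ENT_QUOTES | ENT_HTML5, 'UTF-8' );

var_dump( $bytes, $chars, $byte_matches, $char_matches, $decoded === $text );
```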

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| includes/functions.php | Core fix in generate_post_summary() function replacing byte-based operations with multibyte-safe equivalents |
| includes/transformer/class-post.php | Updates HTML entity decoding in image processing methods to preserve UTF-8 encoding |
| tests/includes/class-test-functions.php | Adds Greek text test case to verify multibyte character handling |
| tests/includes/class-test-shortcodes.php | Updates expected test output to reflect corrected excerpt behavior |
| .github/changelog/1995-from-description | Adds changelog entry documenting the bug fix |

pfefferle
pfefferle previously approved these changes Jul 24, 2025
Co-authored-by: Matthias Pfefferle <[email protected]>
@obenland obenland merged commit 70dc73c into trunk Jul 24, 2025
11 checks passed
@obenland obenland deleted the fix/summaries branch July 24, 2025 13:16
Development

Successfully merging this pull request may close these issues.

Federated post excerpts break for non-English languages (e.g., Korean, Greek)