Fix multibyte character corruption in post summaries #1995
Fixes character encoding issues where Greek text like "σου" was being corrupted to "σ?" in generated post summaries. The root cause was the sequence of html_entity_decode() followed by wp_strip_all_tags(), which doesn't handle multibyte characters correctly.

Changes:
- Reorder text processing to use wp_strip_all_tags() before html_entity_decode()
- Replace byte-based string functions with multibyte-safe equivalents (mb_strlen, mb_substr, etc.)
- Add proper UTF-8 encoding parameters throughout the text processing pipeline
- Improve word boundary detection for better truncation points
- Add test case for Greek text to prevent regressions
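To make the failure mode concrete, here is a minimal sketch of how a byte-oriented cut can turn "σου" into "σ?". It is written in Python purely for illustration; the PR itself changes PHP code, and the helper names `truncate_bytes` and `truncate_chars` are hypothetical, not functions from the plugin.

```python
def truncate_bytes(text: str, limit: int) -> str:
    """Cut after `limit` bytes, the way a byte-oriented string function would."""
    raw = text.encode("utf-8")[:limit]
    # A cut inside a 2-byte sequence leaves an invalid tail; render it as "?".
    return raw.decode("utf-8", errors="replace").replace("\ufffd", "?")


def truncate_chars(text: str, limit: int) -> str:
    """Cut after `limit` characters (code points), never inside one."""
    return text[:limit]


word = "σου"
# Each Greek letter here is one character but two bytes in UTF-8.
print(len(word), len(word.encode("utf-8")))
print(truncate_bytes(word, 3))  # cut lands in the middle of the second letter
print(truncate_chars(word, 2))  # cut lands between letters
```

The byte-based cut splits the second letter in half, which is the same shape of damage the issue describes; counting in characters avoids it.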
Thanks for the feedback! I've addressed the redundant condition issue. Regarding the magic numbers (12, 0.4, 0.8): I'm keeping these as inline values rather than constants because:
Constants would be more appropriate if these values needed to be configurable or reused across multiple functions.
The shortcode test was expecting truncation at 'dolor' but the improved algorithm correctly includes 'sit' within the 25-character limit, providing better text truncation behavior.
Keep the essential encoding fixes (reorder processing, use multibyte functions) but revert to the simpler wordwrap-based truncation approach. The complex word boundary detection was an improvement but not necessary to fix the original character corruption issue.
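The truncation behavior discussed in these commits can be sketched as a character-based cut that backs up to the last space. This is an illustration in Python under stated assumptions, not the plugin's PHP implementation: the function name `truncate_at_word` is hypothetical, and the sample string is an assumed stand-in because the actual shortcode test fixture is not shown here.

```python
def truncate_at_word(text: str, limit: int) -> str:
    """Return at most `limit` characters, cutting at the last space if possible."""
    if len(text) <= limit:
        return text
    # Look one character past the limit so a word that ends exactly
    # at the limit is kept whole.
    window = text[: limit + 1]
    cut = window.rfind(" ")
    if cut <= 0:
        # No usable space: fall back to a hard cut on a character boundary.
        return text[:limit]
    return text[:cut].rstrip()


# Assumed sample input: "Lorem ipsum dolor sit" is 21 characters,
# so it fits inside a 25-character limit.
print(truncate_at_word("Lorem ipsum dolor sit amet", 25))
```

Because the length checks and slices operate on characters rather than bytes, the same logic works unchanged for Greek or other multibyte text.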
Pull Request Overview
This PR addresses multibyte character corruption in post summaries by replacing byte-based string functions with multibyte-safe equivalents throughout the codebase. The fix ensures that Greek text and other non-ASCII characters are properly handled during text processing operations.
- Replaces strlen() with mb_strlen() for accurate character counting
- Reorders HTML entity decoding and tag stripping operations for proper encoding handling
- Adds UTF-8 encoding parameters to html_entity_decode() calls
- Updates regex patterns with Unicode modifiers for multibyte character support
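The processing order in the list above (strip tags first, decode entities second, then count characters) can be sketched as follows. This is a rough Python illustration, not the plugin's code: the regular expression is a crude, assumed stand-in for wp_strip_all_tags(), `html.unescape` stands in for html_entity_decode() with a UTF-8 encoding argument, and the name `plain_text` is hypothetical. Python strings are already Unicode, so the sketch shows the ordering and the character counting rather than reproducing the PHP bug.

```python
import html
import re

# Crude tag matcher used only for this sketch; it is not a real HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def plain_text(markup: str) -> str:
    """Strip tags, then decode entities, then normalize whitespace."""
    stripped = TAG_RE.sub("", markup)   # 1. remove tags first
    decoded = html.unescape(stripped)   # 2. then decode entities
    return re.sub(r"\s+", " ", decoded).strip()


summary = plain_text("<p>&sigma;&omicron;&upsilon; &amp; σου</p>")
print(summary)
# Character count versus UTF-8 byte count of the same text.
print(len(summary), len(summary.encode("utf-8")))
```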
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
includes/functions.php | Core fix in generate_post_summary() function replacing byte-based operations with multibyte-safe equivalents |
includes/transformer/class-post.php | Updates HTML entity decoding in image processing methods to preserve UTF-8 encoding |
tests/includes/class-test-functions.php | Adds Greek text test case to verify multibyte character handling |
tests/includes/class-test-shortcodes.php | Updates expected test output to reflect corrected excerpt behavior |
.github/changelog/1995-from-description | Adds changelog entry documenting the bug fix |
Co-authored-by: Matthias Pfefferle <[email protected]>
Fixes #1948
Proposed changes:
- Fix multibyte character corruption in the generate_post_summary() function by reordering text processing operations.

Other information:
Testing instructions:
Τι μπορεί να σου συμβεί σε μια βόλτα για να αγοράσεις μια βαλίτσα για τα ταξίδια σου; Όλα είναι πιθανά αν έχεις ανοιχτές τις "κεραίες" σου!
npm run env-test -- --filter test_generate_post_summary
Changelog entry
Message: Fix multibyte character corruption in post summaries, preventing Greek and other non-ASCII text from being garbled during text processing.