fix: handle multi-byte UTF-8 characters in DeleteWhiteSpace #34

Open
Yanhu007 wants to merge 1 commit into Masterminds:master from Yanhu007:fix/delete-whitespace-utf8

@Yanhu007

Fixes #29

Problem

DeleteWhiteSpace iterates by byte index (str[i]) and casts each byte to rune. For multi-byte UTF-8 characters (Chinese, Japanese, Korean, emoji, etc.), this produces mojibake:

DeleteWhiteSpace(" 测试 测试 ")
// Got:    "æµ\x8bè¯\x95æµ\x8bè¯\x95"
// Expect: "测试测试"

Fix

Use range-based iteration, which decodes the string and yields one Unicode rune per step instead of indexing raw bytes.

// Before
for i := 0; i < sz; i++ {
    ch := rune(str[i])  // BUG: treats each byte as a rune

// After
for _, ch := range str {  // correctly iterates runes

All existing tests pass.
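
For reference, a range-based version can be sketched as a self-contained program. This is an illustration under assumptions, not the library's actual source: the name `deleteWhiteSpace`, the use of `strings.Builder`, and the choice of `unicode.IsSpace` as the whitespace test are all picked for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// deleteWhiteSpace is a hypothetical stand-in for the patched function.
// Ranging over a string decodes UTF-8, so ch is a whole code point.
func deleteWhiteSpace(str string) string {
	var b strings.Builder
	b.Grow(len(str)) // the result is never longer than valid UTF-8 input
	for _, ch := range str {
		// Assumption: whitespace means whatever unicode.IsSpace reports.
		if !unicode.IsSpace(ch) {
			b.WriteRune(ch)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(deleteWhiteSpace(" 测试 测试 ")) // prints "测试测试"
	fmt.Println(deleteWhiteSpace("a\tb\nc 😀"))  // prints "abc😀"
}
```

One caveat with this approach: when a string contains invalid UTF-8, range yields U+FFFD for each bad byte, so a loop like this one would write replacement characters rather than preserve the original bytes.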

DeleteWhiteSpace iterated over string bytes (str[i]) and cast each
byte to rune, which corrupts multi-byte UTF-8 characters like
Chinese, Japanese, Korean, emoji, etc.

  DeleteWhiteSpace(" 测试 测试 ") → "æµ\x8bè¯\x95æµ\x8bè¯\x95"

Fix: use range-based iteration which correctly yields Unicode runes.

  DeleteWhiteSpace(" 测试 测试 ") → "测试测试"

Fixes Masterminds#29
Development

Successfully merging this pull request may close these issues.

Bug for DeleteWhiteSpace when str is other utf8 string, eg. chinese