msdemlei
Contributor

@msdemlei msdemlei commented Jun 11, 2025

UCS-2 as such is basically not implemented anywhere any more. It's all UTF-16, and I say we need to acknowledge that.

Regrettably, the variable-length encoding of UTF-16 won't work for us because we need fixed lengths for the strings in VOTable BINARY2. That's why I have a TODO in here.

We could require parsers to read the UTF-16 strings and identify surrogate pairs, but that would be terrible in all ways.
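To illustrate what that would mean for a reader, here is a minimal Python sketch of pulling a field declared as `n` code points out of a UTF-16 buffer. The function name is hypothetical, and I am assuming big-endian code units without a BOM; it does no bounds or error checking.

```python
def read_code_points_utf16be(buf, offset, n):
    """Read n Unicode code points starting at byte `offset`.

    Returns (decoded_text, new_offset). Assumes big-endian UTF-16
    without a BOM and well-formed input (illustration only).
    """
    start = offset
    for _ in range(n):
        unit = int.from_bytes(buf[offset:offset + 2], "big")
        # A high surrogate (0xD800-0xDBFF) means a second code unit
        # follows, so this code point occupies four bytes, not two.
        offset += 4 if 0xD800 <= unit <= 0xDBFF else 2
    return buf[start:offset].decode("utf-16-be"), offset
```

The point is that the byte length of the field is only known after walking through every code unit, so a parser can no longer skip over a field by multiplying arraysize with a fixed width.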

To get out of this fix, we could say that arraysize represents the encoded length rather than the number of Unicode code points. I think I'd consider that reasonable.
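A rough Python sketch of that reading, assuming "encoded length" is counted in 16-bit UTF-16 code units. The helper names, the NUL padding, and the choice to reject over-long values (rather than truncate and risk splitting a surrogate pair) are my assumptions for illustration, not something the spec text settles.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2


def encode_fixed(text, arraysize):
    """Encode text into exactly arraysize UTF-16 code units (big-endian).

    Shorter values are padded with NUL code units (assumed convention);
    longer ones are rejected rather than truncated.
    """
    encoded = text.encode("utf-16-be")
    n_units = len(encoded) // 2
    if n_units > arraysize:
        raise ValueError(
            f"value needs {n_units} code units, arraysize is {arraysize}")
    return encoded + b"\x00\x00" * (arraysize - n_units)
```

With this convention a field always occupies `2 * arraysize` bytes, but a string of two code points such as `"a\U0001F4A9"` already needs three code units.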

Alternatively, we say "you can't have non-BMP characters in unicodeChar and hence no surrogate pairs. VOTable parsers must fail when they are asked to encode anything outside of the BMP or containing surrogate characters". Hm 💩. For clarity, let me stress that basically all emojis are outside of the BMP.
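For comparison, a minimal Python sketch of what the strict BMP-only rule could look like on the writing side. The function name is hypothetical and big-endian output is assumed.

```python
def encode_bmp_only(text):
    """Encode text as fixed-width 16-bit units, rejecting non-BMP input.

    Raises ValueError for any code point above U+FFFF and for lone
    surrogate code points (U+D800-U+DFFF).
    """
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} is outside the BMP")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"U+{cp:04X} is a surrogate code point")
    # Every remaining character encodes to exactly one 16-bit code unit.
    return text.encode("utf-16-be")
```

Under this rule the number of code points and the number of code units always agree, at the price of refusing strings like `"\U0001F4A9"` outright.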

See also https://wiki.ivoa.net/internal/IVOA/InterOpJune2025Apps/unicode-notes.pdf and bug #69.

msdemlei added 2 commits June 11, 2025 13:13
But that won't work easily as we can no longer reliably compute the
length of such fields, at least not without parsing them.

So, there's a TODO in here.

See also https://wiki.ivoa.net/internal/IVOA/InterOpJune2025Apps/unicode-notes.pdf