Add Unicode Annex 31 methods to char #2693

Closed
wants to merge 7 commits into from
107 changes: 107 additions & 0 deletions text/0000-char-uax-31.md
@@ -0,0 +1,107 @@
- Feature Name: `char_uax_31`
- Start Date: 2019-04-24
- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)

# Summary
[summary]: #summary

Add functions to the standard library for testing a `char` against the [UAX #31](https://unicode.org/reports/tr31/) ("Unicode Annex 31") categories
`Pattern_White_Space`, `Pattern_Syntax`, `XID_Start`, `ID_Nonstart`, and `XID_Continue`. The XID ones are already in the standard
library, but are unstable; this RFC proposes to stabilize them.

# Motivation
[motivation]: #motivation

As a systems language, Rust is heavily used for parsing.
Member

To me this reads as motivation for why Rust needs such functions somewhere, but inclusion in the stdlib is a much higher bar, especially when this gets us tangled up with unicode versioning.

Yes, we already have is_whitespace, but that's an old API that was grandfathered in, and it has fewer unicode stability issues than XID.

Member

One argument for it being in std is discoverability, but if you're parsing some grammar that grammar will likely tell you about XID.

As a progressive, forward-thinking language that accepts anyone,
Rust supports Unicode and makes the definitive string types UTF-8.
At the intersection of these needs sits *UAX #31: Unicode Identifier and Pattern Syntax* ("Annex 31"),
a standardized set of code point categories for defining computer language syntax.

This is being used in production Rust code already.
Rust's own compiler already has functions to check against Annex 31 code point categories in the lexer,
Member

This is for the unstable, old, non_ascii_idents feature, which we have already RFCd to change. The change will still need some function like this, but it may need tailoring for bidi characters. This is true for other implementors too, the XID functions often need tailoring.

They're not super hard to tailor, though, so we could still expose these functions to match the spec and let those who need tailoring suffix the calls with `|| ch == ... && ch != ...`

Contributor Author

@notriddle, May 3, 2019

The XID part is currently unstable. But stable Rust does respect Pattern_White_Space, so there's already committed-to Annex 31-based syntax. https://internals.rust-lang.org/t/do-we-need-unicode-whitespace/9876

Member

As a concrete case of tailoring, Rust actually uses `XID_Start | '_'` and not a plain `XID_Start`. I got bitten by that when implementing the lexer for IntelliJ Rust a while ago :)

[but not everyone who works on the compiler knows about them](https://internals.rust-lang.org/t/do-we-need-unicode-whitespace/9876),
and since they're not in the standard library,
not everyone who works on Rust-related tooling has access to them.
I'm not asserting that putting these in libstd would've avoided that bug,
but if they were in the standard library,
that would resolve the questions about whether third-party tooling can be expected to support the full range of Unicode whitespace.

[Other languages](https://rosettacode.org/wiki/Unicode_variable_names#C) also follow Annex 31, such as C# and Elixir.
Other common grammars, even ones that aren't actually for programming languages, can also be found or defined in Annex 31,
such as hashtags and XML.

It's also pretty clear what the "right" API is for this,
since `is_whitespace` and `is_ascii_whitespace` already set the precedent here,
so there's little need to experiment with API design.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

In addition to methods for checking "ASCII white space" and "Unicode white space,"
the `char` type provides methods for checking the Unicode Annex 31 categories
that some languages, such as Rust and C#, use to define their syntax.

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

## `fn char::is_id_nonstart(self) -> bool`

Check if `self` is a member of Unicode Annex 31's `ID_Nonstart` code point category.
This function is defined as `self.is_xid_continue() && !self.is_xid_start()`.
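
The definition above can be sketched on stable Rust with ASCII-only stand-ins. The three function names below are hypothetical and exist only for illustration; the real `XID_Start` and `XID_Continue` tables cover far more than ASCII, and the proposed methods would live on `char` itself.

```rust
/// ASCII-only stand-in for `XID_Start` (assumption: within ASCII this is A-Z and a-z).
fn is_xid_start_ascii(c: char) -> bool {
    c.is_ascii_alphabetic()
}

/// ASCII-only stand-in for `XID_Continue` (within ASCII: letters, digits, and '_').
fn is_xid_continue_ascii(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Mirrors the proposed definition: in `XID_Continue` but not in `XID_Start`.
fn is_id_nonstart_ascii(c: char) -> bool {
    is_xid_continue_ascii(c) && !is_xid_start_ascii(c)
}
```

Under these stand-ins, `'7'` and `'_'` count as nonstart characters, while `'a'` (which can start an identifier) and `' '` (which cannot appear in one at all) do not.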

## `fn char::is_pattern_syntax(self) -> bool`

Check if `self` is a member of Unicode Annex 31's `Pattern_Syntax` code point category.

## `fn char::is_pattern_white_space(self) -> bool`

Check if `self` is a member of Unicode Annex 31's `Pattern_White_Space` code point category.
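
As a rough illustration of what this check amounts to, here is a minimal sketch written as a free function rather than the proposed method. It assumes the `Pattern_White_Space` set as published in the Unicode property data (eleven code points); it is not the implementation this RFC would ship.

```rust
/// Sketch of the proposed `char::is_pattern_white_space`, as a free function.
/// Assumption: the set below is the `Pattern_White_Space` property from the
/// Unicode Character Database.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // tab, line feed, vertical tab, form feed, carriage return
            | '\u{0020}' // space
            | '\u{0085}' // next line
            | '\u{200E}' // left-to-right mark
            | '\u{200F}' // right-to-left mark
            | '\u{2028}' // line separator
            | '\u{2029}' // paragraph separator
    )
}
```

This set is not a subset of what `char::is_whitespace` accepts, nor the reverse: U+200E (left-to-right mark) is `Pattern_White_Space` but not `White_Space`, while U+00A0 (no-break space) is `White_Space` but not `Pattern_White_Space`.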

# Drawbacks
[drawbacks]: #drawbacks

Member

Matching unicode versions is also a big issue here. This function will be inaccurate half the time as we don't always update our data files immediately (it's not always straightforward). Even if we do, the behavior of this function will change every year, and while we don't have a guarantee on stdlib behavior stability, this does mean that older compilers will lead to different results on code that compiles. This further makes me feel like this should be a versioned crate.

is_whitespace already opened the doors to this issue, but I don't want to make it worse. is_whitespace is a small relatively stable list whereas XID expands all the time.

Contributor Author

Agreed. Added it.

The big problem, that has always made designing the text APIs hard,
is that it's not clear how much of Unicode we want to include in libstd.
The standard library certainly doesn't want a hashtag parser, even though Annex 31 describes one in section 6,
and libstd certainly doesn't want a character shaping algorithm,
even though Unicode places plenty of requirements on that process, too.

The other problem is that a lot of languages aren't defined in terms of Annex 31 anyway,
like Swift and HTML, which simply spell out the set of allowed code points themselves,
so this isn't necessarily useful to all of the language implementers.

The other big drawback is that Unicode changes, so keeping the standard library synced with it represents a
backwards-compatibility hazard. `is_whitespace` already has this problem, but the set of Unicode whitespace changes less
frequently than XID does, so the behavior of these functions would be expected to change more often.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

The design was chosen to line up with how character classification is already being done (like `is_whitespace`).
The alternative, of providing a more generic classification API,
seems to have enough room for debate that it would be better served in crates that provide purpose-built frameworks.
In particular, this proposal is made for the benefit of parsers, not text layout engines.
Those will still need to use things like `rust-unic`.

# Prior art
[prior-art]: #prior-art

There's already a crate that mostly provides this API, [unicode-xid](https://lib.rs/crates/unicode-xid),
but it's actually less comprehensive than this proposal (it only provides XID_Start and XID_Continue).

# Unresolved questions
[unresolved-questions]: #unresolved-questions

- What about ID_Start and ID_Continue? They're deprecated by the Unicode Consortium, but probably still useful for parsing some languages.
- `is_pattern_white_space`, like UAX 31 spells it? Or `is_pattern_whitespace`, for consistency with the rest of libstd?

# Future possibilities
[future-possibilities]: #future-possibilities

What does [Mosh](https://mosh.org/) need to know for its UTF-8 handling?
Anything that's necessary to implement a correct UTF-8 enabled VT100 state machine seems applicable to Rust,
since that state machine is separate from the text shaping itself, but still has to know things like combining marks,
and what's necessary there is probably necessary for other, similar state machines like HTML and PDF,
where you have to pick out weird combining-mark corner cases.