parser/lexer: bump to Unicode 17, use faster unicode-ident #148321

Marcondiro · 2025-10-31T09:23:37Z

Hello,

Bump the unicode version used by lexer/parser to 17.0.0 by updating:

- unicode-normalization to 0.1.25
- unicode-properties to 0.1.4
- unicode-width to 0.2.2

and by replacing unicode-xid with unicode-ident, which is also 6 times faster.
I think it might be worth running the benchmarks to double-check.
(unicode-ident is already in src/tools/tidy/src/deps.rs)

Thanks!

rustbot · 2025-10-31T09:23:41Z

The list of allowed third-party dependencies may have been modified! You must ensure that any new dependencies have compatible licenses before merging.

cc @davidtwco, @wesleywiser

These commits modify the Cargo.lock file. Unintentional changes to Cargo.lock can be introduced when switching branches and rebasing PRs.

If this was unintentional then you should revert the changes before this PR is merged.
Otherwise, you can ignore this comment.

rustbot · 2025-10-31T09:23:43Z

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

rustbot · 2025-10-31T10:20:42Z

If the Unicode version changes are intentional,
it should also be updated in the reference at
https://github.com/rust-lang/reference/blob/HEAD/src/identifiers.md.

cc @ehuss

Kobzol · 2025-10-31T13:25:13Z

@bors try @rust-timer queue

(Parsing could be affected too).

rust-bors · 2025-10-31T15:40:35Z

☀️ Try build successful (CI)
Build commit: 988451c (988451ce73b832a095adca69acf309ce27a2f54d, parent: 23c7bad921fb7163de37ea680bed317deaa03fda)

rust-timer · 2025-10-31T17:22:53Z

Finished benchmarking commit (988451c): comparison URL.

Overall result: ❌ regressions - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	0.2%	[0.1%, 0.3%]	2
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-	-	0

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

Results (primary 2.6%, secondary 3.6%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	2.6%	[2.6%, 2.6%]	1
Regressions ❌ (secondary)	3.6%	[3.6%, 3.6%]	1
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	2.6%	[2.6%, 2.6%]	1

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 473.971s -> 474.835s (0.18%)
Artifact size: 390.89 MiB -> 390.89 MiB (-0.00%)

clarfonthey · 2025-11-04T05:20:59Z

Is there a reason why the reference explicitly specifies the Unicode version in a way that makes it feel like updating that version is a nontrivial change?

i.e., is there a reason why it does not clarify that the Unicode version in the compiler is allowed to be (and should be) bumped whenever Unicode releases a new version, and to simply say something like "it is version N as of Rust 1.M"?

joshtriplett · 2025-11-05T16:16:40Z

I think it needs some level of review, to make sure that (for instance) the new Unicode version isn't doing anything out of the ordinary, and to make sure that some person in the project experienced with Unicode has taken at least a cursory look at the changes to XID_Start/XID_Continue and any related changes to confusables that overlap with XID_Start/XID_Continue.

I don't think that review is best done in lang. I think we should delegate that to whichever team is making sure of the above. (Is that T-compiler?) So, ideally, I'd love to see a proposal to lang requesting a delegation to take responsibility for the above.

That said, let's go ahead and sign off on this change to unblock it.

Manishearth · 2025-11-27T16:58:57Z

I'd be willing to help but I think for that to work the relevant teams would have to codify what they are looking for.

For reference, here's the difference between Unicode 16 and 17 for XID_Start:

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AU17%3AXID_Start%3A%5D+-+%5B%3AU16%3AXID_Start%3A%5D&g=&i=idstatus .

(and here are the new XID_Continue-only characters)

Is it a question of mismatched goals between Rust and Unicode, where Unicode's bar for "valid in identifiers" is subtly different from that of Rust? Is it a question of trust in Unicode's ability to hit that bar? Or something else?

The rough definition from the spec is this:

The formal syntax provided here captures the general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores. It provides a definition of identifiers that is guaranteed to be backward compatible with each successive release of Unicode, but also adds any appropriate new Unicode characters.
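To make the quoted shape concrete, here is a minimal sketch of a "start character followed by any number of continue characters" check. The function names are hypothetical, and the two stand-in predicates are only a std-based approximation: they are not the XID_Start/XID_Continue tables, which in the compiler would come from a Unicode data crate. Rust additionally accepts a leading underscore, which this sketch leaves out.

```rust
/// Returns true if `s` has the shape `Start Continue*`.
///
/// `is_start` and `is_continue` are supplied by the caller, so the same
/// function works with any source of Unicode property data.
fn has_identifier_shape(
    s: &str,
    is_start: impl Fn(char) -> bool,
    is_continue: impl Fn(char) -> bool,
) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // The first character must be a start character; every remaining
        // character must be a continue character.
        Some(first) if is_start(first) => chars.all(is_continue),
        // Empty strings and strings with a bad first character are rejected.
        _ => false,
    }
}

/// ASSUMPTION: approximates XID_Start with `char::is_alphabetic`.
fn approx_start(c: char) -> bool {
    c.is_alphabetic()
}

/// ASSUMPTION: approximates XID_Continue with alphanumerics plus underscore.
fn approx_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}
```

Passing the predicates in as parameters keeps the shape check independent of the Unicode version, which is the part that changes between releases.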

If it's about Rust having a different bar, then Rust needs to articulate what that bar is. There's a good chance that we can even write tests against Unicode data that verify some of those properties. This is what we did in the non-ascii idents RFC: Rust articulated its concerns about it, and we were able to crystallize those into lints that use other Unicode properties (ones explicitly designed for this purpose, even).

If it's about Unicode's ability to hit that bar, I can certainly give such a list a once-over and check if there aren't any mistakes. But beyond that, I am literally on the body that assigns these properties, and if I needed help to verify such a thing (no one person understands all of the things in Unicode) I'd just be asking other people on that body, so I don't think that would be any sort of meaningful exercise.

For instance, unicode-width got updated to Unicode 16 a few months after unicode-xid got its update. Or even right now the latest unicode-xid on crates.io is still based on Unicode 16 while normalization is on 17.

Note that this is mostly because unicode_rs crates get bumped when someone wants them to be. If Rust wanted them to be bumped, they can be bumped (best way: make a PR with the bump).

Marcondiro · 2025-12-01T10:49:12Z

tests/ui-fulldeps/lexer/unicode-version.rs

Since all the relevant crates seem to export some flavour of the UNICODE_VERSION constant included in the standard library, perhaps there could be a simple test somewhere that asserts that all these constants are identical? That way manual review of the versions isn't required.

At the moment there is this test in place, checking that rustc_lexer::UNICODE_IDENT_VERSION == rustc_parse::UNICODE_NORMALIZATION_VERSION

Would expanding this test to all the Unicode-version dependent crates used in /compiler/* be enough?
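As a rough sketch of what an expanded check could look like, the helper below takes a list of (crate name, Unicode version) pairs and reports every entry that disagrees with the first one. The helper name and the `UnicodeVersion` alias are hypothetical, and it assumes the versions have already been converted to a common tuple type, since the crates do not all export the same integer width.

```rust
/// ASSUMPTION: all versions have been converted to this common shape.
type UnicodeVersion = (u8, u8, u8);

/// Returns the names of entries whose version differs from the first entry.
/// An empty result means every listed crate agrees.
fn mismatched<'a>(versions: &[(&'a str, UnicodeVersion)]) -> Vec<&'a str> {
    // With no entries there is nothing to compare.
    let Some(&(_, expected)) = versions.first() else {
        return Vec::new();
    };
    versions
        .iter()
        .filter(|(_, v)| *v != expected)
        .map(|(name, _)| *name)
        .collect()
}
```

A test could then assert that `mismatched(&[...])` is empty, and the returned names would say which dependency needs a bump when it is not.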

So, to clarify, this is exactly what I was proposing in this comment: #148321 (comment)

I was looking into it, and the method I was going with was just adding an anonymous constant in each of the crates to make a static assertion. So, for example, in rustc_lexer, this is what I currently have in my (unpublished) branch:

    // ensure that unicode version is same as libstd
    const _: () = {
        let internal = std::char::UNICODE_VERSION;
        let properties = unicode_properties::UNICODE_VERSION;
        assert!(internal.0 as u64 == properties.0);
        assert!(internal.1 as u64 == properties.1);
        assert!(internal.2 as u64 == properties.2);
    };
    const _: () = {
        let internal = std::char::UNICODE_VERSION;
        let xid = unicode_xid::UNICODE_VERSION;
        assert!(internal.0 as u64 == xid.0);
        assert!(internal.1 as u64 == xid.1);
        assert!(internal.2 as u64 == xid.2);
    };

(note: I'm using libstd's version as an anchor because it's the most convenient: if all of them match, then libstd should match. also, while I could use unstable features to make this simpler with assert_eq, I chose to go with something that works on stable atm just for ease of maintenance)
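The repeated field-by-field comparison could be factored into a small `const fn`, still on stable. This is only a sketch: `same_version` and `DEP_UNICODE_VERSION` are hypothetical names, and the dependency constant is a stand-in that mirrors std so the example compiles without any external crate. It also assumes the dependency exports its version as `(u64, u64, u64)`, as the casts above suggest.

```rust
/// Compares std's `(u8, u8, u8)` version with a `(u64, u64, u64)` one.
/// Tuples cannot be compared with `==` in a `const fn` on stable, so the
/// fields are compared one by one.
const fn same_version(std_version: (u8, u8, u8), dep: (u64, u64, u64)) -> bool {
    std_version.0 as u64 == dep.0
        && std_version.1 as u64 == dep.1
        && std_version.2 as u64 == dep.2
}

// HYPOTHETICAL stand-in for a dependency's `UNICODE_VERSION` constant.
// It mirrors std here so the sketch compiles on any toolchain.
const DEP_UNICODE_VERSION: (u64, u64, u64) = (
    std::char::UNICODE_VERSION.0 as u64,
    std::char::UNICODE_VERSION.1 as u64,
    std::char::UNICODE_VERSION.2 as u64,
);

// Evaluated at compile time: the build fails with this message if the
// two versions ever diverge.
const _: () = assert!(
    same_version(std::char::UNICODE_VERSION, DEP_UNICODE_VERSION),
    "dependency Unicode version differs from std::char::UNICODE_VERSION",
);
```

Each additional dependency would then need only one more `const _` item, and the message string gives a somewhat clearer build error than a bare `assert!`.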

clarfonthey · 2025-12-02T05:00:59Z

Gonna preface this with: my main reason for having this discussion in the first place is, as I mostly mentioned, I think that bumping the Unicode version, since it is a routine change, should have some sort of playbook so we don't need to have these discussions again. And, so, I'm having these discussions here because I think the formation of said playbook is relevant, but if you'd rather move it somewhere else, I'm fine doing that too.

Note that this is mostly because unicode_rs crates get bumped when someone wants them to be. If Rust wanted them to be bumped, they can be bumped (best way: make a PR with the bump).

Not that this is the place to put this, but the exact reason why I submitted unicode-rs/unicode-script#25 (which you merged) is because Clippy uses unicode-script and it doesn't have a published version for Unicode 17, so an attempt to enforce that all the Unicode versions match for Clippy specifically fails. I imagine, then, if we were to enforce all of this, the procedure would then be to ping all of the authors of these crates, ensure that they're all updated to the latest version of Unicode, then perform the bump for all R-L crates? While this is also a request to do so for this particular crate, I'm also using it as an example since this would be part of what people should do going forward.

Right now, tidy only allows checking particular crates in Cargo.lock, and not explicit versions of them, so, we can't particularly enforce only versions which support the right Unicode version. This is a potential side concern of verifying the Unicode version is consistent: even though we can check the UNICODE_VERSION const for everything we use, we can't verify that dependencies also use the same version, which, even if it doesn't matter that much, is still probably better for the sake of reducing table size in binaries. So, I guess, that's also a side note here. I'm not sure exactly what the policy is for allowing dependencies in the compiler, but I assume that at least part of it is some assurance that the people who maintain them are at least active enough to merge changes if the compiler needs it.
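For illustration only, a version-aware check over Cargo.lock could be sketched as below. This is not how tidy works today; the function names are hypothetical, and the parser is a deliberately naive line scanner that assumes each `[[package]]` entry lists `name` before `version`, as Cargo.lock files do.

```rust
/// Collects every locked version of `package` from Cargo.lock-style text.
fn locked_versions(lock: &str, package: &str) -> Vec<String> {
    let mut versions = Vec::new();
    let mut in_target = false;
    for line in lock.lines().map(str::trim) {
        if line == "[[package]]" {
            // A new entry starts; forget the previous package.
            in_target = false;
        } else if let Some(name) = quoted_value(line, "name") {
            in_target = name == package;
        } else if in_target {
            if let Some(version) = quoted_value(line, "version") {
                versions.push(version.to_string());
            }
        }
    }
    versions
}

/// Parses `key = "value"` and returns `value`, or `None` for other lines.
fn quoted_value<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    let rest = line.strip_prefix(key)?.trim_start().strip_prefix('=')?;
    rest.trim().strip_prefix('"')?.strip_suffix('"')
}

/// True if `package` is locked exactly once, at version `expected`.
fn locked_only_at(lock: &str, package: &str, expected: &str) -> bool {
    locked_versions(lock, package) == [expected]
}
```

A check like `locked_only_at` would also catch the case where two versions of the same crate are present in the lock file, which is the situation that duplicates Unicode tables in the binary.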

There's also a potential consideration of just including the Unicode functionality most critical to the compiler directly in the compiler itself. Since all of the code used has a compatible license, it shouldn't be tremendously difficult to copy it in somewhere, and it would also mean that we could share some of the code that parses stuff like UnicodeData.txt between these various crates. This isn't to say it should be part of libstd, but maybe having it in the compiler to make version consistency easier is a consideration.
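As a sketch of the kind of parsing code that could be shared, the function below reads one line of UnicodeData.txt, whose first three semicolon-separated fields are the code point in hexadecimal, the character name, and the General_Category. The struct and function names are hypothetical, and a real table generator would need the remaining fields and range handling as well.

```rust
/// One record from UnicodeData.txt, reduced to the first three fields.
#[derive(Debug, PartialEq)]
struct UnicodeDataEntry {
    code_point: u32,
    name: String,
    general_category: String,
}

/// Parses one semicolon-separated line of UnicodeData.txt.
/// Returns `None` if a field is missing or the code point is not hex.
fn parse_unicode_data_line(line: &str) -> Option<UnicodeDataEntry> {
    let mut fields = line.split(';');
    let code_point = u32::from_str_radix(fields.next()?, 16).ok()?;
    let name = fields.next()?.to_string();
    let general_category = fields.next()?.to_string();
    Some(UnicodeDataEntry { code_point, name, general_category })
}
```

For example, the line `0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;` parses to code point `0x41` with category `Lu`.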

Marcondiro · 2025-12-13T17:10:05Z

Hello, I added some compile time checks as suggested by @clarfonthey.
Specifically, in rustc_parse and rustc_lexer compilation fails if the dependencies use different Unicode versions.
I did not implement the checks across the entire rust repository but only in the crates modified by this PR. I think the other checks might be addressed in another PR.
The resulting error output might not be the best, but I stuck to stable const Rust. Any suggestions on this or other parts of the PR are welcome.

r? @Mark-Simulacrum

Replace unicode-xid with unicode-ident which is 6 times faster

@clarfonthey

Add a compile time check in rustc_lexer and rustc_parse ensuring that unicode-related dependencies within the crate use the same unicode version. These checks are inspired by the examples provided by @clarfonthey.

rustbot · 2025-12-27T10:51:41Z

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

crlf0710 · 2025-12-27T14:14:14Z

I think that bumping the Unicode version, since it is a routine change, should have some sort of playbook so we don't need to have these discussions again

#101840 could be an outdated (year 2022) example of such a playbook.

Mark-Simulacrum · 2025-12-28T20:56:23Z

@bors r+ rollup

bors · 2025-12-28T20:56:26Z

📌 Commit f7cb82e has been approved by Mark-Simulacrum

It is now in the queue for this repository.

…uwer Rollup of 8 pull requests

Successful merges:

- #148321 (parser/lexer: bump to Unicode 17, use faster unicode-ident)
- #149540 (std: sys: fs: uefi: Implement readdir)
- #149582 (Implement `Duration::div_duration_{floor,ceil}`)
- #149663 (Optimized implementation for uN::{gather,scatter}_bits)
- #149667 (Fix ICE by rejecting const blocks in patterns during AST lowering (closes #148138))
- #149947 (add several older crashtests)
- #150011 (Add more `unbounded_sh[lr]` examples)
- #150411 (refactor `destructure_const`)

r? `@ghost`
`@rustbot` modify labels: rollup

JonathanBrouwer · 2025-12-29T10:33:49Z

This PR regressed perf in the rollup
#150469 (comment)

rustbot added A-tidy Area: The tidy tool S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 31, 2025

rustbot assigned Mark-Simulacrum Oct 31, 2025

rustbot added T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 31, 2025

Marcondiro force-pushed the master branch from a1d821c to 83d95c8 Compare October 31, 2025 10:20

Marcondiro mentioned this pull request Oct 31, 2025

identifiers: bump Unicode from 16 to 17 rust-lang/reference#2071

Merged

rust-bors bot added a commit that referenced this pull request Oct 31, 2025

Auto merge of #148321 - Marcondiro:master, r=<try>

988451c

parser/lexer: bump to Unicode 17, use faster unicode-ident

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 31, 2025

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 31, 2025

crlf0710 added the A-Unicode Area: Unicode label Nov 4, 2025

traviscross removed T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) labels Nov 5, 2025

apiraino removed the to-announce Announce this issue on triage meeting label Nov 27, 2025

Marcondiro force-pushed the master branch from c21cf1d to 064ca73 Compare December 13, 2025 11:27

Marcondiro force-pushed the master branch from 064ca73 to 32dbcf3 Compare December 13, 2025 16:24

rustbot assigned Mark-Simulacrum and unassigned joshtriplett Dec 13, 2025

clubby789 mentioned this pull request Dec 21, 2025

Weekly cargo update #150225

Open

Marcondiro and others added 2 commits December 27, 2025 11:20

parser/lexer: bump to Unicode 17, use faster unicode-ident

ca64688

Replace unicode-xid with unicode-ident which is 6 times faster

lexer/parser: ensure deps use the same unicode version

f7cb82e

Add a compile time check in rustc_lexer and rustc_parse ensuring that unicode-related dependencies within the crate use the same unicode version. These checks are inspired by the examples provided by @clarfonthey.

Marcondiro force-pushed the master branch from 32dbcf3 to f7cb82e Compare December 27, 2025 10:51

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 28, 2025

This was referenced Dec 28, 2025

Rollup of 6 pull requests #150467

Closed

Rollup of 8 pull requests #150469

Merged

bors merged commit 30618bb into rust-lang:main Dec 29, 2025
11 checks passed

rustbot added this to the 1.94.0 milestone Dec 29, 2025

JonathanBrouwer added the perf-regression Performance regression. label Dec 29, 2025
