stdlib: Add (and expose) Unicode classifying functionality #10387

dgud · 2025-11-19T08:43:27Z

To help build scanners that adhere to the Unicode standard definitions.

dgud · 2025-11-19T08:43:44Z

Ping @potatosalad

github-actions · 2025-11-19T08:44:15Z

CT Test Results

2 files 97 suites 1h 8m 41s ⏱️
2 225 tests 2 174 ✅ 51 💤 0 ❌
2 615 runs 2 559 ✅ 56 💤 0 ❌

Results for commit 2e1f880.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

Copilot

Pull Request Overview

This PR adds Unicode character classification functionality to the stdlib, exposing new public APIs for categorizing Unicode characters and checking identifier properties according to Unicode Standard Annexes #31 and #44.

Key changes include:

- Addition of four new public functions to the unicode module: is_whitespace/1, is_ID_start/1, is_ID_continue/1, and category/1
- Distinction between Pattern White Space (used for parsing) and White Space (Unicode property), with unicode_util:whitespace/0 renamed to pattern_whitespace/0
- Correction of the Unicode code point upper bound from 16#110000 to 16#10FFFF
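As an illustration of how a scanner might use identifier predicates like the ones listed above, here is a minimal sketch. The module name `id_scan` and the function `scan_id/3` are hypothetical and not part of this PR. The predicates are passed in as funs so that the sketch runs on any OTP release; it assumes each predicate takes a code point and returns a boolean.

```erlang
-module(id_scan).
-export([scan_id/3]).

%% Hypothetical helper, not part of the PR.
%% Scans a leading identifier from a list of code points.
%% IsStart and IsContinue are predicates (code point -> boolean()).
%% Returns {ok, Identifier, Rest}, or error if the input does not
%% begin with an identifier-start code point.
-spec scan_id([char()],
              fun((char()) -> boolean()),
              fun((char()) -> boolean())) ->
          {ok, [char()], [char()]} | error.
scan_id([CP | Rest], IsStart, IsContinue) ->
    case IsStart(CP) of
        true ->
            %% Take the longest run of identifier-continue code points.
            {Tail, Remaining} = lists:splitwith(IsContinue, Rest),
            {ok, [CP | Tail], Remaining};
        false ->
            error
    end;
scan_id([], _IsStart, _IsContinue) ->
    error.
```

With this PR applied, and assuming the new functions return `true` or `false`, the predicates could be supplied as `fun unicode:is_ID_start/1` and `fun unicode:is_ID_continue/1`.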

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| lib/stdlib/uc_spec/gen_unicode_mod.escript | Generator script updated to: (1) rename `whitespace()` to `pattern_whitespace()`, (2) fix IS_CP macro upper bound to 16#10FFFF, (3) distinguish between pattern_white_space and white_space properties |
| lib/stdlib/src/unicode.erl | Adds four new public classification functions with documentation, exports category type, and implements validation logic |
| lib/stdlib/src/string.erl | Updates calls from `unicode_util:whitespace()` to `unicode_util:pattern_whitespace()` |
| lib/stdlib/src/erl_stdlib_errors.erl | Adds error message formatting for the four new unicode functions |
| lib/stdlib/test/unicode_util_SUITE.erl | Updates whitespace test to use renamed `pattern_whitespace()` function |
| lib/stdlib/test/unicode_SUITE.erl | Adds comprehensive test suite for new classification functions including category, is_whitespace, is_ID_start, and is_ID_continue; adds utility functions for parsing Unicode property files |
| lib/stdlib/test/Makefile | Adds installation of Unicode property test data files (PropList.txt and DerivedCoreProperties.txt) |



When making scanners there might be a need to classify code points if the specification follows the Unicode standard. The following functions, is_whitespace/1, is_ID_start/1, is_ID_continue/1 and category/1, are (at least) needed to scan and parse, for example, Markdown.

This commit is backward incompatible if the *undocumented* unicode_util:[is_]whitespace(..) functions have been used.

unicode_util:whitespace() has been renamed to unicode_util:pattern_whitespace(). The string module uses it and documents that it uses the `pattern_whitespace` property. The function was renamed because `pattern_whitespace` is not a subset of the `whitespace` characters, as was discovered during testing.

is_whitespace(Char) has been changed to use the Unicode `whitespace` property. See Unicode Standard Annex #31 and #44.
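To make the "not a subset" observation concrete, the following sketch hard-codes the two properties as they are listed in the Unicode PropList.txt data file and computes the code points that are Pattern_White_Space but not White_Space. The module name `ws_sets` and its function names are hypothetical. The sketch does not call any function from this PR.

```erlang
-module(ws_sets).
-export([pattern_only/0]).

%% Hypothetical illustration, not part of the PR.
%% Code points with the Pattern_White_Space property (PropList.txt).
pattern_white_space() ->
    lists:seq(16#0009, 16#000D)
        ++ [16#0020, 16#0085, 16#200E, 16#200F, 16#2028, 16#2029].

%% Code points with the White_Space property (PropList.txt).
white_space() ->
    lists:seq(16#0009, 16#000D)
        ++ [16#0020, 16#0085, 16#00A0, 16#1680]
        ++ lists:seq(16#2000, 16#200A)
        ++ [16#2028, 16#2029, 16#202F, 16#205F, 16#3000].

%% Code points in Pattern_White_Space that are not in White_Space.
pattern_only() ->
    pattern_white_space() -- white_space().
```

`ws_sets:pattern_only()` returns `[16#200E, 16#200F]`, the LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK. These two code points are why a function backed by `pattern_whitespace` cannot simply be treated as a narrower version of one backed by `whitespace`.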

jhogberg added the team:PS (Assigned to OTP team PS) label Nov 19, 2025

dgud self-assigned this Nov 19, 2025

dgud force-pushed the dgud/stdlib/unicode-funcs/OTP-19858 branch 2 times, most recently from 76adee7 to bea3537 on November 20, 2025 14:22

dgud requested a review from Copilot November 20, 2025 15:35

Copilot started reviewing on behalf of dgud November 20, 2025 15:35

Copilot finished reviewing on behalf of dgud November 20, 2025 15:39

Copilot AI reviewed Nov 20, 2025

lib/stdlib/src/unicode.erl (outdated, resolved)

lib/stdlib/test/unicode_SUITE.erl (resolved)

lib/stdlib/test/unicode_util_SUITE.erl (resolved)

dgud force-pushed the dgud/stdlib/unicode-funcs/OTP-19858 branch from bea3537 to 2e1f880 on November 20, 2025 16:01
