@dgud dgud commented Nov 19, 2025

To help build scanners that adhere to the Unicode standard definitions.

dgud commented Nov 19, 2025

Ping @potatosalad

github-actions bot commented Nov 19, 2025

CT Test Results

    2 files     97 suites   1h 8m 41s ⏱️
2 225 tests 2 174 ✅ 51 💤 0 ❌
2 615 runs  2 559 ✅ 56 💤 0 ❌

Results for commit 2e1f880.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

@jhogberg jhogberg added the team:PS Assigned to OTP team PS label Nov 19, 2025
@dgud dgud self-assigned this Nov 19, 2025
@dgud dgud force-pushed the dgud/stdlib/unicode-funcs/OTP-19858 branch 2 times, most recently from 76adee7 to bea3537 Compare November 20, 2025 14:22
@dgud dgud requested a review from Copilot November 20, 2025 15:35
Copilot finished reviewing on behalf of dgud November 20, 2025 15:39
Copilot AI left a comment

Pull Request Overview

This PR adds Unicode character classification functionality to the stdlib, exposing new public APIs for categorizing Unicode characters and checking identifier properties according to Unicode Standard Annexes #31 and #44.

Key changes include:

  • Addition of four new public functions to the unicode module: is_whitespace/1, is_ID_start/1, is_ID_continue/1, and category/1
  • Distinction between Pattern White Space (used for parsing) and White Space (Unicode property), with unicode_util:whitespace/0 renamed to pattern_whitespace/0
  • Correction of the Unicode code point upper bound from 16#110000 to 16#10FFFF
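
The bound matters because Unicode code points run from 0 to 16#10FFFF inclusive, so 16#110000 is the first value outside the range. A minimal sketch of such a check follows; the module `cp_bound` and the function `is_cp/1` are hypothetical names used for illustration and are not the generator's actual IS_CP macro.

```erlang
-module(cp_bound).
-export([is_cp/1]).

%% Hypothetical helper for illustration only; not the IS_CP macro
%% from gen_unicode_mod.escript.
%% The last valid Unicode code point is 16#10FFFF.
-define(MAX_CP, 16#10FFFF).

%% Accept integers in the inclusive range 0..16#10FFFF.
is_cp(CP) when is_integer(CP), CP >= 0, CP =< ?MAX_CP -> true;
is_cp(_) -> false.
```

With an inclusive comparison against 16#110000 instead, the out-of-range value 16#110000 itself would be accepted.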

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| lib/stdlib/uc_spec/gen_unicode_mod.escript | Generator script updated to: (1) rename whitespace() to pattern_whitespace(), (2) fix IS_CP macro upper bound to 16#10FFFF, (3) distinguish between pattern_white_space and white_space properties |
| lib/stdlib/src/unicode.erl | Adds four new public classification functions with documentation, exports category type, and implements validation logic |
| lib/stdlib/src/string.erl | Updates calls from unicode_util:whitespace() to unicode_util:pattern_whitespace() |
| lib/stdlib/src/erl_stdlib_errors.erl | Adds error message formatting for the four new unicode functions |
| lib/stdlib/test/unicode_util_SUITE.erl | Updates whitespace test to use renamed pattern_whitespace() function |
| lib/stdlib/test/unicode_SUITE.erl | Adds comprehensive test suite for new classification functions including category, is_whitespace, is_ID_start, and is_ID_continue; adds utility functions for parsing Unicode property files |
| lib/stdlib/test/Makefile | Adds installation of Unicode property test data files (PropList.txt and DerivedCoreProperties.txt) |

When making scanners there might be a need to classify codepoints if
the specification follows the Unicode standard.

The following functions:
 is_whitespace/1,
 is_ID_start/1,
 is_ID_continue/1,
 category/1
are (at least) needed to scan and parse, for example, Markdown.

This commit is backward incompatible if the *undocumented*
unicode_util:[is_]whitespace(..) functions have been used.

unicode_util:whitespace() has been renamed to
unicode_util:pattern_whitespace(). The string module uses it and
documents that it uses the `pattern_whitespace` property.

The function was renamed because `pattern_whitespace` is not a subset
of the `whitespace` characters, as was discovered during testing.
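
The non-subset relation can be checked with a small sketch. The code point lists are hard-coded from the Pattern_White_Space and White_Space entries in PropList.txt as I understand them, not read from the generated tables; the module `ws_sets` and its functions are hypothetical names for illustration.

```erlang
-module(ws_sets).
-export([pattern_white_space/0, white_space/0, only_pattern/0]).

%% Assumed contents of the Pattern_White_Space property (11 code points).
pattern_white_space() ->
    lists:seq(16#0009, 16#000D)
        ++ [16#0020, 16#0085, 16#200E, 16#200F, 16#2028, 16#2029].

%% Assumed contents of the White_Space property (25 code points).
white_space() ->
    lists:seq(16#0009, 16#000D)
        ++ [16#0020, 16#0085, 16#00A0, 16#1680]
        ++ lists:seq(16#2000, 16#200A)
        ++ [16#2028, 16#2029, 16#202F, 16#205F, 16#3000].

%% Code points that are Pattern_White_Space but not White_Space.
only_pattern() ->
    WS = white_space(),
    [CP || CP <- pattern_white_space(), not lists:member(CP, WS)].
```

Under these assumptions, `ws_sets:only_pattern()` returns `[16#200E, 16#200F]`, the LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK, which are pattern white space without being white space.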

is_whitespace(Char) has been changed to use the Unicode `whitespace`
property.

See Unicode Standard Annex #31 and #44.
@dgud dgud force-pushed the dgud/stdlib/unicode-funcs/OTP-19858 branch from bea3537 to 2e1f880 Compare November 20, 2025 16:01