stdlib: Add (and expose) Unicode classifying functionality #10387
Ping @potatosalad
CT Test Results: 2 files, 97 suites, 1h 8m 41s ⏱️ Results for commit 2e1f880. ♻️ This comment has been updated with latest results. To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally. Artifacts
// Erlang/OTP Github Action Bot
76adee7 to bea3537
Pull Request Overview
This PR adds Unicode character classification functionality to the stdlib, exposing new public APIs for categorizing Unicode characters and checking identifier properties according to Unicode Standard Annexes #31 and #44.
Key changes include:
- Addition of four new public functions to the `unicode` module: `is_whitespace/1`, `is_ID_start/1`, `is_ID_continue/1`, and `category/1`
- Distinction between Pattern White Space (used for parsing) and White Space (Unicode property), with `unicode_util:whitespace/0` renamed to `pattern_whitespace/0`
- Correction of the Unicode code point upper bound from `16#110000` to `16#10FFFF`
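The distinction between the two white space properties can be checked with a small sketch. The module name `ws_sketch` and its functions are hypothetical, and the code point sets are transcribed by hand from the Unicode `PropList.txt` data rather than taken from the module generated in this PR, so treat it as an illustration only.

```erlang
-module(ws_sketch).
-export([is_pattern_white_space/1, is_white_space/1, pattern_only/0]).

%% Pattern_White_Space, hand-copied from PropList.txt (assumption:
%% not the generated unicode_util code from this PR).
is_pattern_white_space(CP) when CP >= 16#0009, CP =< 16#000D -> true;
is_pattern_white_space(16#0020) -> true;
is_pattern_white_space(16#0085) -> true;
is_pattern_white_space(16#200E) -> true;
is_pattern_white_space(16#200F) -> true;
is_pattern_white_space(16#2028) -> true;
is_pattern_white_space(16#2029) -> true;
is_pattern_white_space(_) -> false.

%% White_Space, hand-copied from PropList.txt.
is_white_space(CP) when CP >= 16#0009, CP =< 16#000D -> true;
is_white_space(16#0020) -> true;
is_white_space(16#0085) -> true;
is_white_space(16#00A0) -> true;
is_white_space(16#1680) -> true;
is_white_space(CP) when CP >= 16#2000, CP =< 16#200A -> true;
is_white_space(16#2028) -> true;
is_white_space(16#2029) -> true;
is_white_space(16#202F) -> true;
is_white_space(16#205F) -> true;
is_white_space(16#3000) -> true;
is_white_space(_) -> false.

%% Code points that are Pattern_White_Space but not White_Space.
%% Scans the whole range 0..16#10FFFF, the last valid code point.
pattern_only() ->
    [CP || CP <- lists:seq(0, 16#10FFFF),
           is_pattern_white_space(CP),
           not is_white_space(CP)].
```

With these tables, `ws_sketch:pattern_only()` returns `[16#200E, 16#200F]` (LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK), which is why neither property can stand in for the other.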
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| lib/stdlib/uc_spec/gen_unicode_mod.escript | Generator script updated to: (1) rename whitespace() to pattern_whitespace(), (2) fix IS_CP macro upper bound to 16#10FFFF, (3) distinguish between pattern_white_space and white_space properties |
| lib/stdlib/src/unicode.erl | Adds four new public classification functions with documentation, exports category type, and implements validation logic |
| lib/stdlib/src/string.erl | Updates calls from unicode_util:whitespace() to unicode_util:pattern_whitespace() |
| lib/stdlib/src/erl_stdlib_errors.erl | Adds error message formatting for the four new unicode functions |
| lib/stdlib/test/unicode_util_SUITE.erl | Updates whitespace test to use renamed pattern_whitespace() function |
| lib/stdlib/test/unicode_SUITE.erl | Adds comprehensive test suite for new classification functions including category, is_whitespace, is_ID_start, and is_ID_continue; adds utility functions for parsing Unicode property files |
| lib/stdlib/test/Makefile | Adds installation of Unicode property test data files (PropList.txt and DerivedCoreProperties.txt) |
When writing scanners, there may be a need to classify code points if the specification follows the Unicode standard. The following functions are (at least) needed to scan and parse, for example, Markdown: is_whitespace/1, is_ID_start/1, is_ID_continue/1, category/1.

This commit is backward incompatible if the *undocumented* unicode_util:[is_]whitespace(..) functions have been used.

unicode_util:whitespace() has been renamed to unicode_util:pattern_whitespace(). The string module uses it and documents that it uses the `pattern_whitespace` property. The function was renamed because `pattern_whitespace` is not a subset of the `whitespace` characters, as was discovered during testing.

is_whitespace(Char) has been changed to use the Unicode `whitespace` property.

See Unicode Standard Annex #31 and #44.
bea3537 to 2e1f880
To help build scanners that adhere to the Unicode standard definitions.
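As a rough illustration of such a scanner, the sketch below reads one identifier from a flat list of code points. The module name `id_scan_sketch` is hypothetical. `identifier/1` assumes this PR is applied and that `unicode:is_ID_start/1` and `unicode:is_ID_continue/1` take a code point and return a boolean; `identifier/3` accepts the predicates as arguments so it can be tried on a release without them.

```erlang
-module(id_scan_sketch).
-export([identifier/1, identifier/3]).

%% Assumption: the PR's unicode:is_ID_start/1 and
%% unicode:is_ID_continue/1 exist and return booleans.
identifier(Chars) ->
    identifier(Chars,
               fun unicode:is_ID_start/1,
               fun unicode:is_ID_continue/1).

%% Chars must be a flat list of code points.
%% Returns {ok, Identifier, Rest} or error if no identifier starts here.
identifier([C | Rest], IsStart, IsContinue) when is_integer(C) ->
    case IsStart(C) of
        true ->
            %% Take the longest prefix of continue characters.
            {Tail, Remaining} = lists:splitwith(IsContinue, Rest),
            {ok, [C | Tail], Remaining};
        false ->
            error
    end;
identifier(_, _, _) ->
    error.
```

For example, with ASCII-only stand-in predicates, `identifier("x1 = 2", IsAlpha, IsAlnum)` yields `{ok, "x1", " = 2"}`, while input starting with a digit yields `error`.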