Prepare mktables for Unicode 15.1 and 16.0 #23133
base: blead
Changes from all commits
baaec01
9f8e157
c522b7e
ca2e9b7
e7730ad
b30f4cc
51f594d
de01c61
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -871,6 +871,15 @@ push @tables_that_may_be_empty, 'Grapheme_Cluster_Break=Prepend' | |
push @tables_that_may_be_empty, 'Canonical_Combining_Class=CCC133' | ||
if $v_version ge v6.2.0; | ||
|
||
# These properties of Egyptian hieroglyphs are not yet handled by Perl. Their | ||
# intended audience is only specialist Egyptologists | ||
push @tables_that_may_be_empty, qw(kEH_Cat kEH_Desc kEH_HG kEH_IFAO | ||
kEH_JSesh | ||
kEH_NoMirror kEH_NoMirror=Yes | ||
kEH_NoMirror=No | ||
kEH_NoRotate kEH_NoRotate=Yes) | ||
if $v_version ge v16.0.0; | ||
|
||
# The lists below are hashes, so the key is the item in the list, and the | ||
# value is the reason why it is in the list. This makes generation of | ||
# documentation easier. | ||
|
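Note on the hunk above: the new entries are gated on the Unicode version being compiled, using string comparison of v-strings. A minimal standalone sketch of that pattern follows. The helper `tables_that_may_be_empty()` is hypothetical (mktables pushes onto a file-level array instead), and only a few of the `kEH_*` names are listed to keep it short.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of mktables: returns the table names that
# are allowed to be empty for a given Unicode version (passed as a v-string).
sub tables_that_may_be_empty {
    my ($v_version) = @_;
    my @tables;

    # v-strings compare correctly with the string operators, because each
    # component becomes one character with that ordinal.
    push @tables, 'Canonical_Combining_Class=CCC133'
        if $v_version ge v6.2.0;

    # Subset of the Unikemet names from the hunk above.
    push @tables, qw(kEH_Cat kEH_Desc kEH_NoMirror)
        if $v_version ge v16.0.0;

    return @tables;
}

my @for_15_1 = tables_that_may_be_empty(v15.1.0);
my @for_16_0 = tables_that_may_be_empty(v16.0.0);

printf "15.1: %d table(s), 16.0: %d table(s)\n",
       scalar @for_15_1, scalar @for_16_0;
```

With this gating, compiling an older Unicode version never sees the Egyptian hieroglyph names at all, so no "unexpectedly empty table" check is relaxed for versions that do not define them.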
@@ -8811,7 +8820,7 @@ sub trace { return main::trace(@_) if main::DEBUG && $to_trace } | |
# filesystem to distinguish between, this is used to manually give short | ||
# names for the directory name immediately under $match_tables that the | ||
# match tables for this property should be placed in. | ||
main::set_access('match_subdir', \%match_subdir, 'r'); | ||
main::set_access('match_subdir', \%match_subdir, 'r', 's'); | ||
|
||
my %has_dependency; | ||
# A boolean that gives whether some table somewhere is defined as the | ||
|
@@ -10049,17 +10058,28 @@ sub finish_property_setup($file) { | |
# file directly (it was documented in 5.12 and 5.14 as being thusly | ||
# usable), keep it from being adjusted. (range_size_1 is | ||
# used to force the traditional format.) | ||
if (defined (my $nfkc_cf = property_ref('NFKC_Casefold'))) { | ||
$nfkc_cf->set_to_output_map($EXTERNAL_MAP); | ||
$nfkc_cf->set_range_size_1(1); | ||
foreach my $property (qw(NFKC_Casefold NFKC_Simple_Casefold)) { | ||
if (defined (my $cf = property_ref($property))) { | ||
$cf->set_to_output_map($EXTERNAL_MAP); | ||
$cf->set_range_size_1(1); | ||
} | ||
} | ||
|
||
if (defined (my $bmg = property_ref('Bidi_Mirroring_Glyph'))) { | ||
$bmg->set_to_output_map($EXTERNAL_MAP); | ||
$bmg->set_range_size_1(1); | ||
} | ||
|
||
property_ref('Numeric_Value')->set_to_output_map($OUTPUT_ADJUSTED); | ||
|
||
# These two properties have no short names and the file names for them | ||
# clash in DOS 8.3. Work around this by creating shorter file names that | ||
Where are we still limited by 8.3?
On IRC the other day, I asked if we were still limited, and the answer was yes.
For Unicode filenames yes, but for ASCII filenames we aren't, AFAIK.
||
# work | ||
my $IDCMStart = property_ref("ID_Compat_Math_Start"); | ||
$IDCMStart->set_match_subdir("IDCMStart") if defined $IDCMStart; | ||
my $IDCMCont= property_ref("ID_Compat_Math_Continue"); | ||
$IDCMCont->set_match_subdir("IDCMContinue") if defined $IDCMCont; | ||
|
||
# The rest of this sub is for properties that need the Multi_Default class | ||
# to create objects for defaults. As of v15.0, this is no longer needed. | ||
|
||
|
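Note on the hunk above: the two `ID_Compat_Math_*` names share a long common prefix, which is why they can collide once shortened. The sketch below illustrates this under a stated assumption: names are shortened by dropping underscores and keeping the first eight characters, case-insensitively. That rule and the helper `short_name_8_3()` are illustrative only; the real name construction in mktables is more involved.

```perl
use strict;
use warnings;

# Assumed shortening rule (illustration only): drop underscores, keep the
# first eight characters, and ignore case.
sub short_name_8_3 {
    my ($name) = @_;
    (my $squeezed = $name) =~ s/_//g;
    return uc substr($squeezed, 0, 8);
}

# Group the full names by their shortened form.
my %seen;
for my $name (qw(ID_Compat_Math_Start ID_Compat_Math_Continue
                 IDCMStart IDCMContinue))
{
    my $short = short_name_8_3($name);
    push @{ $seen{$short} }, $name;
}

for my $short (sort keys %seen) {
    my @names = @{ $seen{$short} };
    printf "%-8s <- %s%s\n", $short, join(', ', @names),
           @names > 1 ? '  (clash)' : '';
}
```

Under that assumption the two property names map to the same short form, while the subdirectory names chosen in the patch stay distinct in their first eight characters.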
@@ -13736,6 +13756,10 @@ END | |
next if $range->start == 0x1D7CE; # This whole range was added in 3.1 | ||
next if $range->end == 0x19DA && $v_version eq v5.2.0; | ||
next if $range->end - $range->start < 9 && $v_version le 4.0.0; | ||
|
||
# 2 sequential series of 10 each were added in 16.0 | ||
next if $range->start == 0x116D0 && $range->end == 0x116E3; | ||
|
||
Carp::my_carp("Range $range unexpectedly doesn't contain 10" | ||
. " decimal digits. Code in regcomp.c assumes it does," | ||
. " and will have to be fixed. Proceeding anyway."); | ||
|
@@ -15179,11 +15203,11 @@ END | |
|
||
# Perl tailors the WordBreak property so that \b{wb} doesn't split | ||
# adjacent spaces into separate words. Unicode 11.0 moved in that | ||
# direction, but left TAB, FIGURE SPACE (U+2007), and (ironically) NO | ||
# BREAK SPACE as breaking, so we retained the original Perl customization. | ||
# To do this, in the Perl copy of WB, simply replace the mappings of | ||
# horizontal space characters that otherwise would map to the default or | ||
# the 11.0 'WSegSpace' to instead map to our tailoring. | ||
# direction, but left TAB, FIGURE SPACE (U+2007), and (ironically) | ||
# NO_BREAK SPACE as breaking, so we retained the original Perl | ||
# customization. To do this, in the Perl copy of WB, simply replace the | ||
# mappings of horizontal space characters that otherwise would map to the | ||
# default or the 11.0 'WSegSpace' to instead map to our tailoring. | ||
my $perl_wb = property_ref('_Perl_WB'); | ||
my $default = $perl_wb->default_map; | ||
for my $range ($Blank->ranges) { | ||
|
@@ -15225,6 +15249,46 @@ END | |
} | ||
} | ||
|
||
# In Unicode 15.1, the InCB property was added, which causes us to have to | ||
# split GCB into subclasses that match various subclasses of InCB | ||
my $perl_gcb = property_ref('_Perl_GCB'); | ||
my $incb = property_ref('InCB'); | ||
if (defined $perl_gcb && defined $incb) { | ||
|
||
# For each class in GCB ... | ||
foreach my $gcb_table ($perl_gcb->tables) { | ||
my $gcb_name = $gcb_table->name; | ||
|
||
# ... we see if it has any code points that are in the three | ||
# classes of interest in INCB. | ||
foreach my $incb_table ($incb->table('Consonant'), | ||
$incb->table('Extend'), | ||
$incb->table('Linker')) | ||
{ | ||
my $intersection = $gcb_table & $incb_table; | ||
|
||
# If the intersection is empty, then nothing need be done. | ||
next unless $intersection->ranges; | ||
|
||
# Likewise if the intersection doesn't subtract anything, | ||
# nothing need be done. | ||
next if $gcb_table->matches_identically_to($intersection); | ||
|
||
# Otherwise, construct a new table consisting of the | ||
# intersection, removing its entries from the existing GCB | ||
# table. The name of the new table is the combination of the | ||
# GCB and InCB table names | ||
my $incb_name = $incb_table->name; | ||
my $combined_name = "${gcb_name}_$incb_name"; | ||
|
||
foreach my $range ($intersection->ranges) { | ||
$perl_gcb->replace_map($range->start, $range->end, | ||
$combined_name); | ||
} | ||
} | ||
} | ||
} | ||
|
||
# Create a version of the LineBreak property with the mappings that are | ||
# omitted in the default algorithm remapped to what | ||
# http://www.unicode.org/reports/tr14 says they should be. | ||
|
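Note on the hunk above: the GCB/InCB split can be pictured with plain hashes instead of mktables' table objects. This is a rough sketch, not the patch's implementation. The hashes `%gcb`, `%incb`, and `%split` are hypothetical stand-ins, and the five code points are a small hand-picked sample whose classes should be checked against the UCD rather than taken from here.

```perl
use strict;
use warnings;

# Toy stand-ins for the two properties: code point => class name.
my %gcb = (
    0x0041 => 'Other',     # LATIN CAPITAL LETTER A
    0x0915 => 'Other',     # DEVANAGARI LETTER KA
    0x0301 => 'Extend',    # COMBINING ACUTE ACCENT
    0x094D => 'Extend',    # DEVANAGARI SIGN VIRAMA
    0x200D => 'ZWJ',       # ZERO WIDTH JOINER
);
my %incb = (
    0x0915 => 'Consonant',
    0x0301 => 'Extend',
    0x094D => 'Linker',
    0x200D => 'Extend',
);

my %split = %gcb;    # result: starts as a copy of GCB
my %gcb_classes = map { $_ => 1 } values %gcb;

for my $gcb_name (sort keys %gcb_classes) {
    my @members = grep { $gcb{$_} eq $gcb_name } keys %gcb;

    for my $incb_name (qw(Consonant Extend Linker)) {
        my @both = grep { ($incb{$_} // '') eq $incb_name } @members;

        next unless @both;            # empty intersection: nothing to do
        next if @both == @members;    # whole class matches: no split needed

        # Move the intersection into a combined class.
        $split{$_} = "${gcb_name}_$incb_name" for @both;
    }
}

printf "U+%04X  %s\n", $_, $split{$_} for sort { $a <=> $b } keys %split;
```

In this sample the ZWJ class is left alone because its only member falls entirely inside one InCB class, which mirrors the `matches_identically_to` check in the patch.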
@@ -15324,8 +15388,27 @@ END | |
} | ||
} | ||
} | ||
elsif ($v_version ge 15.1.0 && $value eq standardize('Quotation')) { | ||
|
||
# Unicode 15.1 splits LB=QU initial quotes and final quotes, and | ||
# regular quotes | ||
for my $i ($range->start .. $range->end) { | ||
my $gc_val = $gc->value_of($i); | ||
if ($gc_val eq 'Pi') { | ||
$perl_lb->replace_map($i, $i, "Initial_Quote"); | ||
} | ||
elsif ($gc_val eq 'Pf') { | ||
$perl_lb->replace_map($i, $i, "Final_Quote"); | ||
} | ||
} | ||
} | ||
} | ||
|
||
# This is an Alphabetic, but it doesn't need to be split off, because no | ||
# current rule involving Alphabetics requires not including this. | ||
$perl_lb->replace_map(0x25CC, 0x25CC, "Dotted_Circle") | ||
if $v_version ge 15.1.0; | ||
|
||
# This property is a modification of the scx property | ||
my $perl_scx = Property->new('_Perl_SCX', | ||
Fate => $INTERNAL_ONLY, | ||
|
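Note on the hunk above: the quote split keys off the General_Category of each code point. A small sketch of the same decision follows, using Perl's own `\p{Pi}` and `\p{Pf}` regex properties in place of `$gc->value_of`. The function `quote_class()` is hypothetical, and it assumes its argument is already known to have Line_Break=Quotation.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a code point that is assumed to already
# have Line_Break=Quotation.
sub quote_class {
    my ($cp) = @_;
    my $char = chr $cp;

    return 'Initial_Quote' if $char =~ /\p{Pi}/;   # gc=Initial_Punctuation
    return 'Final_Quote'   if $char =~ /\p{Pf}/;   # gc=Final_Punctuation
    return 'Quotation';                            # everything else stays QU
}

printf "U+%04X  %s\n", $_, quote_class($_)
    for 0x0022, 0x2018, 0x2019, 0x201C, 0x201D;
```

The result depends on the Unicode tables of the perl that runs it, whereas mktables consults the General_Category data of the version being compiled.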
@@ -19778,13 +19861,21 @@ my @input_file_objects = ( | |
), | ||
Input_file->new('IdStatus.txt', v13.0.0, | ||
Pre_Handler => \&setup_IdStatus, | ||
Has_Missings_Defaults => $IGNORED, | ||
Property => 'Identifier_Status', | ||
|
||
# Part of UTS 39, so must be downloaded separately from | ||
# unicode.org | ||
UCD => 0, | ||
), | ||
Input_file->new('IdType.txt', v13.0.0, | ||
Pre_Handler => \&setup_IdType, | ||
Has_Missings_Defaults => $IGNORED, | ||
Each_Line_Handler => \&filter_IdType_line, | ||
Property => 'Identifier_Type', | ||
|
||
# Part of UTS 39, so must be downloaded separately from | ||
# unicode.org | ||
UCD => 0, | ||
), | ||
Input_file->new('confusables.txt', v15.0.0, | ||
|
@@ -19799,6 +19890,18 @@ my @input_file_objects = ( | |
Skip => $Unused_Skip, | ||
UCD => 0, | ||
), | ||
Input_file->new('Unikemet.txt', v16.0.0, | ||
# For Egyptian Hieroglyphs; is in an alien format to the | ||
# other files Unicode furnishes. | ||
Skip => $Unused_Skip, | ||
UCD => 0, | ||
), | ||
Input_file->new('DoNotEmit.txt', v16.0.0, | ||
# Advice about characters that are unwise to create; not | ||
# any properties, though we could create some. | ||
Skip => $Unused_Skip, | ||
UCD => 0, | ||
), | ||
); | ||
|
||
# End of all the preliminaries. | ||
|
What does it do? And why would we not want to support it?
I think it important to get the next Perl version shipped with the latest Unicode release, and I think in order to do this, it has to be in the upcoming development release due out in the next day or two. Getting this to work in time is lower priority than getting the rest to work in time. These could be legally fixed in the next development release next month. And since the bus factor for getting it in is 1, I don't think the comments should promise anything.
For information, see https://www.unicode.org/reports/tr57/tr57-3.html
"are not handled by Perl" is ambiguous. It could be read as "are not to be handled" (so don't add them) or "are not handled yet".
Changed to
These properties of Egyptian hieroglyphs are not yet handled by Perl. Their