
Commit 38ecdd2

Generalize scripts and add RHEL support, add two new string patterns, add total locale count
1 parent 3476950 commit 38ecdd2

39 files changed, +233376 -60 lines changed

README.md

+44 -12
@@ -1,5 +1,5 @@
 
-# Collation Changes Across Ubuntu Versions
+# Collation Changes Across Linux Versions
 
 ## Methodology
 
@@ -9,22 +9,26 @@ locale data files.
 
 Comparing the results of actual sorts should catch any changes to default
 sorting which is not defined in the OS collation data. A simple perl script is
-used to generate a text file containing 14 different strings for every legal
+used to generate a text file containing 16 different strings for every legal
 unicode character. The unix "sort" utility processes this file with the locale
-configured to en_US for collation. This process is repeated on each Ubuntu
+configured to en_US for collation. This process is repeated on each
 release from the past 10 years, and then the unix "diff" utility is used to
 compare the sorted output files and count how many characters have different
 positions after sorting. The results show how many individual code points have
 changed positions in the sorted data across different Operating System releases
 and which Unicode Blocks contain the changed code points.
 
-The Operating System locale data files are compared directly. The results show
-the total number of lines in the data files that are changed, and which locales
-contain the changes.
+The Operating System locale data files from `/usr/share/i18n/locales` are
+compared directly. The results show the total number of lines in the data files
+that are changed, and which locales contain the changes.
 
 
 ## Results
 
+### Ubuntu
+
+*Note: Generated with an older version of scripts; not yet updated. This Ubuntu table may be missing some changes.*
+
 | GLIBC Version | Total Detected en_US Sort Order Changes | Unicode Blocks of Detected en_US Sort Order Changes | Total Detected Collation Data File Changes | Locales of Detected Data File Changes | Operating System | AMI |
 | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
 | 2.11.1-0ubuntu7.10 | | | | | Ubuntu 10.04.4 LTS | [ami-0baf7662](_ubuntu/ami-0baf7662) |
@@ -53,13 +57,30 @@ contain the changes.
 | 2.34-0ubuntu3 | 0 | | [2](_ubuntu/ami-00482f016b2410dc8/changelist_locales_from-2.33-0ubuntu5_to-2.34-0ubuntu3.txt) | sv_SE | Ubuntu 21.10 | [ami-00482f016b2410dc8](_ubuntu/ami-00482f016b2410dc8) |
 
 
+### Red Hat Enterprise Linux
+
+| GLIBC Version | Total Detected en_US Sort Order Changes | Unicode Blocks of Detected en_US Sort Order Changes | Total Detected Collation Data File Changes | Locales of Detected Data File Changes | Number of Locales | Operating System | AMI |
+| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| 2.5-49.el5_5.7 | | | | | 231 | Red Hat Enterprise Linux Server release 5.5 (Tikanga) | [ami-eb84ed82](_rhel/ami-eb84ed82) |
+| 2.5-1232.5-123 | 0 | | 0 | | 231 | Red Hat Enterprise Linux Server release 5.11 (Tikanga) | [ami-3268da5a](_rhel/ami-3268da5a) |
+| 2.12-1.7.el6_0.8 | [22908](_rhel/ami-09680160/changelist_en-US_from-2.5-1232.5-123_to-2.12-1.7.el6_0.8.txt) | 4 Basic Latin, 10 Latin-1 Supplement, 18 Latin Extended-A, 131 Latin Extended-B, 9 IPA Extensions, 206 Cyrillic, 16 Cyrillic Supplement, 76 Armenian, 26 Hebrew, 45 Arabic, 108 Devanagari, 86 Bengali, 79 Gurmukhi, 82 Gujarati, 58 Tamil, 93 Telugu, 86 Kannada, 82 Malayalam, 80 Sinhala, 130 Myanmar, 82 Georgian, 246 Latin Extended Additional, 1 Miscellaneous Symbols, 38 Georgian Supplement, 55 Tifinagh, 20902 CJK Unified Ideographs, 34 Arabic Presentation Forms-A, 125 Arabic Presentation Forms-B | [16282](_rhel/ami-09680160/changelist_locales_from-2.5-1232.5-123_to-2.12-1.7.el6_0.8.txt) | (More than 20 languages) | 275 | Red Hat Enterprise Linux Server release 6.0 (Santiago) | [ami-09680160](_rhel/ami-09680160) |
+| 2.12-1.212.el6_10.3 | 0 | | [42](_rhel/ami-0351faf7328fdb373/changelist_locales_from-2.12-1.7.el6_0.8_to-2.12-1.212.el6_10.3.txt) | fi_FI | 275 | Red Hat Enterprise Linux Server release 6.10 (Santiago) | [ami-0351faf7328fdb373](_rhel/ami-0351faf7328fdb373) |
+| 2.17-55.el7_0.5 | [107](_rhel/ami-60a1e808/changelist_en-US_from-2.12-1.212.el6_10.3_to-2.17-55.el7_0.5.txt) | 107 Tibetan | [2168](_rhel/ami-60a1e808/changelist_locales_from-2.12-1.212.el6_10.3_to-2.17-55.el7_0.5.txt) | dz_BT, hu_HU, iso14651_t1_common, se_NO, ug_CN, no_NO (removed) | 300 | Red Hat Enterprise Linux Server release 7.0 (Maipo) | [ami-60a1e808](_rhel/ami-60a1e808) |
+| 2.17-317.el7 | 0 | | 0 | | 300 | Red Hat Enterprise Linux Server release 7.9 (Maipo) | [ami-005b7876121b7244d](_rhel/ami-005b7876121b7244d) |
+| 2.28-42.el8_0.1 | [282167](_rhel/ami-043fbed28a389c721/changelist_en-US_from-2.17-317.el7_to-2.28-42.el8_0.1.txt) | (Blocks not listed for this many en_US sort order changes) | [112164](_rhel/ami-043fbed28a389c721/changelist_locales_from-2.17-317.el7_to-2.28-42.el8_0.1.txt) | (More than 20 languages) | 341 | Red Hat Enterprise Linux release 8.0 (Ootpa) | [ami-043fbed28a389c721](_rhel/ami-043fbed28a389c721) |
+| 2.28-164.el8 | 0 | | [10](_rhel/ami-06644055bed38ebd9/changelist_locales_from-2.28-42.el8_0.1_to-2.28-164.el8.txt) | C | 341 | Red Hat Enterprise Linux release 8.5 (Ootpa) | [ami-06644055bed38ebd9](_rhel/ami-06644055bed38ebd9) |
+| 2.34-7.el9_b | 0 | | [543](_rhel/ami-0fb33ec3ead0b8e3f/changelist_locales_from-2.28-164.el8_to-2.34-7.el9_b.txt) | C, or_IN, sv_SE | 343 | Red Hat Enterprise Linux release 9.0 Beta (Plow) | [ami-0fb33ec3ead0b8e3f](_rhel/ami-0fb33ec3ead0b8e3f) |
+
+
 ## Generated Strings for en_US Sort Order Comparison
 
-For every legal unicode code point, the following 14 string patterns are generated:
+For every legal unicode code point, the following 16 string patterns are generated:
 
 ```
 🍷
 🍷🍷
+1B-🍷B
+1B🍷B
 🍷🍷B
 🍷B
 🍷🍷BB
@@ -74,9 +95,9 @@ D🍷🍷D
 D🍷D
 ```
 
-Note that the string patterns are listed above in correctly sorted order. This
-alone should give some sense about the sophistication of collation rules, and
-the difficulty of writing a test to catch changes.
+Note that the string patterns are listed above in Red Hat Enterprise Linux 8
+correctly sorted order. This alone should give some sense about the sophistication
+of collation rules, and the difficulty of writing a test to catch changes.
 
 
 ## Caveats
@@ -136,9 +157,20 @@ $ ( echo 1-; echo 11; echo 1-1; echo 111; echo 1a; echo 1b; echo 1-aa; echo 1-a)
 The script `table.sh` generates the table above.
 
 The data is generated by running the following command using the DNS or IP of a
-ubuntu server:
+linux server:
+
+```
+test-host.sh [ubuntu|rhel] $USER@$HOST
+```
+
+I searched public community AMIs on AWS to find old versions of linux. Older
+versions of RHEL might not have an ec2-user account (I just used root), and
+newer versions of RHEL might not come with perl or glibc-locale-source installed
+by default. Newer versions of Ubuntu require keyboard input when running some
+dpkg commands (a warning about this appears when running the `test-host.sh` script).
 
 ```
-ubuntu.sh $HOST
+sudo yum install perl
+sudo yum install glibc-locale-source-$(rpm -q glibc --queryformat '%{version}-%{release}')
 ```
 

@@ -0,0 +1 @@
+CodePoint String PositionChange
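
A changelist that contains only this header presumably means no position changes were detected between the two releases. How `table.sh` builds these files is not shown in this commit, so the following is only a sketch of the counting idea from the Methodology text: diff two sorted output files and count the distinct code points that appear on differing lines. The function name `count_changed` is hypothetical, and the sketch assumes every line ends with the 8-digit hex code point, as in the generated data.

```shell
# Hypothetical helper, not taken from the repository.
# $1 and $2 are sorted output files from two different glibc versions.
count_changed() {
  # diff exits non-zero when the files differ, so neutralise its status
  { diff "$1" "$2" || true; } |
    # keep only added/removed lines and extract the trailing 8 hex digits
    LC_ALL=C sed -n 's/^[<>] .*\(........\)$/\1/p' |
    # one entry per code point, then count them
    LC_ALL=C sort -u | wc -l | tr -d ' '
}

# Example: count_changed sorted-old.txt sorted-new.txt
```

The count is per code point rather than per line, since each code point contributes 16 strings and several of them may move together.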

_rhel/ami-005b7876121b7244d/changelist_locales_from-2.17-55.el7_0.5_to-2.17-317.el7.txt

Whitespace-only changes.

_rhel/ami-005b7876121b7244d/run.out

+183
@@ -0,0 +1,183 @@
++ export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
++ LANG=en_US.UTF-8
++ LC_ALL=en_US.UTF-8
++ date
+Fri Dec 17 01:40:05 UTC 2021
+++ dirname glibc-unicode-sorting/run.sh
++ cd glibc-unicode-sorting
++ pwd
+/home/ec2-user/glibc-unicode-sorting
++ which dpkg
+which: no dpkg in (/usr/local/bin:/usr/bin)
++ which rpm
+/usr/bin/rpm
++ grep -E '(glibc|langpack)'
++ rpm -qa
+glibc-2.17-317.el7.x86_64
+glibc-common-2.17-317.el7.x86_64
+++ curl -s http://169.254.169.254/latest/meta-data/ami-id
++ SOURCE_AMI=ami-005b7876121b7244d
+++ cat /etc/issue
++ OS_VERS='\S
+Kernel \r on an \m'
++ UNICODE_VERS=14
++ which dpkg
+which: no dpkg in (/usr/local/bin:/usr/bin)
++ which rpm
+/usr/bin/rpm
+++ rpm -q glibc --queryformat '%{version}-%{release}'
++ GLIBC_VERS=2.17-317.el7
++ '[' -f /etc/os-release ']'
++ cat /etc/os-release
+NAME="Red Hat Enterprise Linux Server"
+VERSION="7.9 (Maipo)"
+ID="rhel"
+ID_LIKE="fedora"
+VARIANT="Server"
+VARIANT_ID="server"
+VERSION_ID="7.9"
+PRETTY_NAME="Red Hat Enterprise Linux Server 7.9 (Maipo)"
+ANSI_COLOR="0;31"
+CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
+HOME_URL="https://www.redhat.com/"
+BUG_REPORT_URL="https://bugzilla.redhat.com/"
+
+REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
+REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
+REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
+REDHAT_SUPPORT_PRODUCT_VERSION="7.9"
++ '[' -f /etc/system-release ']'
++ cat /etc/system-release
+Red Hat Enterprise Linux Server release 7.9 (Maipo)
++ '[' -f /etc/system-release-cpe ']'
++ cat /etc/system-release-cpe
+cpe:/o:redhat:enterprise_linux:7.9:ga:server
++ curl -kO https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt
+% Total % Received % Xferd Average Speed Time Time Time Current
+Dload Upload Total Spent Left Speed
+0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0100 1853k 100 1853k 0 0 3206k 0 --:--:-- --:--:-- --:--:-- 3211k
+
+real 0m0.583s
+user 0m0.088s
+sys 0m0.073s
++ perl '-naF;' -CO '-e
+sub pr3 {printf("%s%08x\n",$_[1],$_[0])}
+sub pr2 {pr3($_[0],"B".$_[1]."B");pr3($_[0],"D".$_[1]."D");pr3($_[0],$_[1]);pr3($_[0],$_[1]."B");pr3($_[0],$_[1]."BB");pr3($_[0],$_[1]."D");pr3($_[0],$_[1]."DD")}
+sub pr {pr2($_[0],$_[1].chr($_[0]));pr2($_[0],$_[1].chr($_[0]).chr($_[0]));pr3($_[0],"1B".chr($_[0])."B");pr3($_[0],"1B-".chr($_[0])."B")}
+if(/<control>/){next}; # skip control characters
+if($F[2] eq "Cs"){next}; # skip surrogates
+if(/ First>/){$fi=hex("0x".$F[0]);next}; # generate blocks
+if(/ Last>/){$la=hex("0x".$F[0]);for($fi..$la){pr($_)};next};
+pr(hex("0x".$F[0])) # generate individual characters
+' UnicodeData.txt
++ split -l500000 - _base-characters
+
+real 0m5.418s
+user 0m5.303s
+sys 0m0.071s
++ wc _base-charactersaa _base-charactersab _base-charactersac _base-charactersad _base-charactersae _base-charactersaf _base-charactersag _base-charactersah _base-charactersai _base-charactersaj
+500000 500090 7453517 _base-charactersaa
+500000 500000 7512259 _base-charactersab
+500000 500000 8218750 _base-charactersac
+500000 500000 8218750 _base-charactersad
+500000 500000 8218750 _base-charactersae
+500000 500000 8218750 _base-charactersaf
+500000 500000 8218750 _base-charactersag
+500000 500000 8218750 _base-charactersah
+500000 500000 8218750 _base-charactersai
+14640 14640 240645 _base-charactersaj
+4514640 4514730 72737671 total
++ locale
+LANG=en_US.UTF-8
+LC_CTYPE="en_US.UTF-8"
+LC_NUMERIC="en_US.UTF-8"
+LC_TIME="en_US.UTF-8"
+LC_COLLATE="en_US.UTF-8"
+LC_MONETARY="en_US.UTF-8"
+LC_MESSAGES="en_US.UTF-8"
+LC_PAPER="en_US.UTF-8"
+LC_NAME="en_US.UTF-8"
+LC_ADDRESS="en_US.UTF-8"
+LC_TELEPHONE="en_US.UTF-8"
+LC_MEASUREMENT="en_US.UTF-8"
+LC_IDENTIFICATION="en_US.UTF-8"
+LC_ALL=en_US.UTF-8
++ date
+Fri Dec 17 01:40:12 UTC 2021
+++ ls -1 _base-charactersaa _base-charactersab _base-charactersac _base-charactersad _base-charactersae _base-charactersaf _base-charactersag _base-charactersah _base-charactersai _base-charactersaj
++ for FILE in '$(ls -1 _base-characters*)'
++ for FILE in '$(ls -1 _base-characters*)'
++ for FILE in '$(ls -1 _base-characters*)'
++ for FILE in '$(ls -1 _base-characters*)'
++ for FILE in '$(ls -1 _base-characters*)'
++ for FILE in '$(ls -1 _base-characters*)'
++ for FILE in '$(ls -1 _base-characters*)'
++ for FILE in '$(ls -1 _base-characters*)'
++ for FILE in '$(ls -1 _base-characters*)'
++ for FILE in '$(ls -1 _base-characters*)'
++ jobs
+[1] Running sort $FILE -o _s$FILE &
+[2] Running sort $FILE -o _s$FILE &
+[3] Running sort $FILE -o _s$FILE &
+[4] Running sort $FILE -o _s$FILE &
+[5] Running sort $FILE -o _s$FILE &
+[6] Running sort $FILE -o _s$FILE &
+[7] Running sort $FILE -o _s$FILE &
+[8] Running sort $FILE -o _s$FILE &
+[9]- Running sort $FILE -o _s$FILE &
+[10]+ Running sort $FILE -o _s$FILE &
++ wait
++ sort _base-charactersaj -o _s_base-charactersaj
++ sort _base-charactersae -o _s_base-charactersae
++ sort _base-charactersaf -o _s_base-charactersaf
++ sort _base-charactersaa -o _s_base-charactersaa
++ sort _base-charactersab -o _s_base-charactersab
++ sort _base-charactersac -o _s_base-charactersac
++ sort _base-charactersag -o _s_base-charactersag
++ sort _base-charactersah -o _s_base-charactersah
++ sort _base-charactersai -o _s_base-charactersai
++ sort _base-charactersad -o _s_base-charactersad
++ date
+Fri Dec 17 01:40:48 UTC 2021
++ sort -m _s_base-charactersaa _s_base-charactersab _s_base-charactersac _s_base-charactersad _s_base-charactersae _s_base-charactersaf _s_base-charactersag _s_base-charactersah _s_base-charactersai _s_base-charactersaj -o unicode-14-chars-sorted-glibc-2.17-317.el7.txt
+
+real 0m2.245s
+user 0m2.164s
+sys 0m0.069s
++ rm -v _base-charactersaa _base-charactersab _base-charactersac _base-charactersad _base-charactersae _base-charactersaf _base-charactersag _base-charactersah _base-charactersai _base-charactersaj _s_base-charactersaa _s_base-charactersab _s_base-charactersac _s_base-charactersad _s_base-charactersae _s_base-charactersaf _s_base-charactersag _s_base-charactersah _s_base-charactersai _s_base-charactersaj UnicodeData.txt
+removed ‘_base-charactersaa’
+removed ‘_base-charactersab’
+removed ‘_base-charactersac’
+removed ‘_base-charactersad’
+removed ‘_base-charactersae’
+removed ‘_base-charactersaf’
+removed ‘_base-charactersag’
+removed ‘_base-charactersah’
+removed ‘_base-charactersai’
+removed ‘_base-charactersaj’
+removed ‘_s_base-charactersaa’
+removed ‘_s_base-charactersab’
+removed ‘_s_base-charactersac’
+removed ‘_s_base-charactersad’
+removed ‘_s_base-charactersae’
+removed ‘_s_base-charactersaf’
+removed ‘_s_base-charactersag’
+removed ‘_s_base-charactersah’
+removed ‘_s_base-charactersai’
+removed ‘_s_base-charactersaj’
+removed ‘UnicodeData.txt’
++ ls -ltr
+total 71048
+-rw-r--r--. 1 ec2-user ec2-user 1794 Dec 17 01:40 run.sh
+-rw-rw-r--. 1 ec2-user ec2-user 72737671 Dec 17 01:40 unicode-14-chars-sorted-glibc-2.17-317.el7.txt
+-rw-rw-r--. 1 ec2-user ec2-user 6964 Dec 17 01:40 run.out
++ wc unicode-14-chars-sorted-glibc-2.17-317.el7.txt
+4514640 4514730 72737671 unicode-14-chars-sorted-glibc-2.17-317.el7.txt
++ echo 1-1
++ echo 11
++ LC_COLLATE=en_US.UTF-8
++ sort
+11
+1-1
++ date
+Fri Dec 17 01:40:51 UTC 2021
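
The last command in this log is a quick probe: it sorts `1-1` and `11` under `en_US.UTF-8`, and on this glibc 2.17 host `11` comes first. A rough sketch of the same probe as a reusable function follows. The name `probe_order` is hypothetical, and the `C` locale is shown only as a byte-order baseline, where `-` (0x2D) sorts before `1` (0x31).

```shell
# Hypothetical helper: sort the two probe strings under the locale given
# in $1 and print the resulting order on a single line.
probe_order() {
  printf '1-1\n11\n' | LC_ALL=$1 sort | paste -s -d ' ' -
}

probe_order C             # byte order: prints "1-1 11"
probe_order en_US.UTF-8   # result depends on the locale data installed on the host
```

Comparing the two lines on a given host is a cheap way to see whether its `en_US` collation differs from plain byte order before running the full test.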

_rhel/ami-005b7876121b7244d/run.sh

+45
@@ -0,0 +1,45 @@
+set -x -e
+export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
+date
+cd $(dirname $0)
+pwd
+which dpkg && dpkg -l libc6 locales
+which rpm && rpm -qa|grep -E '(glibc|langpack)'
+SOURCE_AMI=$(curl -s http://169.254.169.254/latest/meta-data/ami-id)
+OS_VERS=$(cat /etc/issue)
+UNICODE_VERS="14"
+which dpkg && GLIBC_VERS="$(dpkg -l libc6|awk '/libc6/{print$3}')"
+which rpm && GLIBC_VERS="$(rpm -q glibc --queryformat '%{version}-%{release}')"
+[ -f /etc/os-release ] && cat /etc/os-release
+[ -f /etc/system-release ] && cat /etc/system-release
+[ -f /etc/system-release-cpe ] && cat /etc/system-release-cpe
+
+time curl -kO https://www.unicode.org/Public/${UNICODE_VERS}.0.0/ucd/UnicodeData.txt
+
+time perl -naF';' -CO -e'
+sub pr3 {printf("%s%08x\n",$_[1],$_[0])}
+sub pr2 {pr3($_[0],"B".$_[1]."B");pr3($_[0],"D".$_[1]."D");pr3($_[0],$_[1]);pr3($_[0],$_[1]."B");pr3($_[0],$_[1]."BB");pr3($_[0],$_[1]."D");pr3($_[0],$_[1]."DD")}
+sub pr {pr2($_[0],$_[1].chr($_[0]));pr2($_[0],$_[1].chr($_[0]).chr($_[0]));pr3($_[0],"1B".chr($_[0])."B");pr3($_[0],"1B-".chr($_[0])."B")}
+if(/<control>/){next}; # skip control characters
+if($F[2] eq "Cs"){next}; # skip surrogates
+if(/ First>/){$fi=hex("0x".$F[0]);next}; # generate blocks
+if(/ Last>/){$la=hex("0x".$F[0]);for($fi..$la){pr($_)};next};
+pr(hex("0x".$F[0])) # generate individual characters
+' UnicodeData.txt |split -l500000 - _base-characters
+
+wc _base-characters*
+
+locale
+
+date
+for FILE in $(ls -1 _base-characters*); do sort $FILE -o _s$FILE & done; jobs; wait
+date
+
+time sort -m _s_base-characters* -o unicode-${UNICODE_VERS}-chars-sorted-glibc-${GLIBC_VERS}.txt
+
+rm -v _base-characters* _s_base-characters* UnicodeData.txt
+ls -ltr
+wc unicode-${UNICODE_VERS}-chars-sorted-glibc-${GLIBC_VERS}.txt
+
+( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.UTF-8 sort
+date
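
For readers who do not read Perl, the string generation in the one-liner above can be approximated in plain shell. This is only a sketch and not part of the commit: the function name `gen_patterns` and its two arguments (the character itself and its code point in hex) are assumptions made for illustration. It prints 16 strings per character, each followed by the zero-padded hex code point, in the order the Perl `pr` subroutine appears to emit them.

```shell
# Hypothetical re-statement of the Perl pr/pr2/pr3 subroutines.
# $1 = the character, $2 = its code point in hex (for example 61 for "a").
gen_patterns() {
  ch=$1
  cp=$2
  # Seven patterns for the single character, then seven for the doubled one
  for base in "$ch" "$ch$ch"; do
    for s in "B${base}B" "D${base}D" "$base" \
             "${base}B" "${base}BB" "${base}D" "${base}DD"; do
      printf '%s%08x\n' "$s" "0x$cp"
    done
  done
  # The two patterns added in this commit
  for s in "1B${ch}B" "1B-${ch}B"; do
    printf '%s%08x\n' "$s" "0x$cp"
  done
}

# Example: gen_patterns a 61
```

Calling `gen_patterns a 61` prints 16 lines, the first of which is `BaB00000061`. The trailing code point lets each sorted line be traced back to its character regardless of where it ends up.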
@@ -0,0 +1 @@
+CodePoint String PositionChange
