
Commit 6f92b06

lemire and Ubuntu authored
Trimming add-across (Neoverse N1) (#49)
* trying to reduce the cost of the 4-byte char
* adding results
* making some of the arm code prettier

Co-authored-by: Ubuntu <[email protected]>
1 parent bd4a55a commit 6f92b06

File tree

2 files changed: +46 -12 lines changed


README.md

+33
@@ -178,6 +178,39 @@ boost as the Neoverse V1.
 | Latin-Lipsum | 50 | 20 | 2.5 x |
 | Russian-Lipsum | 4.0 | 1.2 | 3.3 x |
 
+
+On a Neoverse N1 (Graviton 2), our validation function is 1.3 to over four times
+faster than the standard library.
+
+| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
+|:----------------|:-----------|:--------------------------|:-------------------|
+| Twitter.json | 12 | 8.7 | 1.4 x |
+| Arabic-Lipsum | 3.4 | 2.0 | 1.7 x |
+| Chinese-Lipsum | 3.4 | 2.6 | 1.3 x |
+| Emoji-Lipsum | 3.4 | 0.8 | 4.3 x |
+| Hebrew-Lipsum | 3.4 | 2.0 | 1.7 x |
+| Hindi-Lipsum | 3.4 | 1.6 | 2.1 x |
+| Japanese-Lipsum | 3.4 | 2.4 | 1.4 x |
+| Korean-Lipsum | 3.4 | 1.3 | 2.6 x |
+| Latin-Lipsum | 42 | 17 | 2.5 x |
+| Russian-Lipsum | 3.3 | 0.95 | 3.5 x |
+
+On a Neoverse N1 (Graviton 2), our validation function is up to three times
+faster than the standard library.
+
+| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
+|:----------------|:-----------|:--------------------------|:-------------------|
+| Twitter.json | 7.0 | 5.7 | 1.2 x |
+| Arabic-Lipsum | 2.2 | 0.9 | 2.4 x |
+| Chinese-Lipsum | 2.1 | 1.8 | 1.1 x |
+| Emoji-Lipsum | 1.8 | 0.7 | 2.6 x |
+| Hebrew-Lipsum | 2.0 | 0.9 | 2.2 x |
+| Hindi-Lipsum | 2.0 | 1.0 | 2.0 x |
+| Japanese-Lipsum | 2.1 | 1.7 | 1.2 x |
+| Korean-Lipsum | 2.2 | 1.0 | 2.2 x |
+| Latin-Lipsum | 24 | 13 | 1.8 x |
+| Russian-Lipsum | 2.1 | 0.7 | 3.0 x |
+
 One difficulty with ARM processors is that they have varied SIMD/NEON performance. For example, Neoverse N1 processors, not to be confused with the Neoverse V1 design used by AWS Graviton 3, have weak SIMD performance. Of course, one can pick and choose which approach is best and it is not necessary to apply SimdUnicode in all cases. We expect good performance on recent ARM-based Qualcomm processors.
 
 ## Building the library
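The remark above about picking and choosing amounts to run-time feature dispatch. As a rough illustration, a caller might gate a vectorized path on the standard .NET capability flags and fall back to the standard library elsewhere; this is only a sketch, not SimdUnicode's API (`ValidateVectorized` is a hypothetical stand-in, and `Utf8.IsValid` assumes .NET 8):

```csharp
using System;
using System.Runtime.Intrinsics.Arm;
using System.Text.Unicode;

internal static class Utf8ValidationDispatch
{
    // Pick a validator at run time: a vectorized routine where ARM64 AdvSimd
    // is available, the standard library everywhere else.
    public static bool IsValid(ReadOnlySpan<byte> utf8)
    {
        if (AdvSimd.Arm64.IsSupported)
        {
            return ValidateVectorized(utf8);
        }
        return Utf8.IsValid(utf8); // .NET 8+ portable check
    }

    // Placeholder for a NEON-accelerated validator (e.g. one built on the
    // routines changed in this commit); not SimdUnicode's public entry point.
    private static bool ValidateVectorized(ReadOnlySpan<byte> utf8)
        => Utf8.IsValid(utf8);
}
```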

src/UTF8.cs

+13 -12
@@ -1352,6 +1352,8 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
 Vector128<byte> fourthByte = Vector128.Create((byte)(0b11110000u - 0x80));
 Vector128<byte> v0f = Vector128.Create((byte)0x0F);
 Vector128<byte> v80 = Vector128.Create((byte)0x80);
+Vector128<byte> fourthByteMinusOne = Vector128.Create((byte)(0b11110000u - 1));
+Vector128<sbyte> largestcont = Vector128.Create((sbyte)-65); // -65 => 0b10111111
 // Performance note: we could process 64 bytes at a time for better speed in some cases.
 int start_point = processedLength;
 
@@ -1362,13 +1364,13 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
 {
 
     Vector128<byte> currentBlock = AdvSimd.LoadVector128(pInputBuffer + processedLength);
-    if (AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(AdvSimd.And(currentBlock, v80))).ToScalar() == 0)
+    if ((currentBlock & v80) == Vector128<byte>.Zero)
     // We could also use (AdvSimd.Arm64.MaxAcross(currentBlock).ToScalar() <= 127) but it is slower on some
     // hardware.
     {
         // We have an ASCII block, no need to process it, but
         // we need to check if the previous block was incomplete.
-        if (AdvSimd.Arm64.MaxAcross(prevIncomplete).ToScalar() != 0)
+        if (prevIncomplete != Vector128<byte>.Zero)
         {
             int off = processedLength >= 3 ? processedLength - 3 : processedLength;
             byte* invalidBytePointer = SimdUnicode.UTF8.SimpleRewindAndValidateWithErrors(16 - 3, pInputBuffer + processedLength - 3, inputLength - processedLength + 3);
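The pattern in this hunk (and in the hunks below) replaces an `AdvSimd.Arm64.MaxAcross` reduction with a direct comparison of the masked vector against `Vector128<byte>.Zero`, letting the JIT choose how to test "is any lane non-zero". A minimal, self-contained sketch of the equivalence, with illustrative names rather than the library's code:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

internal static class HighBitCheckDemo
{
    // Old style: AND with 0x80, then reduce across the vector with MaxAcross.
    private static bool AnyHighBitMaxAcross(Vector128<byte> block, Vector128<byte> v80)
        => AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(AdvSimd.And(block, v80))).ToScalar() != 0;

    // New style: compare the masked vector against zero; no explicit
    // across-vector reduction is spelled out in the source.
    private static bool AnyHighBitVectorCompare(Vector128<byte> block, Vector128<byte> v80)
        => (block & v80) != Vector128<byte>.Zero;

    private static void Main()
    {
        if (!AdvSimd.Arm64.IsSupported)
        {
            Console.WriteLine("ARM64 AdvSimd is not available on this machine.");
            return;
        }

        Vector128<byte> v80 = Vector128.Create((byte)0x80);
        Vector128<byte> ascii = Vector128.Create((byte)'A');       // pure ASCII block
        Vector128<byte> mixed = ascii.WithElement(5, (byte)0xC3);  // one two-byte lead byte

        Console.WriteLine(AnyHighBitMaxAcross(ascii, v80));        // False
        Console.WriteLine(AnyHighBitVectorCompare(ascii, v80));    // False
        Console.WriteLine(AnyHighBitMaxAcross(mixed, v80));        // True
        Console.WriteLine(AnyHighBitVectorCompare(mixed, v80));    // True
    }
}
```

Both forms answer the same question; the operator-based form has the side benefit of being portable, since `Vector128<T>` comparisons work on any target, whereas `MaxAcross` is ARM64-only.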
@@ -1402,7 +1404,7 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
            Vector128<byte> block4 = AdvSimd.LoadVector128(pInputBuffer + processedLength + localasciirun + 48);
            Vector128<byte> or = AdvSimd.Or(AdvSimd.Or(block1, block2), AdvSimd.Or(block3, block4));
 
-            if (AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(AdvSimd.And(or, v80))).ToScalar() != 0)
+            if ((or & v80) != Vector128<byte>.Zero)
            {
                break;
            }
@@ -1433,7 +1435,7 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
        // AdvSimd.Arm64.MaxAcross(error) works, but it might be slower
        // than AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(error)) on some
        // hardware:
-       if (AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(error)).ToScalar() != 0)
+       if (error != Vector128<byte>.Zero)
        {
            byte* invalidBytePointer;
            if (processedLength == 0)
@@ -1457,18 +1459,17 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
            return invalidBytePointer;
        }
        prevIncomplete = AdvSimd.SubtractSaturate(currentBlock, maxValue);
-       Vector128<sbyte> largestcont = Vector128.Create((sbyte)-65); // -65 => 0b10111111
        contbytes += -AdvSimd.Arm64.AddAcross(AdvSimd.CompareLessThanOrEqual(Vector128.AsSByte(currentBlock), largestcont)).ToScalar();
-
-       // computing n4 is more expensive than we would like:
-       Vector128<byte> fourthByteMinusOne = Vector128.Create((byte)(0b11110000u - 1));
        Vector128<byte> largerthan0f = AdvSimd.CompareGreaterThan(currentBlock, fourthByteMinusOne);
-       byte n4add = (byte)AdvSimd.Arm64.AddAcross(largerthan0f).ToScalar();
-       int negn4add = (int)(byte)-n4add;
-       n4 += negn4add;
+       if (largerthan0f != Vector128<byte>.Zero)
+       {
+           byte n4add = (byte)AdvSimd.Arm64.AddAcross(largerthan0f).ToScalar();
+           int negn4add = (int)(byte)-n4add;
+           n4 += negn4add;
+       }
    }
 }
-bool hasIncompete = AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(prevIncomplete)).ToScalar() != 0;
+bool hasIncompete = (prevIncomplete != Vector128<byte>.Zero);
 if (processedLength < inputLength || hasIncompete)
 {
     byte* invalidBytePointer;
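The conditional added around `AddAcross` above is the "trimming" in the commit title: most 16-byte blocks contain no 4-byte sequences, so the across-vector reduction can usually be skipped. A self-contained sketch of the same counting trick, with illustrative names rather than the library's code:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

internal static class FourByteLeadCountDemo
{
    // Counts bytes >= 0xF0 (lead bytes of 4-byte UTF-8 sequences) in a block.
    // CompareGreaterThan sets a matching lane to 0xFF, so the wrapping byte sum
    // of the mask is 256 - k for k >= 1 matches; negating it as a byte recovers k.
    private static int CountFourByteLeads(Vector128<byte> block)
    {
        Vector128<byte> fourthByteMinusOne = Vector128.Create((byte)(0b11110000u - 1)); // 0xEF
        Vector128<byte> mask = AdvSimd.CompareGreaterThan(block, fourthByteMinusOne);

        // The "trimming": skip the across-vector reduction when no lane matched,
        // which is the common case for most text.
        if (mask == Vector128<byte>.Zero)
        {
            return 0;
        }

        byte wrappedSum = (byte)AdvSimd.Arm64.AddAcross(mask).ToScalar();
        return (byte)-wrappedSum;
    }

    private static void Main()
    {
        if (!AdvSimd.Arm64.IsSupported)
        {
            Console.WriteLine("ARM64 AdvSimd is not available on this machine.");
            return;
        }

        Vector128<byte> block = Vector128.Create((byte)'x')
            .WithElement(3, (byte)0xF0)   // lead byte for U+10000..U+3FFFF
            .WithElement(9, (byte)0xF4);  // lead byte for U+100000..U+10FFFF

        Console.WriteLine(CountFourByteLeads(block)); // prints 2
    }
}
```

On a core such as the Neoverse N1, where across-vector reductions are comparatively expensive, skipping `AddAcross` on blocks with no 4-byte sequences is presumably where the trimming pays off.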
