diff --git a/xml/System.Text.Unicode/Utf8.xml b/xml/System.Text.Unicode/Utf8.xml index acd42e273a6..4a19b289320 100644 --- a/xml/System.Text.Unicode/Utf8.xml +++ b/xml/System.Text.Unicode/Utf8.xml @@ -54,7 +54,7 @@ When the method returns, the number of characters read from . When the method returns, the number of bytes written to . - to replace invalid UTF-16 sequences in with U+FFFD; to return if invalid characters are found in . + to replace invalid UTF-16 sequences in with the Unicode replacement character U+FFFD in ; to return if invalid UTF-16 sequences are found in . if the method should not return ; otherwise, . Converts a UTF-16 character span to a UTF-8 encoded byte span. @@ -64,13 +64,247 @@ ## Remarks -This method corresponds to the [UTF8Encoding.GetBytes](xref:System.Text.UTF8Encoding.GetBytes%2A) method, except that it has a different calling convention, different error handling mechanisms, and different performance characteristics. +This method corresponds to the [UTF8Encoding.GetBytes](xref:System.Text.UTF8Encoding.GetBytes%2A) method, except that it uses an -based calling convention and has different error handling mechanisms. -If 'replaceInvalidSequences' is `true`, the method replaces any ill-formed subsequences in `source` with U+FFFD in `destination` and continues processing the remainder of the buffer. Otherwise, the method returns if it encounters any ill-formed sequences. +The following sample shows how to use this to transcode a UTF-16 input buffer to a UTF-8 destination buffer, then from UTF-8 back to a UTF-16 destination buffer. -If the method returns an error code, the out parameters indicate how much of the data was successfully transcoded, and the location of the ill-formed subsequence can be deduced from these values. +```cs +/* + * First, transcode UTF-16 to UTF-8. + */ -If 'replaceInvalidSequences' is `true`, the method never returns . If 'isFinalBlock' is `true`, the method never returns . 
+Span<byte> utf8DestinationBytes = new byte[64];
+string utf16InputChars = "¿Cómo estás?"; // "How are you?" in Spanish
+OperationStatus opStatus = Utf8.FromUtf16(utf16InputChars, utf8DestinationBytes, out int charsRead, out int bytesWritten);
+Console.WriteLine($"Operation status: {opStatus}");
+Console.WriteLine($"{charsRead} chars read; {bytesWritten} bytes written.");
+
+if (opStatus != OperationStatus.Done)
+{
+    throw new Exception("Couldn't convert the entire buffer!");
+}
+
+Span<byte> slicedUtf8Bytes = utf8DestinationBytes.Slice(0, bytesWritten);
+
+// Prints this output:
+// Operation status: Done
+// 12 chars read; 15 bytes written.
+
+/*
+ * You can also use APIs like Encoding.UTF8 to convert it back from UTF-8 to UTF-16.
+ */
+
+string convertedBackToUtf16 = Encoding.UTF8.GetString(slicedUtf8Bytes);
+Console.WriteLine($"Converted back: {convertedBackToUtf16}");
+
+// Prints this output:
+// Converted back: ¿Cómo estás?
+```
+
+In this example, the `FromUtf16` method returns `OperationStatus.Done` because it consumes all 12 input chars and transcodes them to the destination buffer. It reports writing 15 bytes to the destination buffer, so the caller must slice the destination buffer to this size before operating on its contents. The remainder of the destination buffer beyond these 15 bytes does not contain useful data.
+
+> [!NOTE]
+> This demonstrates a key concept in UTF-16 to UTF-8 transcoding: the data might expand during the conversion. That is, the number of bytes required in `destination` might be greater than the number of input chars in `source`.
+>
+> For the `FromUtf16` method, the worst-case expansion is that every input char from `source` results in 3 bytes being written to `destination`. That is, as long as `destination.Length >= checked(source.Length * 3)` holds, this method will never return `OperationStatus.DestinationTooSmall`.
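The worst-case sizing rule from the note above can be sketched as a small helper. This is a minimal illustration, not part of the documented API: `TranscodeToUtf8` is a hypothetical name, and the sketch assumes the input is small enough that allocating three bytes per char is acceptable.

```cs
using System;
using System.Buffers;
using System.Text.Unicode;

// Hypothetical helper (not part of the API): transcodes a UTF-16 string into a
// right-sized UTF-8 byte array by first allocating a worst-case scratch buffer.
static byte[] TranscodeToUtf8(string input)
{
    // Worst case: every UTF-16 char expands to 3 UTF-8 bytes.
    byte[] scratch = new byte[checked(input.Length * 3)];

    OperationStatus status = Utf8.FromUtf16(input, scratch, out int charsRead, out int bytesWritten);

    // With a worst-case buffer and the default arguments, Done is the only expected status.
    if (status != OperationStatus.Done || charsRead != input.Length)
    {
        throw new InvalidOperationException($"Unexpected status: {status}");
    }

    // Keep only the bytes that were actually written.
    return scratch.AsSpan(0, bytesWritten).ToArray();
}

byte[] utf8 = TranscodeToUtf8("¿Cómo estás?");
Console.WriteLine($"{utf8.Length} bytes written."); // 15 bytes, as in the sample above
```

For large inputs, a pooled or chunked buffer is usually a better fit than a single worst-case allocation.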
+
+### Handling inadequately sized destination buffers
+
+If the destination buffer is not large enough to hold the transcoded contents of the source buffer, `FromUtf16` returns `OperationStatus.DestinationTooSmall`. The following example demonstrates this scenario.
+
+```cs
+// Intentionally allocate a too-small destination buffer.
+Span<byte> utf8DestinationBytes = new byte[12];
+string utf16InputChars = "¿Cómo estás?"; // "How are you?" in Spanish
+OperationStatus opStatus = Utf8.FromUtf16(utf16InputChars, utf8DestinationBytes, out int charsRead, out int bytesWritten);
+Console.WriteLine($"Operation status: {opStatus}");
+Console.WriteLine($"{charsRead} chars read; {bytesWritten} bytes written.");
+
+// Prints this output:
+// Operation status: DestinationTooSmall
+// 9 chars read; 11 bytes written.
+```
+
+In this case, `FromUtf16` successfully transcoded the first 9 chars of the input (`"¿Cómo est"`) into 11 bytes and placed them in the destination buffer. The last unused byte in the destination buffer does not contain useful data. Transcoding the next character (`'á'`) cannot take place because it requires 2 bytes in the destination buffer, and only 1 byte remains.
+
+A typical code pattern here is to call this method in a loop, writing the transcoded output to a stream in chunks. The following example demonstrates this process.
+
+```cs
+MemoryStream outputStream = new MemoryStream();
+string stringToWrite = "Hello world!";
+await WriteStringToStreamAsync(stringToWrite, outputStream);
+
+async Task WriteStringToStreamAsync(string dataToWrite, Stream outputStream)
+{
+    // For this example we'll use a 1024-byte scratch buffer, but you can
+    // use pooled arrays or a differently-sized buffer depending on your
+    // use cases. As long as the buffer is >= 4 bytes in length, it will
+    // make forward progress on every iteration.
+    byte[] scratchBuffer = new byte[1024];
+
+    ReadOnlyMemory<char> remainingData = dataToWrite.AsMemory();
+    while (!remainingData.IsEmpty)
+    {
+        OperationStatus opStatus = Utf8.FromUtf16(remainingData.Span, scratchBuffer, out int charsRead, out int bytesWritten);
+        Debug.Assert(opStatus == OperationStatus.Done || opStatus == OperationStatus.DestinationTooSmall);
+        Debug.Assert(bytesWritten > 0, "Scratch buffer is too small for loop to make forward progress.");
+
+        await outputStream.WriteAsync(scratchBuffer.AsMemory(0, bytesWritten));
+        remainingData = remainingData.Slice(charsRead);
+    }
+}
+```
+
+### Handling invalid UTF-16 input data
+
+The `replaceInvalidSequences` argument controls whether `FromUtf16` fixes up invalid UTF-16 sequences in the source buffer. The `replaceInvalidSequences` argument defaults to `true`. This means that by default, any invalid UTF-16 sequences in the source are replaced with the 3-byte UTF-8 sequence `[ EF BF BD ]` in the destination buffer. See `Rune.ReplacementChar` for more information.
+
+The following example demonstrates this replace-by-default behavior.
+
+```cs
+Span<byte> utf8DestinationBytes = new byte[128];
+string utf16InputChars = "AB\ud800YZ";
+
+OperationStatus opStatus = Utf8.FromUtf16(utf16InputChars, utf8DestinationBytes, out int charsRead, out int bytesWritten);
+Console.WriteLine($"Operation status: {opStatus}");
+Console.WriteLine($"{charsRead} chars read; {bytesWritten} bytes written.");
+
+utf8DestinationBytes = utf8DestinationBytes.Slice(0, bytesWritten);
+for (int i = 0; i < utf8DestinationBytes.Length; i++)
+{
+    Console.WriteLine($"utf8DestinationBytes[{i}] = 0x{utf8DestinationBytes[i]:X2}");
+}
+
+// Prints this output:
+// Operation status: Done
+// 5 chars read; 7 bytes written.
+// utf8DestinationBytes[0] = 0x41
+// utf8DestinationBytes[1] = 0x42
+// utf8DestinationBytes[2] = 0xEF
+// utf8DestinationBytes[3] = 0xBF
+// utf8DestinationBytes[4] = 0xBD
+// utf8DestinationBytes[5] = 0x59
+// utf8DestinationBytes[6] = 0x5A
+```
+
+In the output, the leading `"AB"` is successfully transcoded into its UTF-8 representation `[ 41 42 ]`. However, the standalone high surrogate char `'\ud800'` cannot be represented in UTF-8, so the replacement character sequence `[ EF BF BD ]` is written to the destination instead. Finally, the trailing `"YZ"` transcodes successfully to `[ 59 5A ]` and is written to the destination.
+
+If you set `replaceInvalidSequences` to `false`, substitution of ill-formed input data does not take place. Instead, the `FromUtf16` method stops processing input immediately upon seeing ill-formed input data and returns `OperationStatus.InvalidData`, as shown in the following example.
+
+```cs
+Span<byte> utf8DestinationBytes = new byte[128];
+string utf16InputChars = "AB\ud800YZ";
+
+OperationStatus opStatus = Utf8.FromUtf16(utf16InputChars, utf8DestinationBytes, out int charsRead, out int bytesWritten, replaceInvalidSequences: false);
+Console.WriteLine($"Operation status: {opStatus}");
+Console.WriteLine($"{charsRead} chars read; {bytesWritten} bytes written.");
+
+utf8DestinationBytes = utf8DestinationBytes.Slice(0, bytesWritten);
+for (int i = 0; i < utf8DestinationBytes.Length; i++)
+{
+    Console.WriteLine($"utf8DestinationBytes[{i}] = 0x{utf8DestinationBytes[i]:X2}");
+}
+
+// Prints this output:
+// Operation status: InvalidData
+// 2 chars read; 2 bytes written.
+// utf8DestinationBytes[0] = 0x41
+// utf8DestinationBytes[1] = 0x42
+```
+
+This demonstrates that the `FromUtf16` method was able to process 2 chars from the input (writing 2 bytes to the destination) before it encountered ill-formed input. The caller may fix up the input, throw an exception, or take any other appropriate action.
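Because `charsRead` counts the chars consumed before the ill-formed data, a caller can use it to locate the offending char. The following is a minimal sketch under that assumption; `IndexOfFirstInvalidChar` is a hypothetical helper name, not part of the API.

```cs
using System;
using System.Buffers;
using System.Text.Unicode;

// Hypothetical helper (not part of the API): returns the index of the first
// ill-formed UTF-16 char in the input, or -1 if the input is well-formed.
static int IndexOfFirstInvalidChar(string input)
{
    // Worst-case destination size, so DestinationTooSmall cannot occur.
    byte[] scratch = new byte[checked(input.Length * 3)];

    OperationStatus status = Utf8.FromUtf16(
        input, scratch, out int charsRead, out _, replaceInvalidSequences: false);

    // On InvalidData, charsRead is the number of chars transcoded before the
    // ill-formed sequence, which is also the index where that sequence starts.
    return status == OperationStatus.InvalidData ? charsRead : -1;
}

Console.WriteLine(IndexOfFirstInvalidChar("AB\ud800YZ")); // 2, matching "2 chars read" above
Console.WriteLine(IndexOfFirstInvalidChar("ABYZ"));       // -1
```

A caller could then report the index in an exception message, or repair the input at that position and retry.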
+
+> [!NOTE]
+> When `replaceInvalidSequences` is set to its default value of `true`, the `FromUtf16` method will never return `OperationStatus.InvalidData`.
+
+### Handling input data split across discontiguous buffers
+
+The `isFinalBlock` argument controls whether `FromUtf16` treats the entire input as fully self-contained. The `isFinalBlock` argument defaults to `true`. This means that by default, any incomplete UTF-16 data (a standalone high surrogate) at the end of the input buffer is treated as invalid. Such data is either replaced with `U+FFFD` or causes `FromUtf16` to return `OperationStatus.InvalidData`, depending on the value of the `replaceInvalidSequences` argument, as described earlier.
+
+The following example demonstrates the default behavior, where both `isFinalBlock` and `replaceInvalidSequences` are `true`.
+
+```cs
+Span<byte> utf8DestinationBytes = new byte[128];
+string utf16InputChars = "AB\ud800";
+
+OperationStatus opStatus = Utf8.FromUtf16(utf16InputChars, utf8DestinationBytes, out int charsRead, out int bytesWritten);
+Console.WriteLine($"Operation status: {opStatus}");
+Console.WriteLine($"{charsRead} chars read; {bytesWritten} bytes written.");
+
+utf8DestinationBytes = utf8DestinationBytes.Slice(0, bytesWritten);
+for (int i = 0; i < utf8DestinationBytes.Length; i++)
+{
+    Console.WriteLine($"utf8DestinationBytes[{i}] = 0x{utf8DestinationBytes[i]:X2}");
+}
+
+// Prints this output:
+// Operation status: Done
+// 3 chars read; 5 bytes written.
+// utf8DestinationBytes[0] = 0x41
+// utf8DestinationBytes[1] = 0x42
+// utf8DestinationBytes[2] = 0xEF
+// utf8DestinationBytes[3] = 0xBF
+// utf8DestinationBytes[4] = 0xBD
+```
+
+In the output, the leading `"AB"` is successfully transcoded to its UTF-8 representation `[ 41 42 ]`. There's a standalone high surrogate char at the end of the input. However, since `isFinalBlock` defaults to `true`, this indicates to the `FromUtf16` method that there's no more data in the input: no future call will supply the matching low surrogate char.
`FromUtf16` then treats this as an invalid UTF-16 sequence, and since `replaceInvalidSequences` also defaults to `true`, the substitution sequence `[ EF BF BD ]` is written to the destination.
+
+If `isFinalBlock` keeps its default value of `true` but `replaceInvalidSequences` is set to `false`, then a standalone high surrogate char at the end of the input causes `FromUtf16` to return `OperationStatus.InvalidData`, as the following example shows.
+
+```cs
+Span<byte> utf8DestinationBytes = new byte[128];
+string utf16InputChars = "AB\ud800";
+
+OperationStatus opStatus = Utf8.FromUtf16(utf16InputChars, utf8DestinationBytes, out int charsRead, out int bytesWritten, replaceInvalidSequences: false);
+Console.WriteLine($"Operation status: {opStatus}");
+Console.WriteLine($"{charsRead} chars read; {bytesWritten} bytes written.");
+
+utf8DestinationBytes = utf8DestinationBytes.Slice(0, bytesWritten);
+for (int i = 0; i < utf8DestinationBytes.Length; i++)
+{
+    Console.WriteLine($"utf8DestinationBytes[{i}] = 0x{utf8DestinationBytes[i]:X2}");
+}
+
+// Prints this output:
+// Operation status: InvalidData
+// 2 chars read; 2 bytes written.
+// utf8DestinationBytes[0] = 0x41
+// utf8DestinationBytes[1] = 0x42
+```
+
+This demonstrates that the `FromUtf16` method was able to process 2 chars from the input (writing 2 bytes to the destination) before it encountered the standalone high surrogate char at the end of the input. The caller may fix up the input, throw an exception, or take any other appropriate action.
+
+Sometimes the application doesn't have all of the input text in a single contiguous buffer. Perhaps the app is dealing with gigantic documents, and it would rather represent this data through an array of buffers (a `char[][]`, perhaps) instead of a single giant `string` instance.
+
+If `isFinalBlock` is set to `false`, this tells `FromUtf16` that the input argument doesn't represent the entirety of the remaining data.
The `FromUtf16` method shouldn't treat a high surrogate char at the end of the input as invalid, as the next buffer could begin with a matching low surrogate char. In this case, the method returns `OperationStatus.NeedMoreData`, as the following example shows.
+
+```cs
+Span<byte> utf8DestinationBytes = new byte[128];
+string utf16InputChars = "AB\ud800";
+
+OperationStatus opStatus = Utf8.FromUtf16(utf16InputChars, utf8DestinationBytes, out int charsRead, out int bytesWritten, isFinalBlock: false);
+Console.WriteLine($"Operation status: {opStatus}");
+Console.WriteLine($"{charsRead} chars read; {bytesWritten} bytes written.");
+
+utf8DestinationBytes = utf8DestinationBytes.Slice(0, bytesWritten);
+for (int i = 0; i < utf8DestinationBytes.Length; i++)
+{
+    Console.WriteLine($"utf8DestinationBytes[{i}] = 0x{utf8DestinationBytes[i]:X2}");
+}
+
+// Prints this output:
+// Operation status: NeedMoreData
+// 2 chars read; 2 bytes written.
+// utf8DestinationBytes[0] = 0x41
+// utf8DestinationBytes[1] = 0x42
+```
+
+In this example, `FromUtf16` was able to process 2 input chars and generate 2 output bytes, but then it encountered partial data (a standalone high surrogate) at the end of the input buffer. More input is needed before a determination can be made as to whether this data is valid or invalid.
+
+> [!NOTE]
+> The `FromUtf16` method is stateless, meaning it does not keep track of input buffer contents between calls. If this method returns `OperationStatus.NeedMoreData`, it is up to the caller to stitch together the remainder of the current input buffer with the contents of the next input buffer before calling `FromUtf16` again.
+>
+> When `isFinalBlock` is set to its default value of `true`, the `FromUtf16` method will never return `OperationStatus.NeedMoreData`.
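The stitching described in the note can be sketched as follows. This is one possible approach rather than a prescribed pattern: `TranscodeChunks` is a hypothetical helper, the `char[][]` input shape follows the example mentioned earlier, and the sketch copies buffers freely for clarity instead of performance.

```cs
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Text.Unicode;

// Hypothetical helper (not part of the API): transcodes UTF-16 text that is split
// across several buffers, carrying unconsumed trailing chars into the next call.
static byte[] TranscodeChunks(char[][] chunks)
{
    var output = new List<byte>();
    char[] carry = Array.Empty<char>();

    for (int i = 0; i < chunks.Length; i++)
    {
        bool isFinalBlock = i == chunks.Length - 1;

        // Stitch leftover chars from the previous call onto the current chunk.
        char[] input = new char[carry.Length + chunks[i].Length];
        carry.CopyTo(input, 0);
        chunks[i].CopyTo(input, carry.Length);

        // Worst-case destination size, so DestinationTooSmall cannot occur.
        byte[] scratch = new byte[checked(input.Length * 3)];
        OperationStatus status = Utf8.FromUtf16(
            input, scratch, out int charsRead, out int bytesWritten, isFinalBlock: isFinalBlock);

        if (status != OperationStatus.Done && status != OperationStatus.NeedMoreData)
        {
            throw new InvalidOperationException($"Unexpected status: {status}");
        }

        output.AddRange(scratch.AsSpan(0, bytesWritten).ToArray());

        // On NeedMoreData, the unread tail (a lone high surrogate) is retried
        // together with the next chunk. On Done, the tail is empty.
        carry = input.AsSpan(charsRead).ToArray();
    }

    return output.ToArray();
}

// The surrogate pair for U+1F436 is split across the two chunks.
byte[] utf8 = TranscodeChunks(new[] { "AB\ud83d".ToCharArray(), "\udc36Z".ToCharArray() });
Console.WriteLine(BitConverter.ToString(utf8)); // 41-42-F0-9F-90-B6-5A
```

Only the last chunk is passed with `isFinalBlock: true`, so a high surrogate that is still unmatched at the very end goes through the usual `U+FFFD` substitution.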
 ]]>
diff --git a/xml/System.Text/Rune.xml b/xml/System.Text/Rune.xml
index 6095a96ddd0..73086d2b235 100644
--- a/xml/System.Text/Rune.xml
+++ b/xml/System.Text/Rune.xml
@@ -452,7 +452,88 @@ For similar types in other programming languages, see [Rust's primitive `char` t
 ## Remarks

-The general convention is to call this method in a loop, slicing the `source` buffer by `charsConsumed` elements on each iteration of the loop. On each iteration of the loop, `result` contains the real scalar value if the data was successfully decoded, or it contains `Rune.ReplacementChar` if the data was not successfully decoded. This pattern provides convenient automatic U+FFFD substitution of invalid sequences while iterating through the loop.
+The general convention is to call this method in a loop, slicing the `source` buffer by `charsConsumed` elements on each iteration of the loop. On each iteration of the loop, `result` contains the real scalar value if the data was successfully decoded, or it contains `Rune.ReplacementChar` if the data could not be successfully decoded. This pattern provides convenient automatic `U+FFFD` substitution of invalid sequences while iterating through the loop.
+
+> [!CAUTION]
+> When calling this method in a loop and slicing the `source` span, use the returned `charsConsumed` value instead of the returned `result`'s `Utf16SequenceLength` property.
+>
+> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications that initially call `DecodeFromUtf16` but are later refactored to call `DecodeFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in `DecodeFromUtf8` for more information.
+
+The following sample demonstrates calling this method in a loop, printing each `Rune` present in the input.
+
+```cs
+using System.Buffers;
+using System.Text;
+
+char[] fullInput = new char[]
+{
+    '\u0050', // U+0050 LATIN CAPITAL LETTER P
+    '\uD83D', '\uDC36', // U+1F436 DOG FACE
+    '\u00E4', // U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
+    '\uD800', // standalone high surrogate (invalid)
+    '\u00C0', // U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
+    '\uDFFF', // standalone low surrogate (invalid)
+    '\u2673', // U+2673 RECYCLING SYMBOL FOR TYPE-1 PLASTICS
+};
+
+ReadOnlySpan<char> remainingInput = fullInput;
+while (!remainingInput.IsEmpty)
+{
+    // Decode
+    OperationStatus opStatus = Rune.DecodeFromUtf16(remainingInput, out Rune result, out int charsConsumed);
+
+    // Print information
+    Console.WriteLine($"Read {charsConsumed} char(s): {CharsToEscapedString(remainingInput.Slice(0, charsConsumed))}");
+    Console.WriteLine($">> OperationStatus = {opStatus}");
+    Console.WriteLine($">> Rune = U+{result.Value:X4} ({result})");
+    Console.WriteLine();
+
+    // Slice and loop again
+    remainingInput = remainingInput.Slice(charsConsumed);
+}
+
+string CharsToEscapedString(ReadOnlySpan<char> chars)
+{
+    StringBuilder builder = new StringBuilder();
+    builder.Append("[ ");
+    foreach (char ch in chars)
+    {
+        builder.Append($"'\\u{(int)ch:X4}' ");
+    }
+    builder.Append(']');
+    return builder.ToString();
+}
+
+// Prints:
+//
+// Read 1 char(s): [ '\u0050' ]
+// >> OperationStatus = Done
+// >> Rune = U+0050 (P)
+//
+// Read 2 char(s): [ '\uD83D' '\uDC36' ]
+// >> OperationStatus = Done
+// >> Rune = U+1F436 (🐶)
+//
+// Read 1 char(s): [ '\u00E4' ]
+// >> OperationStatus = Done
+// >> Rune = U+00E4 (ä)
+//
+// Read 1 char(s): [ '\uD800' ]
+// >> OperationStatus = InvalidData
+// >> Rune = U+FFFD (�)
+//
+// Read 1 char(s): [ '\u00C0' ]
+// >> OperationStatus = Done
+// >> Rune = U+00C0 (À)
+//
+// Read 1 char(s): [ '\uDFFF' ]
+// >> OperationStatus = InvalidData
+// >> Rune = U+FFFD (�)
+//
+// Read 1 char(s): [ '\u2673' ]
+// >> OperationStatus = Done
+// >> Rune = U+2673 (♳)
+```
 ]]>
@@ -499,7 +580,88 @@ The general convention is to call this method in a loop, slicing the
`source` bu
 ## Remarks

-The general convention is to call this method in a loop, slicing the `source` buffer by `bytesConsumed` elements on each iteration of the loop. On each iteration of the loop, `result` contains the real scalar value if successfully decoded, or it contains `Rune.ReplacementChar` if the data could not be successfully decoded. This pattern provides convenient automatic U+FFFD substitution of invalid sequences while iterating through the loop.
+The general convention is to call this method in a loop, slicing the `source` buffer by `bytesConsumed` elements on each iteration of the loop. On each iteration of the loop, `result` contains the real scalar value if the data was successfully decoded, or it contains `Rune.ReplacementChar` if the data could not be successfully decoded. This pattern provides convenient automatic `U+FFFD` substitution of invalid sequences while iterating through the loop.
+
+> [!CAUTION]
+> When calling this method in a loop and slicing the `source` span, use the returned `bytesConsumed` value instead of the returned `result`'s `Utf8SequenceLength` property.
+>
+> This is because invalid UTF-8 sequences are substituted on the fly with `Rune.ReplacementChar`, and the replacement character's `Utf8SequenceLength` property always returns `3`, corresponding to the replacement sequence `[ EF BF BD ]`. However, the invalid UTF-8 byte sequence as present in the input span can be anywhere from 1 to 3 bytes in length. The returned `bytesConsumed` value always contains the actual number of bytes consumed from the input span.
+
+The following sample demonstrates calling this method in a loop, printing each `Rune` present in the input.
+
+```cs
+using System.Buffers;
+using System.Text;
+
+byte[] fullInput = new byte[]
+{
+    0x50, // U+0050 LATIN CAPITAL LETTER P
+    0xF0, 0x9F, 0x90, 0xB6, // U+1F436 DOG FACE
+    0xC3, 0xA4, // U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
+    0xFF, // invalid byte (0xFF never appears in well-formed UTF-8)
+    0xC3, 0x80, // U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
+    0xE1, 0x80, // incomplete 3-byte sequence (invalid)
+    0xE2, 0x99, 0xB3 // U+2673 RECYCLING SYMBOL FOR TYPE-1 PLASTICS
+};
+
+ReadOnlySpan<byte> remainingInput = fullInput;
+while (!remainingInput.IsEmpty)
+{
+    // Decode
+    OperationStatus opStatus = Rune.DecodeFromUtf8(remainingInput, out Rune result, out int bytesConsumed);
+
+    // Print information
+    Console.WriteLine($"Read {bytesConsumed} byte(s): {BytesToHexString(remainingInput.Slice(0, bytesConsumed))}");
+    Console.WriteLine($">> OperationStatus = {opStatus}");
+    Console.WriteLine($">> Rune = U+{result.Value:X4} ({result})");
+    Console.WriteLine();
+
+    // Slice and loop again
+    remainingInput = remainingInput.Slice(bytesConsumed);
+}
+
+string BytesToHexString(ReadOnlySpan<byte> bytes)
+{
+    StringBuilder builder = new StringBuilder();
+    builder.Append("[ ");
+    foreach (byte b in bytes)
+    {
+        builder.Append($"{b:X2} ");
+    }
+    builder.Append(']');
+    return builder.ToString();
+}
+
+// Prints:
+//
+// Read 1 byte(s): [ 50 ]
+// >> OperationStatus = Done
+// >> Rune = U+0050 (P)
+//
+// Read 4 byte(s): [ F0 9F 90 B6 ]
+// >> OperationStatus = Done
+// >> Rune = U+1F436 (🐶)
+//
+// Read 2 byte(s): [ C3 A4 ]
+// >> OperationStatus = Done
+// >> Rune = U+00E4 (ä)
+//
+// Read 1 byte(s): [ FF ]
+// >> OperationStatus = InvalidData
+// >> Rune = U+FFFD (�)
+//
+// Read 2 byte(s): [ C3 80 ]
+// >> OperationStatus = Done
+// >> Rune = U+00C0 (À)
+//
+// Read 2 byte(s): [ E1 80 ]
+// >> OperationStatus = InvalidData
+// >> Rune = U+FFFD (�)
+//
+// Read 3 byte(s): [ E2 99 B3 ]
+// >> OperationStatus = Done
+// >> Rune = U+2673 (♳)
+```
 ]]>
@@ -546,7 +708,90 @@ The general convention is to call this method in a loop, slicing the `source` bu
 ## Remarks

-This method is
very similar to `Rune.DecodeFromUtf16`, except it allows the caller to loop backward instead of forward. The typical calling convention is that on each iteration of the loop, the caller should slice off the final `charsConsumed` elements of the `source` buffer.
+This method is very similar to `Rune.DecodeFromUtf16`, except it allows the caller to loop backward instead of forward.
+
+The general convention is to call this method in a loop, slicing the final `charsConsumed` elements off the `source` buffer on each iteration of the loop. On each iteration of the loop, `result` contains the real scalar value if the data was successfully decoded, or it contains `Rune.ReplacementChar` if the data could not be successfully decoded. This pattern provides convenient automatic `U+FFFD` substitution of invalid sequences while iterating through the loop.
+
+> [!CAUTION]
+> When calling this method in a loop and slicing the `source` span, use the returned `charsConsumed` value instead of the returned `result`'s `Utf16SequenceLength` property.
+>
+> While these two values will be identical for UTF-16 scenarios, they are not guaranteed to be identical for UTF-8 scenarios. This could cause subtle bugs in applications that initially call `DecodeLastFromUtf16` but are later refactored to call `DecodeLastFromUtf8`. Using `charsConsumed` as an argument to the slice routine helps avoid this pitfall. See the Remarks section in `DecodeLastFromUtf8` for more information.
+
+The following sample demonstrates calling this method in a loop, printing each `Rune` present in the input.
+
+```cs
+using System.Buffers;
+using System.Text;
+
+char[] fullInput = new char[]
+{
+    '\u0050', // U+0050 LATIN CAPITAL LETTER P
+    '\uD83D', '\uDC36', // U+1F436 DOG FACE
+    '\u00E4', // U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
+    '\uD800', // standalone high surrogate (invalid)
+    '\u00C0', // U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
+    '\uDFFF', // standalone low surrogate (invalid)
+    '\u2673', // U+2673 RECYCLING SYMBOL FOR TYPE-1 PLASTICS
+};
+
+ReadOnlySpan<char> remainingInput = fullInput;
+while (!remainingInput.IsEmpty)
+{
+    // Decode
+    OperationStatus opStatus = Rune.DecodeLastFromUtf16(remainingInput, out Rune result, out int charsConsumed);
+
+    // Print information
+    Console.WriteLine($"Read {charsConsumed} char(s): {CharsToEscapedString(remainingInput[^charsConsumed..])}");
+    Console.WriteLine($">> OperationStatus = {opStatus}");
+    Console.WriteLine($">> Rune = U+{result.Value:X4} ({result})");
+    Console.WriteLine();
+
+    // Slice and loop again
+    remainingInput = remainingInput[..^charsConsumed];
+}
+
+string CharsToEscapedString(ReadOnlySpan<char> chars)
+{
+    StringBuilder builder = new StringBuilder();
+    builder.Append("[ ");
+    foreach (char ch in chars)
+    {
+        builder.Append($"'\\u{(int)ch:X4}' ");
+    }
+    builder.Append(']');
+    return builder.ToString();
+}
+
+// Prints:
+//
+// Read 1 char(s): [ '\u2673' ]
+// >> OperationStatus = Done
+// >> Rune = U+2673 (♳)
+//
+// Read 1 char(s): [ '\uDFFF' ]
+// >> OperationStatus = InvalidData
+// >> Rune = U+FFFD (�)
+//
+// Read 1 char(s): [ '\u00C0' ]
+// >> OperationStatus = Done
+// >> Rune = U+00C0 (À)
+//
+// Read 1 char(s): [ '\uD800' ]
+// >> OperationStatus = NeedMoreData
+// >> Rune = U+FFFD (�)
+//
+// Read 1 char(s): [ '\u00E4' ]
+// >> OperationStatus = Done
+// >> Rune = U+00E4 (ä)
+//
+// Read 2 char(s): [ '\uD83D' '\uDC36' ]
+// >> OperationStatus = Done
+// >> Rune = U+1F436 (🐶)
+//
+// Read 1 char(s): [ '\u0050' ]
+// >> OperationStatus = Done
+// >> Rune = U+0050 (P)
+```
 ]]>
@@ -593,7 +838,90 @@ This method is very similar to , except it allows the caller to loop
backward instead of forward. The typical calling convention is that on each iteration of the loop, the caller should slice off the final `bytesConsumed` elements of the `source` buffer.
+This method is very similar to `Rune.DecodeFromUtf8`, except it allows the caller to loop backward instead of forward.
+
+The general convention is to call this method in a loop, slicing the final `bytesConsumed` elements off the `source` buffer on each iteration of the loop. On each iteration of the loop, `result` contains the real scalar value if the data was successfully decoded, or it contains `Rune.ReplacementChar` if the data could not be successfully decoded. This pattern provides convenient automatic `U+FFFD` substitution of invalid sequences while iterating through the loop.
+
+> [!CAUTION]
+> When calling this method in a loop and slicing the `source` span, use the returned `bytesConsumed` value instead of the returned `result`'s `Utf8SequenceLength` property.
+>
+> This is because invalid UTF-8 sequences are substituted on the fly with `Rune.ReplacementChar`, and the replacement character's `Utf8SequenceLength` property always returns `3`, corresponding to the replacement sequence `[ EF BF BD ]`. However, the invalid UTF-8 byte sequence as present in the input span can be anywhere from 1 to 3 bytes in length. The returned `bytesConsumed` value always contains the actual number of bytes consumed from the input span.
+
+The following sample demonstrates calling this method in a loop, printing each `Rune` present in the input.
+
+```cs
+using System.Buffers;
+using System.Text;
+
+byte[] fullInput = new byte[]
+{
+    0x50, // U+0050 LATIN CAPITAL LETTER P
+    0xF0, 0x9F, 0x90, 0xB6, // U+1F436 DOG FACE
+    0xC3, 0xA4, // U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
+    0xFF, // invalid byte (0xFF never appears in well-formed UTF-8)
+    0xC3, 0x80, // U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
+    0xE1, 0x80, // incomplete 3-byte sequence (invalid)
+    0xE2, 0x99, 0xB3 // U+2673 RECYCLING SYMBOL FOR TYPE-1 PLASTICS
+};
+
+ReadOnlySpan<byte> remainingInput = fullInput;
+while (!remainingInput.IsEmpty)
+{
+    // Decode
+    OperationStatus opStatus = Rune.DecodeLastFromUtf8(remainingInput, out Rune result, out int bytesConsumed);
+
+    // Print information
+    Console.WriteLine($"Read {bytesConsumed} byte(s): {BytesToHexString(remainingInput[^bytesConsumed..])}");
+    Console.WriteLine($">> OperationStatus = {opStatus}");
+    Console.WriteLine($">> Rune = U+{result.Value:X4} ({result})");
+    Console.WriteLine();
+
+    // Slice and loop again
+    remainingInput = remainingInput[..^bytesConsumed];
+}
+
+string BytesToHexString(ReadOnlySpan<byte> bytes)
+{
+    StringBuilder builder = new StringBuilder();
+    builder.Append("[ ");
+    foreach (byte b in bytes)
+    {
+        builder.Append($"{b:X2} ");
+    }
+    builder.Append(']');
+    return builder.ToString();
+}
+
+// Prints:
+//
+// Read 3 byte(s): [ E2 99 B3 ]
+// >> OperationStatus = Done
+// >> Rune = U+2673 (♳)
+//
+// Read 2 byte(s): [ E1 80 ]
+// >> OperationStatus = NeedMoreData
+// >> Rune = U+FFFD (�)
+//
+// Read 2 byte(s): [ C3 80 ]
+// >> OperationStatus = Done
+// >> Rune = U+00C0 (À)
+//
+// Read 1 byte(s): [ FF ]
+// >> OperationStatus = InvalidData
+// >> Rune = U+FFFD (�)
+//
+// Read 2 byte(s): [ C3 A4 ]
+// >> OperationStatus = Done
+// >> Rune = U+00E4 (ä)
+//
+// Read 4 byte(s): [ F0 9F 90 B6 ]
+// >> OperationStatus = Done
+// >> Rune = U+1F436 (🐶)
+//
+// Read 1 byte(s): [ 50 ]
+// >> OperationStatus = Done
+// >> Rune = U+0050 (P)
+```
 ]]>
@@ -1778,7 +2106,76 @@ For more information, see
 Gets a instance that represents the Unicode replacement character U+FFFD.
 A instance that represents the Unicode replacement character U+FFFD.
- To be added.
+
+> [!NOTE]
+> When a substitution occurs, it results in data loss. For example, _both_ the invalid UTF-8 sequence `[ 41 42 C0 59 5A ]` _and_ the invalid UTF-8 sequence `[ 41 42 C1 59 5A ]` will, upon transcoding to UTF-16 with substitution, result in the exact same output string `"AB\ufffdYZ"`. There is no mechanism to reverse this and recover the original invalid bytes.
+>
+> The mere presence of `'\uFFFD'` in a UTF-16 string or `[ EF BF BD ]` in a UTF-8 string should not be construed as evidence that the original input was ill-formed. It's perfectly legal for an input UTF-16 string to already contain the replacement character `'\uFFFD'` or for an input UTF-8 string to already contain `[ EF BF BD ]`, and they will freely and losslessly convert between each other.
+
+For more information on use of `U+FFFD` as a substitution character for ill-formed input sequences, see [the Unicode Standard, Ch. 3](https://www.unicode.org/versions/latest/ch03.pdf).
+
+ ]]>