Add guidelines on returning string offsets & lengths #521
mikekistler merged 2 commits into microsoft:vNext from
| "offset": { | ||
| "utf8": 12, | ||
| "utf16": 10, | ||
| "codePoint": 4 |
nit: we seem to have 2 spaces here: "codePoint": 4
| ## Returning String Offsets & Lengths (Substrings) | ||
| Some Azure services return substring offset & length values within a string. For example, the offset & length within a string to a name, email address, or phone #. |
nit: "phone #" seems too informal. Just "phone number"?
heaths left a comment
A few suggestions, but otherwise LGTM.
| | UTF-16 | JavaScript, Java, C# | | ||
| | CodePoint (UTF-32) | Python | | ||
| Because the service doesn't know what language a client is written in and what string encoding that language uses, the service can't return UTF-agnostic offset and length values that the client can use to index within the string. To address this, the service response must include offset & length values for all 3 possible encodings and then the client code must select the encoding it required by its language's internal string encoding. |
Grammar nit:
| Because the service doesn't know what language a client is written in and what string encoding that language uses, the service can't return UTF-agnostic offset and length values that the client can use to index within the string. To address this, the service response must include offset & length values for all 3 possible encodings and then the client code must select the encoding it required by its language's internal string encoding. | |
| Because the service doesn't know in what language a client is written and what string encoding that language uses, the service can't return UTF-agnostic offset and length values that the client can use to index within the string. To address this, the service response must include offset & length values for all 3 possible encodings and then the client code must select the encoding required by its language's internal string encoding. |
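To make the three encodings in this paragraph concrete, here is a minimal Python sketch of how a service might compute the offset and length of a substring in UTF-8 bytes, UTF-16 code units, and code points. The helper names `encoding_units` and `locate` are hypothetical and not part of the guidelines; the sketch also assumes the service already knows which substring it wants to report.

```python
def encoding_units(text: str) -> dict:
    """Measure `text` in UTF-8 bytes, UTF-16 code units, and code points."""
    return {
        "utf8": len(text.encode("utf-8")),
        # Every UTF-16 code unit is 2 bytes, so halve the encoded byte count.
        "utf16": len(text.encode("utf-16-le")) // 2,
        # Python strings are indexed by code point.
        "codePoint": len(text),
    }


def locate(full_string: str, substring: str) -> dict:
    """Describe the first occurrence of `substring` in all three encodings.

    Hypothetical helper for illustration; raises ValueError if absent.
    """
    start = full_string.index(substring)  # code point index
    return {
        # The offset is the size of everything that precedes the substring.
        "offset": encoding_units(full_string[:start]),
        "length": encoding_units(substring),
    }


# U+1F600 takes 4 UTF-8 bytes and 2 UTF-16 code units; U+00E9 takes 2 and 1.
span = locate("\U0001F600\u00e9 Ana", "Ana")
print(span["offset"])  # {'utf8': 7, 'utf16': 4, 'codePoint': 3}
```

The same substring starts at three different numeric offsets, which is why a single UTF-agnostic value cannot serve every client.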
| name := response.fullString[ response.name.offset.utf8 : response.name.offset.utf8 + response.name.length.utf8] | ||
| ``` | ||
| The service must calculate the offset & length for all 3 encodings and return them because clients find it difficult working with Unicode encodings and how to convert from one encoding to another. In other words, we do this to simplify client development and ensure customer success when isolating a substring. |
Should we also mention that it makes pass-through requests easier as well? That was the thing that really won me over. I think the same was true for @JeffreyRichter, IIRC.
| All string values in JSON are inherently Unicode and UTF-8 encoded, but clients written in a high-level programming language must work with strings in that language's string encoding, which may be UTF-8, UTF-16, or CodePoints (UTF-32). | ||
| When a service response includes a string offset or length value, it should specify these values in all 3 encodings to simplify client development and ensure customer success when isolating a substring. | ||
| <a href="#substrings-return-value-for-each-encoding" name="substrings-return-value-for-each-encoding">:white_check_mark:</a> **DO** include all 3 encodings (UTF-8, UTF-16, and CodePoint) for every string offset or length value in a service response. |
I think we should document here in this doc the exact format we want e.g., {"utf8": 2, "utf16": 1, "codePoint":1}. We document formats for LROs, pageables, and errors. How you expanded on that in "Considerations" is perfect, but you should also link to that section e.g.,
| <a href="#substrings-return-value-for-each-encoding" name="substrings-return-value-for-each-encoding">:white_check_mark:</a> **DO** include all 3 encodings (UTF-8, UTF-16, and CodePoint) for every string offset or length value in a service response. | |
| <a href="#substrings-return-value-for-each-encoding" name="substrings-return-value-for-each-encoding">:white_check_mark:</a> **DO** include all 3 encodings (UTF-8, UTF-16, and CodePoint) for every string offset or length value in a service response using the schema below. See [considerations](ConsiderationsForServiceDesign.md#{actual-stub-here}) for more information. | |
| ```json | |
| { | |
| "length": { | |
| "utf8": 2, | |
| "utf16": 1, | |
| "codePoint": 1 | |
| } | |
| } | |
| ``` |
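For the client side, a rough Python sketch of consuming values in this schema might look like the following. It assumes a response shaped like the Go snippet in the diff (a `fullString` plus a `name` object carrying `offset` and `length`); the function names are hypothetical. A Python client would normally use only `codePoint`; the UTF-8 and UTF-16 variants are emulated on encoded bytes to show that all three values isolate the same substring.

```python
def substring_by_code_point(full_string: str, span: dict) -> str:
    """Slice using code point values, Python's native string indexing."""
    start = span["offset"]["codePoint"]
    return full_string[start:start + span["length"]["codePoint"]]


def substring_by_utf8(full_string: str, span: dict) -> str:
    """Slice the UTF-8 bytes, as a client with UTF-8 strings would."""
    data = full_string.encode("utf-8")
    start = span["offset"]["utf8"]
    return data[start:start + span["length"]["utf8"]].decode("utf-8")


def substring_by_utf16(full_string: str, span: dict) -> str:
    """Slice UTF-16 code units (2 bytes each), as a UTF-16 client would."""
    data = full_string.encode("utf-16-le")
    start = 2 * span["offset"]["utf16"]
    end = start + 2 * span["length"]["utf16"]
    return data[start:end].decode("utf-16-le")


# Illustrative response; the values are made up for this example.
response = {
    "fullString": "\U0001F600\u00e9 Ana",
    "name": {
        "offset": {"utf8": 7, "utf16": 4, "codePoint": 3},
        "length": {"utf8": 3, "utf16": 3, "codePoint": 3},
    },
}
print(substring_by_code_point(response["fullString"], response["name"]))  # Ana
```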
This PR splits out the update for string offset and length from #517. I also reworked things a bit by moving the explanatory content over to ConsiderationsForServiceDesign.
It looks like my editor also trimmed some trailing whitespace from otherwise unchanged lines.