Skip to content

Comments

Add guidelines on returning string offsets & lengths#521

Merged
mikekistler merged 2 commits intomicrosoft:vNextfrom
mikekistler:string-index
Feb 4, 2024
Merged

Add guidelines on returning string offsets & lengths#521
mikekistler merged 2 commits intomicrosoft:vNextfrom
mikekistler:string-index

Conversation

@mikekistler
Copy link
Member

This PR splits out the update for string offset and length from #517. I also reworked things a bit by moving the explanatory content over to ConsiderationsForServiceDesign.

It looks like my editor also trimmed some trailing whitespace from otherwise unchanged lines.

"offset": {
"utf8": 12,
"utf16": 10,
      "codePoint": 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, we seems got 2 spaces here "codePoint": 4


## Returning String Offsets & Lengths (Substrings)

Some Azure services return substring offset & length values within a string. For example, the offset & length within a string to a name, email address, or phone #.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit phone # seems too informal? Just phone number?

Copy link
Member

@heaths heaths left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few suggestions, but otherwise LGTM.

| UTF-16 | JavaScript, Java, C# |
| CodePoint (UTF-32) | Python |

Because the service doesn't know what language a client is written in and what string encoding that language uses, the service can't return UTF-agnostic offset and length values that the client can use to index within the string. To address this, the service response must include offset & length values for all 3 possible encodings and then the client code must select the encoding it required by its language's internal string encoding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar nit:

Suggested change
Because the service doesn't know what language a client is written in and what string encoding that language uses, the service can't return UTF-agnostic offset and length values that the client can use to index within the string. To address this, the service response must include offset & length values for all 3 possible encodings and then the client code must select the encoding it required by its language's internal string encoding.
Because the service doesn't know in what language a client is written and what string encoding that language uses, the service can't return UTF-agnostic offset and length values that the client can use to index within the string. To address this, the service response must include offset & length values for all 3 possible encodings and then the client code must select the encoding required by its language's internal string encoding.

name := response.fullString[ response.name.offset.utf8 : response.name.offset.utf8 + response.name.length.utf8]
```

The service must calculate the offset & length for all 3 encodings and return them because clients find it difficult working with Unicode encodings and how to convert from one encoding to another. In other words, we do this to simplify client development and ensure customer success when isolating a substring.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also mention that it makes pass-through requests easier as well? That was the thing that really won me over. I think the same was true for @JeffreyRichter, IIRC.

All string values in JSON are inherently Unicode and UTF-8 encoded, but clients written in a high-level programming language must work with strings in that language's string encoding, which may be UTF-8, UTF-16, or CodePoints (UTF-32).
When a service response includes a string offset or length value, it should specify these values in all 3 encodings to simplify client development and ensure customer success when isolating a substring.

<a href="#substrings-return-value-for-each-encoding" name="substrings-return-value-for-each-encoding">:white_check_mark:</a> **DO** include all 3 encodings (UTF-8, UTF-16, and CodePoint) for every string offset or length value in a service response.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should document here in this doc the exact format we want e.g., {"utf8": 2, "utf16": 1, "codePoint":1}. We document formats for LROs, pageables, and errors. How you expanded on that in "Considerations" is perfect, but you should also link to that section e.g.,

Suggested change
<a href="#substrings-return-value-for-each-encoding" name="substrings-return-value-for-each-encoding">:white_check_mark:</a> **DO** include all 3 encodings (UTF-8, UTF-16, and CodePoint) for every string offset or length value in a service response.
<a href="#substrings-return-value-for-each-encoding" name="substrings-return-value-for-each-encoding">:white_check_mark:</a> **DO** include all 3 encodings (UTF-8, UTF-16, and CodePoint) for every string offset or length value in a service response using the schema below. See [considerations](ConsiderationsForServiceDesign.md#{actual-stub-here}) for more information.
```json
{
"length": {
"utf8": 2,
"utf16": 1,
"codePoint": 1
}
}
```

@mikekistler mikekistler merged commit e465ea1 into microsoft:vNext Feb 4, 2024
@mikekistler mikekistler deleted the string-index branch February 4, 2024 12:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants