Rework string length rules #84

dpytaylo · 2023-12-29T14:49:09Z

Hello! Thank you for your crate. I highly appreciate it. However, I've noticed one aspect throughout the entire crate that doesn't resonate with me - specifically, the implementation of the string length rule. In my view, this rule could be misleading because:

It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.

(Referenced from here)

One reason I admire Rust is its handling of strings: by default, it provides size in bytes, and additionally, you can get the size in Unicode Scalar Values using the .chars().count() methods, which doesn't always(!) match your idea of what a "character" (as mentioned above). It seemed unfamiliar to me that the length rule doesn't return the size in bytes.

Therefore, I would like to suggest a new approach for strings:

Move the length (.chars().count()) implementation to the char_count rule
Create a new grapheme_count rule that will available with own feature flag because it will require a crate for working with graphemes (e.g. unicode-segmentation).
(Optional) Move the byte_length implementation to the length rule

Downsides of the optional solution: it will significantly break backward compatibility.

The text was updated successfully, but these errors were encountered:

jprochazk · 2023-12-29T19:35:27Z

Yeah, I agree that the length validator is misleading right now. I figured that byte length wouldn't be a good default, and I didn't want to add a default dependency to unicode-segmentation because it would roughly quadruple the size of the library.

GitHub code search shows that length is one of the most used rules, but it's very often combined with ascii or alphanumeric, so I think changing its meaning to "byte length" at this point would be fine.

I looked into it a bit more, and I believe .chars().count() shouldn't be used at all when validating length in "unicode characters", and I'm not really sure of any cases where you want to validate the number of USVs.

To summarize, I think we should:

Change the meaning of length to byte length
Deprecate byte_length with a message that tells people to use length instead.
Add a grapheme_count rule behind a unicode feature, implemented using unicode-segmentation.
Add a char_count rule which will exist for backwards compatibility, just in case someone was depending on it counting USVs.

For most users, this should not cause much if any churn, and for those who explicitly allow unicode in their strings now have the option to have a much more reliable metric for "character count".

dpytaylo · 2023-12-29T22:14:33Z

Yes, that sounds great! At the moment, I'm working on updating the garde dependencies (axum-msgpack, axum-yaml) to support axum 0.7, and after this, I could try to start implementing it.

dpytaylo · 2024-01-21T20:58:46Z

The #86 pull request was closed due to encountered problems.
#88 should address this issue with the new approach, using length modes (e.g. #[garde(length(mode, condition))])

dpytaylo mentioned this issue Jan 2, 2024

Rework string length rules #86

Closed

4 tasks

jprochazk mentioned this issue Jan 27, 2024

Length modes #88

Merged

jprochazk closed this as completed in #88 Jan 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework string length rules #84

Rework string length rules #84

dpytaylo commented Dec 29, 2023 •

edited

Loading

jprochazk commented Dec 29, 2023

dpytaylo commented Dec 29, 2023

dpytaylo commented Jan 21, 2024

Rework string length rules #84

Rework string length rules #84

Comments

dpytaylo commented Dec 29, 2023 • edited Loading

jprochazk commented Dec 29, 2023

dpytaylo commented Dec 29, 2023

dpytaylo commented Jan 21, 2024

dpytaylo commented Dec 29, 2023 •

edited

Loading