Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework string length rules #84

Closed
dpytaylo opened this issue Dec 29, 2023 · 3 comments · Fixed by #88
Closed

Rework string length rules #84

dpytaylo opened this issue Dec 29, 2023 · 3 comments · Fixed by #88

Comments

@dpytaylo
Copy link
Contributor

dpytaylo commented Dec 29, 2023

Hello! Thank you for your crate. I highly appreciate it. However, I've noticed one aspect throughout the entire crate that doesn't resonate with me - specifically, the implementation of the string length rule. In my view, this rule could be misleading because:

It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.

(Referenced from here)

One reason I admire Rust is its handling of strings: by default, it provides size in bytes, and additionally, you can get the size in Unicode Scalar Values using the .chars().count() methods, which doesn't always(!) match your idea of what a "character" (as mentioned above). It seemed unfamiliar to me that the length rule doesn't return the size in bytes.

Therefore, I would like to suggest a new approach for strings:

  • Move the length (.chars().count()) implementation to the char_count rule
  • Create a new grapheme_count rule that will available with own feature flag because it will require a crate for working with graphemes (e.g. unicode-segmentation).
  • (Optional) Move the byte_length implementation to the length rule

Downsides of the optional solution: it will significantly break backward compatibility.

@jprochazk
Copy link
Owner

Yeah, I agree that the length validator is misleading right now. I figured that byte length wouldn't be a good default, and I didn't want to add a default dependency to unicode-segmentation because it would roughly quadruple the size of the library.

GitHub code search shows that length is one of the most used rules, but it's very often combined with ascii or alphanumeric, so I think changing its meaning to "byte length" at this point would be fine.

I looked into it a bit more, and I believe .chars().count() shouldn't be used at all when validating length in "unicode characters", and I'm not really sure of any cases where you want to validate the number of USVs.

To summarize, I think we should:

  • Change the meaning of length to byte length
  • Deprecate byte_length with a message that tells people to use length instead.
  • Add a grapheme_count rule behind a unicode feature, implemented using unicode-segmentation.
  • Add a char_count rule which will exist for backwards compatibility, just in case someone was depending on it counting USVs.

For most users, this should not cause much if any churn, and for those who explicitly allow unicode in their strings now have the option to have a much more reliable metric for "character count".

@dpytaylo
Copy link
Contributor Author

Yes, that sounds great! At the moment, I'm working on updating the garde dependencies (axum-msgpack, axum-yaml) to support axum 0.7, and after this, I could try to start implementing it.

@dpytaylo dpytaylo mentioned this issue Jan 2, 2024
4 tasks
@dpytaylo
Copy link
Contributor Author

The #86 pull request was closed due to encountered problems.
#88 should address this issue with the new approach, using length modes (e.g. #[garde(length(mode, condition))])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants