-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HHH-14032 Parse Locales with scripts properly in LocaleTypeDescriptor.fromString #3410
base: 5.6
Are you sure you want to change the base?
Conversation
9db2c0b
to
0e858dd
Compare
return new Locale( string, new String( chars, position, i - position ), new String( chars, | ||
i + 1, chars.length - i - 1 ) ); | ||
Locale.Builder builder = new Locale.Builder(); | ||
String[] parts = string.split("_"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From performance perspective, we should avoid using String.split()
for it will create RE and might incur unnecessary performance cost. There should be better way to replace the above split()
usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can use the StringHelper#split
for better performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had originally used StringTokenizer, which should have better performance, but its Javadoc says the following:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
That said, even in the latest JDK the class is technically not deprecated, so not sure if the above advice should be strictly heeded.
Looking at the suggested StringHelper#split
method, it actually does use StringTokenizer internally. Depending on how careful we want to be about avoiding extra allocations, we can use StringTokenizer directly and avoid the array allocation. I don't have a strong opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The split
method accepts a regex which gets compiled at runtime, it would be best to declare the regex as a static final constant in the class, and use that to split strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Sanne as mentioned in the above comment, the split method has a fast path when splitting on a single character that avoids compiling a regex.
return new Locale( string, new String( chars, position, i - position ), "" ); | ||
case 3: | ||
if (s.startsWith("#")) { | ||
s = s.substring(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be ideal to add break
for case 3
for best practice. Imagine in the future a new case is needed and it might create a subtle bug that the coder forgets to insert break
for case 3
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The gist of this PR seems to lie in the above case 2
and case 3
. There is subtlety here that if s is the last token or s starts with #
in case 2
, it is equivalent to case 3
. It would be great to add some comment like // intentionally fall through
before case 3
to help future code maintainer to understand the subtlety here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, there must be still room for code readability improvement by clearer logic. The break case falling through is to difficult to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the above code is less than ideal. Adding a break to case 2 would mean duplicating the code from case 3, I don't know which one is the lesser of evils. We could perhaps just avoid the switch altogether and unroll it into 4 if statements.
It could perhaps be rewritten to mimic this pattern as seen in LanguageTag, which would require using something like StringTokenizer. It is unfortunate that this code cannot simply use toLanguageTag / forLanguageTag...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, using StringTokenizer is perfectly fine in Hibernate core codebase (it has been used more than 50 after a quick search). Code readability is a much bigger concern. The current switch case falling through is a clever approach but might make future code maintaining much more difficult.
...ate-core/src/test/java/org/hibernate/test/type/descriptor/java/LocaleTypeDescriptorTest.java
Outdated
Show resolved
Hide resolved
0e858dd
to
ad42116
Compare
assertEquals( toLocale( null, "DE", "ch123", null ), LocaleTypeDescriptor.INSTANCE.fromString( "_DE_ch123" ) ); | ||
assertEquals( toLocale( "de", null, "ch123", null ), LocaleTypeDescriptor.INSTANCE.fromString( "de__ch123" ) ); | ||
assertEquals( toLocale( "de", "DE", "ch123", null ), LocaleTypeDescriptor.INSTANCE.fromString( "de_DE_ch123" ) ); | ||
assertEquals( toLocale( "zh", "HK", null, "Hant"), LocaleTypeDescriptor.INSTANCE.fromString( "zh_HK_#Hant" ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the above testing case is the sole motivation of this PR. Could we avoid touching existing code logic by a simple patch as following:
- test whether the string ends with '#xxx'
- if so, extract the
xxx
asscript
and drop the suffix from the string; - pass the string using original logic.
Seems a better alternative than the current tricky switch implementation. Also, it is easy to understand and verify, so is more ideal from code maintaining perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing code needs to be rewritten regardless, as the only way to set script or extensions is by switching to LocaleBuilder.
Also, looking at the Javadoc / implementation for Locale.toString(), we can see the pattern is different based on whether the Locale has a variant or not (e.g., with a variant, script and extension are separated from the rest with "_#" vs just the "#" when no variant is present). Therefore I think the approach of testing if the string ends with "#" would need a special case for "_" anyway. I'll explore the more generic approach with just using a StringTokenizer or the like, if the main concern is the switch statement here?
case 2: | ||
if (i < parts.length - 1 || !s.startsWith("#")) { | ||
builder.setVariant(s); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to adhere to hibernate code format (see https://github.com/hibernate/hibernate-ide-codestyles). Also, existing code convention uses final
keyword as much as possible.
// No variant given, we just have language and country | ||
return new Locale( string, new String( chars, position, i - position ), "" ); | ||
case 3: | ||
if (s.startsWith("#")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When needing to compare with characters, I'd prefer to use the char API:
string.charAt(0) == '#'
ad42116
to
2f3ae2f
Compare
2f3ae2f
to
15226dd
Compare
Sorry for the delay with this. I've gone with a different approach here that does not use a loop or switch statement at all. One caveat is that I decided not to attempt to parse extensions at all, as that is its own little rabbit hole; this PR strictly adds support for parsing scripts. I am not sure how prevalent bcp47 extensions are in practice, but I am assuming that script is more widely used and important. Extensions could be tackled in a follow up, as the way the code is rewritten would allow it to be extended fairly trivially. |
15226dd
to
4bc43bf
Compare
Thanks for your pull request! This pull request appears to follow the contribution rules. › This message was automatically generated. |
….fromString Locales with scripts (e.g. zh-Hant-HK) were not parsed properly by LocaleTypeDescriptor, causing the script part to get lost.
4bc43bf
to
00c71e3
Compare
I don't like the approach taken here. I think what we should be doing here is offering a whole new way of storing Let's discuss it further on the issue. |
i.e. we should really just be using |
Locales with scripts (e.g. zh-Hant-HK) were not parsed properly by LocaleTypeDescriptor,
causing the script part to get lost.