Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HHH-14032 Parse Locales with scripts properly in LocaleTypeDescriptor.fromString #3410

Open
wants to merge 1 commit into
base: 5.6
Choose a base branch
from

Conversation

kilink
Copy link

@kilink kilink commented May 19, 2020

Locales with scripts (e.g. zh-Hant-HK) were not parsed properly by LocaleTypeDescriptor,
causing the script part to get lost.

@kilink kilink force-pushed the locales-with-scripts-fix branch from 9db2c0b to 0e858dd Compare May 20, 2020 18:55
return new Locale( string, new String( chars, position, i - position ), new String( chars,
i + 1, chars.length - i - 1 ) );
Locale.Builder builder = new Locale.Builder();
String[] parts = string.split("_");
Copy link
Contributor

@NathanQingyangXu NathanQingyangXu Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From performance perspective, we should avoid using String.split() for it will create RE and might incur unnecessary performance cost. There should be better way to replace the above split() usage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can use the StringHelper#split for better performance.

Copy link
Author

@kilink kilink Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had originally used StringTokenizer, which should have better performance, but its Javadoc says the following:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

That said, even in the latest JDK the class is technically not deprecated, so not sure if the above advice should be strictly heeded.

Looking at the suggested StringHelper#split method, it actually does use StringTokenizer internally. Depending on how careful we want to be about avoiding extra allocations, we can use StringTokenizer directly and avoid the array allocation. I don't have a strong opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for knowledge sharing. I have no idea that StringTokenizer is not recommended. I took a look of the internal implementation of String#split() and found it has a fastpath for simple delimiter, so I think your current approach is perfectly fine.
Screen Shot 2020-06-04 at 4 42 41 PM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The split method accepts a regex which gets compiled at runtime, it would be best to declare the regex as a static final constant in the class, and use that to split strings.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sanne as mentioned in the above comment, the split method has a fast path when splitting on a single character that avoids compiling a regex.

return new Locale( string, new String( chars, position, i - position ), "" );
case 3:
if (s.startsWith("#")) {
s = s.substring(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be ideal to add break for case 3 for best practice. Imagine in the future a new case is needed and it might create a subtle bug that the coder forgets to insert break for case 3.

Copy link
Contributor

@NathanQingyangXu NathanQingyangXu Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gist of this PR seems to lie in the above case 2 and case 3. There is subtlety here that if s is the last token or s starts with # in case 2, it is equivalent to case 3. It would be great to add some comment like // intentionally fall through before case 3 to help future code maintainer to understand the subtlety here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, there must be still room for code readability improvement by clearer logic. The break case falling through is to difficult to understand.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the above code is less than ideal. Adding a break to case 2 would mean duplicating the code from case 3, I don't know which one is the lesser of evils. We could perhaps just avoid the switch altogether and unroll it into 4 if statements.

It could perhaps be rewritten to mimic this pattern as seen in LanguageTag, which would require using something like StringTokenizer. It is unfortunate that this code cannot simply use toLanguageTag / forLanguageTag...

Copy link
Contributor

@NathanQingyangXu NathanQingyangXu Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, using StringTokenizer is perfectly fine in Hibernate core codebase (it has been used more than 50 after a quick search). Code readability is a much bigger concern. The current switch case falling through is a clever approach but might make future code maintaining much more difficult.

@kilink kilink force-pushed the locales-with-scripts-fix branch from 0e858dd to ad42116 Compare June 4, 2020 02:45
assertEquals( toLocale( null, "DE", "ch123", null ), LocaleTypeDescriptor.INSTANCE.fromString( "_DE_ch123" ) );
assertEquals( toLocale( "de", null, "ch123", null ), LocaleTypeDescriptor.INSTANCE.fromString( "de__ch123" ) );
assertEquals( toLocale( "de", "DE", "ch123", null ), LocaleTypeDescriptor.INSTANCE.fromString( "de_DE_ch123" ) );
assertEquals( toLocale( "zh", "HK", null, "Hant"), LocaleTypeDescriptor.INSTANCE.fromString( "zh_HK_#Hant" ) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the above testing case is the sole motivation of this PR. Could we avoid touching existing code logic by a simple patch as following:

  • test whether the string ends with '#xxx'
  • if so, extract the xxx as script and drop the suffix from the string;
  • pass the string using original logic.

Seems a better alternative than the current tricky switch implementation. Also, it is easy to understand and verify, so is more ideal from code maintaining perspective.

Copy link
Author

@kilink kilink Jan 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing code needs to be rewritten regardless, as the only way to set script or extensions is by switching to LocaleBuilder.

Also, looking at the Javadoc / implementation for Locale.toString(), we can see the pattern is different based on whether the Locale has a variant or not (e.g., with a variant, script and extension are separated from the rest with "_#" vs just the "#" when no variant is present). Therefore I think the approach of testing if the string ends with "#" would need a special case for "_" anyway. I'll explore the more generic approach with just using a StringTokenizer or the like, if the main concern is the switch statement here?

case 2:
if (i < parts.length - 1 || !s.startsWith("#")) {
builder.setVariant(s);
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to adhere to hibernate code format (see https://github.com/hibernate/hibernate-ide-codestyles). Also, existing code convention uses final keyword as much as possible.

// No variant given, we just have language and country
return new Locale( string, new String( chars, position, i - position ), "" );
case 3:
if (s.startsWith("#")) {
Copy link
Member

@Sanne Sanne Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When needing to compare with characters, I'd prefer to use the char API:

string.charAt(0) == '#'

@kilink kilink force-pushed the locales-with-scripts-fix branch from ad42116 to 2f3ae2f Compare January 11, 2021 03:20
@kilink kilink force-pushed the locales-with-scripts-fix branch from 2f3ae2f to 15226dd Compare January 24, 2021 20:10
@kilink
Copy link
Author

kilink commented Jan 24, 2021

Sorry for the delay with this. I've gone with a different approach here that does not use a loop or switch statement at all. One caveat is that I decided not to attempt to parse extensions at all, as that is its own little rabbit hole; this PR strictly adds support for parsing scripts. I am not sure how prevalent bcp47 extensions are in practice, but I am assuming that script is more widely used and important. Extensions could be tackled in a follow up, as the way the code is rewritten would allow it to be extended fairly trivially.

@kilink kilink force-pushed the locales-with-scripts-fix branch from 15226dd to 4bc43bf Compare January 24, 2021 21:12
Base automatically changed from master to main March 19, 2021 16:00
@hibernate-github-bot
Copy link

hibernate-github-bot bot commented Mar 19, 2021

Thanks for your pull request!

This pull request appears to follow the contribution rules.

› This message was automatically generated.

….fromString

Locales with scripts (e.g. zh-Hant-HK) were not parsed properly by LocaleTypeDescriptor,
causing the script part to get lost.
@kilink kilink force-pushed the locales-with-scripts-fix branch from 4bc43bf to 00c71e3 Compare March 24, 2021 17:46
@gavinking
Copy link
Member

I don't like the approach taken here. I think what we should be doing here is offering a whole new way of storing Locales based on the language tag. Unfortunately, it doesn't look like we can do that in a completely backward compatible way, as I commented on the issue. So we'll need some way for the user to opt in to it. That's unfortunate.

Let's discuss it further on the issue.

@gavinking
Copy link
Member

i.e. we should really just be using toLanguageTag() and forLanguageTag() instead of parsing locales in our own code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants