Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HHH-14032 Parse Locales with scripts properly in LocaleTypeDescriptor.fromString #3410

Open
wants to merge 1 commit into
base: 5.6
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -51,43 +51,33 @@ public Locale fromString(String string) {
return Locale.ROOT;
}

boolean separatorFound = false;
int position = 0;
char[] chars = string.toCharArray();

for ( int i = 0; i < chars.length; i++ ) {
// We just look for separators
if ( chars[i] == '_' ) {
if ( !separatorFound ) {
// On the first separator we know that we have at least a language
string = new String( chars, position, i - position );
position = i + 1;
}
else {
// On the second separator we have to check whether there are more chars available for variant
if ( chars.length > i + 1 ) {
// There is a variant so add it to the constructor
return new Locale( string, new String( chars, position, i - position ), new String( chars,
i + 1, chars.length - i - 1 ) );
Locale.Builder builder = new Locale.Builder();
String[] parts = string.split("_");
Copy link
Contributor

@NathanQingyangXu NathanQingyangXu Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From performance perspective, we should avoid using String.split() for it will create RE and might incur unnecessary performance cost. There should be better way to replace the above split() usage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can use the StringHelper#split for better performance.

Copy link
Author

@kilink kilink Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had originally used StringTokenizer, which should have better performance, but its Javadoc says the following:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

That said, even in the latest JDK the class is technically not deprecated, so not sure if the above advice should be strictly heeded.

Looking at the suggested StringHelper#split method, it actually does use StringTokenizer internally. Depending on how careful we want to be about avoiding extra allocations, we can use StringTokenizer directly and avoid the array allocation. I don't have a strong opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for knowledge sharing. I have no idea that StringTokenizer is not recommended. I took a look of the internal implementation of String#split() and found it has a fastpath for simple delimiter, so I think your current approach is perfectly fine.
Screen Shot 2020-06-04 at 4 42 41 PM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The split method accepts a regex which gets compiled at runtime, it would be best to declare the regex as a static final constant in the class, and use that to split strings.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sanne as mentioned in the above comment, the split method has a fast path when splitting on a single character that avoids compiling a regex.


for (int i = 0; i < parts.length; i++) {
String s = parts[i];
switch (i) {
case 0:
builder.setLanguage(s);
break;
case 1:
builder.setRegion(s);
break;
case 2:
if (i < parts.length - 1 || !s.startsWith("#")) {
builder.setVariant(s);
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to adhere to hibernate code format (see https://github.com/hibernate/hibernate-ide-codestyles). Also, existing code convention uses final keyword as much as possible.

}
else {
// No variant given, we just have language and country
return new Locale( string, new String( chars, position, i - position ), "" );
case 3:
if (s.startsWith("#")) {
Copy link
Member

@Sanne Sanne Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When needing to compare with characters, I'd prefer to use the char API:

string.charAt(0) == '#'

s = s.substring(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be ideal to add break for case 3 for best practice. Imagine in the future a new case is needed and it might create a subtle bug that the coder forgets to insert break for case 3.

Copy link
Contributor

@NathanQingyangXu NathanQingyangXu Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gist of this PR seems to lie in the above case 2 and case 3. There is subtlety here that if s is the last token or s starts with # in case 2, it is equivalent to case 3. It would be great to add some comment like // intentionally fall through before case 3 to help future code maintainer to understand the subtlety here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, there must be still room for code readability improvement by clearer logic. The break case falling through is to difficult to understand.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the above code is less than ideal. Adding a break to case 2 would mean duplicating the code from case 3, I don't know which one is the lesser of evils. We could perhaps just avoid the switch altogether and unroll it into 4 if statements.

It could perhaps be rewritten to mimic this pattern as seen in LanguageTag, which would require using something like StringTokenizer. It is unfortunate that this code cannot simply use toLanguageTag / forLanguageTag...

Copy link
Contributor

@NathanQingyangXu NathanQingyangXu Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, using StringTokenizer is perfectly fine in Hibernate core codebase (it has been used more than 50 after a quick search). Code readability is a much bigger concern. The current switch case falling through is a clever approach but might make future code maintaining much more difficult.

}
}

separatorFound = true;
builder.setScript(s);
break;
}
}

if ( !separatorFound ) {
// No separator found, there is only a language
return new Locale( string );
}
else {
// Only one separator found, there is a language and a country
return new Locale( string, new String( chars, position, chars.length - position ) );
}
return builder.build();
}

@SuppressWarnings({ "unchecked" })
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -25,18 +25,19 @@ public class LocaleTypeDescriptorTest extends BaseUnitTestCase {

@Test
public void testConversionFromString() {
assertEquals( toLocale( "de", null, null ), LocaleTypeDescriptor.INSTANCE.fromString( "de" ) );
assertEquals( toLocale( "de", "DE", null ), LocaleTypeDescriptor.INSTANCE.fromString( "de_DE" ) );
assertEquals( toLocale( null, "DE", null ), LocaleTypeDescriptor.INSTANCE.fromString( "_DE" ) );
assertEquals( toLocale( null, null, "ch123" ), LocaleTypeDescriptor.INSTANCE.fromString( "__ch123" ) );
assertEquals( toLocale( null, "DE", "ch123" ), LocaleTypeDescriptor.INSTANCE.fromString( "_DE_ch123" ) );
assertEquals( toLocale( "de", null, "ch123" ), LocaleTypeDescriptor.INSTANCE.fromString( "de__ch123" ) );
assertEquals( toLocale( "de", "DE", "ch123" ), LocaleTypeDescriptor.INSTANCE.fromString( "de_DE_ch123" ) );
assertEquals( toLocale( "", "", "" ), LocaleTypeDescriptor.INSTANCE.fromString( "" ) );
assertEquals( toLocale( "de", null, null, null ), LocaleTypeDescriptor.INSTANCE.fromString( "de" ) );
assertEquals( toLocale( "de", "DE", null, null ), LocaleTypeDescriptor.INSTANCE.fromString( "de_DE" ) );
assertEquals( toLocale( null, "DE", null, null ), LocaleTypeDescriptor.INSTANCE.fromString( "_DE" ) );
assertEquals( toLocale( null, null, "ch123", null ), LocaleTypeDescriptor.INSTANCE.fromString( "__ch123" ) );
assertEquals( toLocale( null, "DE", "ch123", null ), LocaleTypeDescriptor.INSTANCE.fromString( "_DE_ch123" ) );
assertEquals( toLocale( "de", null, "ch123", null ), LocaleTypeDescriptor.INSTANCE.fromString( "de__ch123" ) );
assertEquals( toLocale( "de", "DE", "ch123", null ), LocaleTypeDescriptor.INSTANCE.fromString( "de_DE_ch123" ) );
assertEquals( toLocale( "zh", "HK", null, "Hant"), LocaleTypeDescriptor.INSTANCE.fromString( "zh_HK_#Hant" ) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the above testing case is the sole motivation of this PR. Could we avoid touching existing code logic by a simple patch as following:

  • test whether the string ends with '#xxx'
  • if so, extract the xxx as script and drop the suffix from the string;
  • pass the string using original logic.

Seems a better alternative than the current tricky switch implementation. Also, it is easy to understand and verify, so is more ideal from code maintaining perspective.

Copy link
Author

@kilink kilink Jan 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing code needs to be rewritten regardless, as the only way to set script or extensions is by switching to LocaleBuilder.

Also, looking at the Javadoc / implementation for Locale.toString(), we can see the pattern is different based on whether the Locale has a variant or not (e.g., with a variant, script and extension are separated from the rest with "_#" vs just the "#" when no variant is present). Therefore I think the approach of testing if the string ends with "#" would need a special case for "_" anyway. I'll explore the more generic approach with just using a StringTokenizer or the like, if the main concern is the switch statement here?

assertEquals( toLocale( "", "", "", null ), LocaleTypeDescriptor.INSTANCE.fromString( "" ) );
assertEquals( Locale.ROOT, LocaleTypeDescriptor.INSTANCE.fromString( "" ) );
}

public Locale toLocale(String lang, String region, String variant) {
public Locale toLocale(String lang, String region, String variant, String script) {
final Locale.Builder builder = new Locale.Builder();
if ( StringHelper.isNotEmpty( lang ) ) {
builder.setLanguage( lang );
Expand All @@ -47,6 +48,9 @@ public Locale toLocale(String lang, String region, String variant) {
if ( StringHelper.isNotEmpty( variant ) ) {
builder.setVariant( variant );
}
if ( StringHelper.isNotEmpty( script ) ) {
builder.setScript( script );
}
return builder.build();
}
}