Commit c59f51b

yaooqinn authored and cloud-fan committed
[SPARK-31879][SQL] Using GB as default Locale for datetime formatters
### What changes were proposed in this pull request?

This PR switches the default Locale from `US` to `GB`, changing the first day of the week from Sunday-started back to Monday-started, the same as in v2.4.

### Why are the changes needed?

#### Cases

```sql
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-29 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-30 00:00:00
```

#### Reasons

Week-based fields need a Locale to express their semantics; the first day of the week varies from country to country. From the Javadoc of `WeekFields`:

```java
/**
 * Gets the first day-of-week.
 * <p>
 * The first day-of-week varies by culture.
 * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
 * This method returns the first day using the standard {@code DayOfWeek} enum.
 *
 * @return the first day-of-week, not null
 */
public DayOfWeek getFirstDayOfWeek() {
    return firstDayOfWeek;
}
```

But for `SimpleDateFormat`, the day-of-week is not localized:

```
u    Day number of week (1 = Monday, ..., 7 = Sunday)    Number    1
```

Currently, the default locale we use is US, so the result is moved one day backward. For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671).

With this change, the first-day-of-week calculation for these functions is restored when using the default locale.

### Does this PR introduce _any_ user-facing change?

Yes, but the behavior change restores the v2.4 behavior.

### How was this patch tested?

Added unit tests.

Closes apache#28692 from yaooqinn/SPARK-31879.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
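As an aside, the locale dependence above can be reproduced with plain java.time outside of Spark. The sketch below is a minimal standalone approximation of the week-based parsing in the cases section; it does not go through Spark's own formatter construction, and `parseWeekBased` is a helper introduced here purely for illustration:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatterBuilder
import java.time.temporal.WeekFields
import java.util.Locale

// Parse "<week-based-year>-<week-of-year>-<day-of-week>" using the week rules
// (first day-of-week, minimal days in the first week) of the given locale.
def parseWeekBased(text: String, locale: Locale): LocalDate = {
  val weekFields = WeekFields.of(locale)
  val formatter = new DateTimeFormatterBuilder()
    .appendValue(weekFields.weekBasedYear(), 4)
    .appendLiteral('-')
    .appendValue(weekFields.weekOfWeekBasedYear())
    .appendLiteral('-')
    .appendValue(weekFields.dayOfWeek())
    .toFormatter(locale)
  LocalDate.parse(text, formatter)
}

parseWeekBased("2020-1-1", Locale.US)              // 2019-12-29: US weeks start on Sunday
parseWeekBased("2020-1-1", new Locale("en", "GB")) // 2019-12-30: GB weeks start on Monday, as in v2.4
```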
1 parent baafd43 · commit c59f51b

File tree

6 files changed: +69 -5 lines changed


sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala

Lines changed: 7 additions & 1 deletion
@@ -117,7 +117,13 @@ class LegacySimpleDateFormatter(pattern: String, locale: Locale) extends LegacyD
 object DateFormatter {
   import LegacyDateFormats._

-  val defaultLocale: Locale = Locale.US
+  /**
+   * Before Spark 3.0, the first day-of-week is always Monday. Since Spark 3.0, it depends on the
+   * locale.
+   * We pick GB as the default locale instead of US, to be compatible with Spark 2.x, as US locale
+   * uses Sunday as the first day-of-week. See SPARK-31879.
+   */
+  val defaultLocale: Locale = new Locale("en", "GB")

   val defaultPattern: String = "yyyy-MM-dd"
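As a side note (not part of the patch), `new Locale("en", "GB")` is the same locale as the JDK constant `Locale.UK`, and its week rules do start on Monday, unlike `Locale.US`. A quick REPL sketch, for illustration only:

```scala
import java.time.DayOfWeek
import java.time.temporal.WeekFields
import java.util.Locale

val gb = new Locale("en", "GB")
assert(gb == Locale.UK)                                          // same language/country pair as the JDK constant
assert(WeekFields.of(gb).getFirstDayOfWeek == DayOfWeek.MONDAY)  // Monday-first, matching Spark 2.x
assert(WeekFields.of(Locale.US).getFirstDayOfWeek == DayOfWeek.SUNDAY)
```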

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala

Lines changed: 7 additions & 1 deletion
@@ -278,7 +278,13 @@ object LegacyDateFormats extends Enumeration {
 object TimestampFormatter {
   import LegacyDateFormats._

-  val defaultLocale: Locale = Locale.US
+  /**
+   * Before Spark 3.0, the first day-of-week is always Monday. Since Spark 3.0, it depends on the
+   * locale.
+   * We pick GB as the default locale instead of US, to be compatible with Spark 2.x, as US locale
+   * uses Sunday as the first day-of-week. See SPARK-31879.
+   */
+  val defaultLocale: Locale = new Locale("en", "GB")

   def defaultPattern(): String = s"${DateFormatter.defaultPattern} HH:mm:ss"

sql/core/src/test/resources/sql-tests/inputs/datetime.sql

Lines changed: 4 additions & 0 deletions
@@ -164,3 +164,7 @@ select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy
 select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
 select date_format(timestamp '2018-11-17 13:33:33', 'yyyyyyyyyy-MM-dd HH:mm:ss');
 select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd');
+
+-- SPARK-31879: the first day of week
+select date_format('2020-01-01', 'YYYY-MM-dd uu');
+select date_format('2020-01-01', 'YYYY-MM-dd uuuu');

sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out

Lines changed: 17 additions & 1 deletion
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 119
+-- Number of queries: 121


 -- !query
@@ -1025,3 +1025,19 @@ struct<>
 -- !query output
 org.apache.spark.SparkUpgradeException
 You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-dd' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select date_format('2020-01-01', 'YYYY-MM-dd uu')
+-- !query schema
+struct<date_format(CAST(2020-01-01 AS TIMESTAMP), YYYY-MM-dd uu):string>
+-- !query output
+2020-01-01 03
+
+
+-- !query
+select date_format('2020-01-01', 'YYYY-MM-dd uuuu')
+-- !query schema
+struct<date_format(CAST(2020-01-01 AS TIMESTAMP), YYYY-MM-dd uuuu):string>
+-- !query output
+2020-01-01 Wednesday
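The two new results follow from the localized day-of-week: 2020-01-01 is a Wednesday, i.e. day 3 of a Monday-first (GB) week but day 4 of a Sunday-first (US) week, and the four-letter form prints the day name. A rough standalone java.time equivalent, using the pattern letter 'e' for the localized day-of-week (illustrative only, not Spark's own pattern handling):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val wednesday = LocalDate.of(2020, 1, 1)
// The localized day-of-week number depends on which day the locale's week starts on.
DateTimeFormatter.ofPattern("ee", Locale.UK).format(wednesday)   // "03" (Monday-first)
DateTimeFormatter.ofPattern("ee", Locale.US).format(wednesday)   // "04" (Sunday-first)
DateTimeFormatter.ofPattern("eeee", Locale.UK).format(wednesday) // "Wednesday"
```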

sql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out

Lines changed: 17 additions & 1 deletion
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 119
+-- Number of queries: 121


 -- !query
@@ -980,3 +980,19 @@ select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd')
 struct<date_format(CAST(DATE '2018-11-17' AS TIMESTAMP), yyyyyyyyyyy-MM-dd):string>
 -- !query output
 00000002018-11-17
+
+
+-- !query
+select date_format('2020-01-01', 'YYYY-MM-dd uu')
+-- !query schema
+struct<date_format(CAST(2020-01-01 AS TIMESTAMP), YYYY-MM-dd uu):string>
+-- !query output
+2020-01-01 03
+
+
+-- !query
+select date_format('2020-01-01', 'YYYY-MM-dd uuuu')
+-- !query schema
+struct<date_format(CAST(2020-01-01 AS TIMESTAMP), YYYY-MM-dd uuuu):string>
+-- !query output
+2020-01-01 0003
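For comparison, the legacy results above come from `SimpleDateFormat`, where 'u' is always the Monday-based day number (1 = Monday ... 7 = Sunday) regardless of locale and never switches to a text form, hence "0003" rather than "Wednesday" for the four-letter pattern. A small sketch of that legacy behavior, for illustration only (not Spark's legacy formatter wiring):

```scala
import java.text.SimpleDateFormat
import java.util.Locale

// In SimpleDateFormat, 'Y' is the week year and 'u' the Monday-based day number;
// repeating 'u' only zero-pads the number instead of producing a day name.
val date = new SimpleDateFormat("yyyy-MM-dd", Locale.US).parse("2020-01-01")
new SimpleDateFormat("YYYY-MM-dd uu", Locale.US).format(date)    // "2020-01-01 03"
new SimpleDateFormat("YYYY-MM-dd uuuu", Locale.US).format(date)  // "2020-01-01 0003"
```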

sql/core/src/test/resources/sql-tests/results/datetime.sql.out

Lines changed: 17 additions & 1 deletion
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 119
+-- Number of queries: 121


 -- !query
@@ -997,3 +997,19 @@ struct<>
 -- !query output
 org.apache.spark.SparkUpgradeException
 You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-dd' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select date_format('2020-01-01', 'YYYY-MM-dd uu')
+-- !query schema
+struct<date_format(CAST(2020-01-01 AS TIMESTAMP), YYYY-MM-dd uu):string>
+-- !query output
+2020-01-01 03
+
+
+-- !query
+select date_format('2020-01-01', 'YYYY-MM-dd uuuu')
+-- !query schema
+struct<date_format(CAST(2020-01-01 AS TIMESTAMP), YYYY-MM-dd uuuu):string>
+-- !query output
+2020-01-01 Wednesday
