-
Notifications
You must be signed in to change notification settings - Fork 22
feat: Support creating timestamp arrays #465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
d6ca76a to
b61a580
Compare
b61a580 to
533b90e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I think there are a few more test cases I'd like to see:
- Passing an array of tz-aware datetimes, each with a separate timezone
- Passing an array of partially tz-aware datetimes, some with a naive timezone
- Passing an array of tz-aware datetimes where the
typepassed has a naive timezone (not sure what happens here, maybe check pyarrow) - Passing an array of tz-aware datetimes where the passed
typehas a tz that is not the same as the datetimes that are passed in - Having partial and full nulls in the input array
Perhaps each of these test cases could also use the pyarrow constructor, so we know we're consistent with them
| .collect() | ||
| } | ||
| None => { | ||
| let vs: Vec<Option<chrono::NaiveDateTime>> = obj.extract()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This says that if the type passed in doesn't have a tz, then ignore any tz passed with the input data. That seems easily wrong if we have a variety of timezones in our array of input timestamps; I'm curious how pyarrow handles this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not ignore it, everything is converted to chrono::NaiveDateTime, so if there is a dt with tz it errors out:
dt = datetime(1999, 8, 7, 11, 12, 13, 141516)
tzinfo = zoneinfo.ZoneInfo('Europe/Madrid')
dt2: datetime = dt.astimezone(timezone(tzinfo.utcoffset(dt)))
arr = Array([dt, None, dt2], type=DataType.timestamp("ms"))
# E TypeError: expected a datetime without tzinfoIn pyarrow it seems to work this way:
Same datetime, one has a tz and array type is tz
dt = datetime(1999, 8, 7, 11, 12, 13, 141516)
tzinfo = zoneinfo.ZoneInfo('Europe/Madrid')
dt2: datetime = dt.astimezone(timezone(tzinfo.utcoffset(dt)))
arr = pyarrow.array([dt, None, dt2], type=pyarrow.timestamp('ms', 'Europe/Madrid'))
print(arr.type)It does not error out:
[
1999-08-07 11:12:13.141Z,
null,
1999-08-07 09:12:13.141Z
]
timestamp[ms, tz=Europe/Madrid]
It applies the given tz to all dates.
Back to python:
[
datetime.datetime(1999, 8, 7, 13, 12, 13, 141000, tzinfo=<DstTzInfo 'Europe/Madrid' CEST+2:00:00 DST>),
None,
datetime.datetime(1999, 8, 7, 11, 12, 13, 141000, tzinfo=<DstTzInfo 'Europe/Madrid' CEST+2:00:00 DST>)
]Both have applied tz.
Same datetime, one has tz and array type does not have tz
[
datetime.datetime(1999, 8, 7, 11, 12, 13, 141516),
None,
datetime.datetime(1999, 8, 7, 9, 12, 13, 141516)
]Both dates are still as is, but tz information is now lost.
So I guess the expected functionality would be to apply the given tz to all dates, if no tz exists just take the dates as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not error out:
[ 1999-08-07 11:12:13.141Z, null, 1999-08-07 09:12:13.141Z ] timestamp[ms, tz=Europe/Madrid]
That feels wrong...? You have two input datetimes that represent the same time. And when they are returned to Python they have the same time. But displayed from Rust they are displayed as different.
Edit: I didn't see that back in Python they are different
pyo3-arrow/src/array.rs
Outdated
| TimeUnit::Nanosecond => { | ||
| let values: Vec<_> = values | ||
| .iter() | ||
| .map(|v| v.unwrap().timestamp_nanos_opt().unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably the case where nanosecond timestamps overflow and error instead of panic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this again, can this actually ever overflow? DataType.timestamp coerces things into datetime objects:
arr = Array([18_446_744_073_709_551_61, None], type=DataType.timestamp("ns"))
# E TypeError: 'int' object cannot be cast as 'datetime'And datetime objects support max precision of 6 digits (microseconds), so can we actually create an object with precision ns that actually hits that branch? Perhaps we can delete it altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Came to me as I hit comment, we can cast from a u64 and that'll have ns precission.
arr = Array([18_446_744_073_709_551_61, None], type=DataType.uint64()).cast(DataType.timestamp("ns"))
print(arr)[
2028-06-15T09:33:27.370955161,
null,
]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, it does not panic when there is an overflow:
arr = Array([18_446_744_073_709_551_610, None], type=DataType.uint64()).cast(DataType.timestamp("ns"))
print(arr)It just nulls it:
[
null,
null,
]So what do we do with nano overflow?
Checking pyarrow it seems that they cannot overflow, since casting a uint64 to timestamp is not allowed:
arr = pyarrow.array([18_446_744_073_709_551_61, None], type=pyarrow.uint64()).cast(pyarrow.timestamp("ns"))
print(arr)E pyarrow.lib.ArrowNotImplementedError: Unsupported cast from uint64 to timestamp using function cast_timestamp
pyarrow/error.pxi:92: ArrowNotImplementedError
And if we try coerce an overflow:
dt = datetime(1999, 8, 7, 11, 12, 13, 141516)
arr = pyarrow.array([9_223_372_036_854_775_8070, None], type=pyarrow.int64()).cast(pyarrow.timestamp("ns"))
print(arr)We get an overflow error but that's not time timestampo nano overflowing, but the creation of the int64.
So it's up to us to decide, do we silently null it, trunc it to nano precision or just raise an error?
4d40446 to
d6746af
Compare
d6746af to
ab8b9cc
Compare
| Arc::new(StringViewArray::from(slices)) | ||
| } | ||
| DataType::Timestamp(unit, tz) => { | ||
| // We normalize all datetimes to datetimes in UTC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a more detailed comment here, explaining why this is valid?
Fixes #459
With this PR we now support creating arrays with precission and timezone support: