-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove ViewedArray #2329
Comments
That said, I think I might prefer the model of having an For example, we can make assertions that impls are behaving correctly, or implement common behaviour such as bounds checking or short-circuiting functions. struct Array(Arc<dyn ArrayImpl>);
trait ArrayImpl {
fn is_all_valid(&self) -> bool;
}
impl Array {
pub fn is_all_valid(&self) -> bool {
if !self.dtype().is_nullable() {
return true;
}
self.0.is_all_valid()
}
} |
Just to write down the counter-point, in the current design, an Array represents data, and the vtable operates over these uniform array structs. This means the FFI format, the serde format, and the in-memory format can all look very very similar, and have the same VTable operate over all three. In this proposal, the on-disk and FFI formats would represent data, and would have to go through a deserialization step in order to be converted to the opaque pointer for a given array vtable. This means the vtable implementation is tied to the in-memory format. Vs being purely tied to the Array API (for example, I could re-implement a part of an encoding in another language, and operate over the same in-memory pointer). |
If there is performance benefit to ViewedArrays, we could always just |
The idea of a
ViewedArray
was to avoid costly allocations during reads for either deep encoding trees or very wide struct arrays where many of the descendent arrays are never read. We therefore have an array that looks (roughly) like this:In practice, we ended up designing an alternate structure to model lazy access to children: Layouts. These have a smaller VTable (no compute functions), and children are lazily loaded (using async in Rust).
Further, after (after #2247), we will in general have much smaller encoding trees, ~1-10 arrays, rather than ~20-30.
Our current Array requires that all implementations hold the same form, a unit struct wrapper around an Array, and they must access their buffers, metadata, and children via generic APIs. This often means doing repeated work (e.g. downcasting DType to PType for PrimitiveArray). It can also mean many heap allocations, each array holds a vec of children, a vec of buffers, a vec of metadata bytes, an arc of statistics.
The current ViewedArray also isn't ideally laid out. It actually performs all those heap allocations every time a child is accessed!
This proposal is to move back towards a more conventional Rust design, with explicit FFI vtables. Each array can define its own struct that implements an
Array
trait and we pass around anArc<dyn Array>
.We keep the idea of
ArrayData
, that looks like the currentViewedArray
struct. It holds what will become something similar to our FFI format. Each encoding has a singlefrom_array_data(ArrayData) -> Arc<dyn Array>
function. This will recursively load arrays and perform basic validation once when loading from disk.Notice a second benefit which is that arrays can downcast their children into strongly typed trait impls during construction, making compute functions a little cleaner to write.
The text was updated successfully, but these errors were encountered: