Remove ViewedArray #2329

gatesn · 2025-02-12T11:34:43Z

The idea of a ViewedArray was to avoid costly allocations during reads for either deep encoding trees or very wide struct arrays where many of the descendent arrays are never read. We therefore have an array that looks (roughly) like this:

enum Array {
  Owned(OwnedArray), // A tree of heap-allocated Array
  Viewed(ViewedArray), // A single array with a zero-copy buffers and a flatbuffer definition of the tree
}

In practice, we ended up designing an alternate structure to model lazy access to children: Layouts. These have a smaller VTable (no compute functions), and children are lazily loaded (using async in Rust).

Further, after (after #2247), we will in general have much smaller encoding trees, ~1-10 arrays, rather than ~20-30.

Our current Array requires that all implementations hold the same form, a unit struct wrapper around an Array, and they must access their buffers, metadata, and children via generic APIs. This often means doing repeated work (e.g. downcasting DType to PType for PrimitiveArray). It can also mean many heap allocations, each array holds a vec of children, a vec of buffers, a vec of metadata bytes, an arc of statistics.

The current ViewedArray also isn't ideally laid out. It actually performs all those heap allocations every time a child is accessed!

This proposal is to move back towards a more conventional Rust design, with explicit FFI vtables. Each array can define its own struct that implements an Array trait and we pass around an Arc<dyn Array>.

struct PrimitiveArray<T: NativePType> {
  buffer: Buffer<T>,
  stats: StatsSet,
}

trait PrimitiveTypeArray<T: NativePType> {
  ...
}

struct FoRArray<T: FoRPType> {
    child: Arc<dyn PrimitiveArray<T::Unsigned>>,
}

We keep the idea of ArrayData, that looks like the current ViewedArray struct. It holds what will become something similar to our FFI format. Each encoding has a single from_array_data(ArrayData) -> Arc<dyn Array> function. This will recursively load arrays and perform basic validation once when loading from disk.

Notice a second benefit which is that arrays can downcast their children into strongly typed trait impls during construction, making compute functions a little cleaner to write.

The text was updated successfully, but these errors were encountered:

gatesn · 2025-02-13T08:51:54Z

That said, I think I might prefer the model of having an ArrayImpl trait, which allows us to control the exposed API, or "entry point" functions as we currently call them.

For example, we can make assertions that impls are behaving correctly, or implement common behaviour such as bounds checking or short-circuiting functions.

struct Array(Arc<dyn ArrayImpl>);

trait ArrayImpl {
  fn is_all_valid(&self) -> bool;
}

impl Array {
  pub fn is_all_valid(&self) -> bool {
    if !self.dtype().is_nullable() { 
      return true;
    }
    self.0.is_all_valid()
  }
}

gatesn · 2025-02-14T19:57:38Z

Just to write down the counter-point, in the current design, an Array represents data, and the vtable operates over these uniform array structs. This means the FFI format, the serde format, and the in-memory format can all look very very similar, and have the same VTable operate over all three.

In this proposal, the on-disk and FFI formats would represent data, and would have to go through a deserialization step in order to be converted to the opaque pointer for a given array vtable. This means the vtable implementation is tied to the in-memory format. Vs being purely tied to the Array API (for example, I could re-implement a part of an encoding in another language, and operate over the same in-memory pointer).

gatesn · 2025-02-17T08:52:40Z

If there is performance benefit to ViewedArrays, we could always just impl ArrayTrait for ViewedArray and defer the deserialization in that way?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove ViewedArray #2329

Remove ViewedArray #2329

gatesn commented Feb 12, 2025

gatesn commented Feb 13, 2025

gatesn commented Feb 14, 2025

gatesn commented Feb 17, 2025

Remove ViewedArray #2329

Remove ViewedArray #2329

Comments

gatesn commented Feb 12, 2025

gatesn commented Feb 13, 2025

gatesn commented Feb 14, 2025

gatesn commented Feb 17, 2025