Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ViewedArray #2329

Open
gatesn opened this issue Feb 12, 2025 · 3 comments
Open

Remove ViewedArray #2329

gatesn opened this issue Feb 12, 2025 · 3 comments

Comments

@gatesn
Copy link
Contributor

gatesn commented Feb 12, 2025

The idea of a ViewedArray was to avoid costly allocations during reads for either deep encoding trees or very wide struct arrays where many of the descendent arrays are never read. We therefore have an array that looks (roughly) like this:

enum Array {
  Owned(OwnedArray), // A tree of heap-allocated Array
  Viewed(ViewedArray), // A single array with a zero-copy buffers and a flatbuffer definition of the tree
}

In practice, we ended up designing an alternate structure to model lazy access to children: Layouts. These have a smaller VTable (no compute functions), and children are lazily loaded (using async in Rust).

Further, after (after #2247), we will in general have much smaller encoding trees, ~1-10 arrays, rather than ~20-30.

Our current Array requires that all implementations hold the same form, a unit struct wrapper around an Array, and they must access their buffers, metadata, and children via generic APIs. This often means doing repeated work (e.g. downcasting DType to PType for PrimitiveArray). It can also mean many heap allocations, each array holds a vec of children, a vec of buffers, a vec of metadata bytes, an arc of statistics.

The current ViewedArray also isn't ideally laid out. It actually performs all those heap allocations every time a child is accessed!


This proposal is to move back towards a more conventional Rust design, with explicit FFI vtables. Each array can define its own struct that implements an Array trait and we pass around an Arc<dyn Array>.

struct PrimitiveArray<T: NativePType> {
  buffer: Buffer<T>,
  stats: StatsSet,
}

trait PrimitiveTypeArray<T: NativePType> {
  ...
}

struct FoRArray<T: FoRPType> {
    child: Arc<dyn PrimitiveArray<T::Unsigned>>,
}

We keep the idea of ArrayData, that looks like the current ViewedArray struct. It holds what will become something similar to our FFI format. Each encoding has a single from_array_data(ArrayData) -> Arc<dyn Array> function. This will recursively load arrays and perform basic validation once when loading from disk.

Notice a second benefit which is that arrays can downcast their children into strongly typed trait impls during construction, making compute functions a little cleaner to write.

@gatesn
Copy link
Contributor Author

gatesn commented Feb 13, 2025

That said, I think I might prefer the model of having an ArrayImpl trait, which allows us to control the exposed API, or "entry point" functions as we currently call them.

For example, we can make assertions that impls are behaving correctly, or implement common behaviour such as bounds checking or short-circuiting functions.

struct Array(Arc<dyn ArrayImpl>);

trait ArrayImpl {
  fn is_all_valid(&self) -> bool;
}

impl Array {
  pub fn is_all_valid(&self) -> bool {
    if !self.dtype().is_nullable() { 
      return true;
    }
    self.0.is_all_valid()
  }
}

@gatesn
Copy link
Contributor Author

gatesn commented Feb 14, 2025

Just to write down the counter-point, in the current design, an Array represents data, and the vtable operates over these uniform array structs. This means the FFI format, the serde format, and the in-memory format can all look very very similar, and have the same VTable operate over all three.

In this proposal, the on-disk and FFI formats would represent data, and would have to go through a deserialization step in order to be converted to the opaque pointer for a given array vtable. This means the vtable implementation is tied to the in-memory format. Vs being purely tied to the Array API (for example, I could re-implement a part of an encoding in another language, and operate over the same in-memory pointer).

@gatesn
Copy link
Contributor Author

gatesn commented Feb 17, 2025

If there is performance benefit to ViewedArrays, we could always just impl ArrayTrait for ViewedArray and defer the deserialization in that way?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant