In memory specification? #4
Comments
Yes, let's keep an eye towards Arrow. What should we do, and what belongs in Arrow? Sparse tensor support in Arrow is still experimental. It probably makes sense to reach out at some point. Arrow currently has COO, CSR, CSC, and CSF. Notably, COO uses a single array instead of one array per dimension, and it has an "is_canonical" flag which indicates whether the COO indices are sorted lexicographically and have no duplicates. @ivirshup do you have a sense for how much more work it would be to define an in-memory specification? Should we limit our scope to in-memory data structures? For example, here is what Arrow has. For the structure:
SparseIndex: // Base class
const SparseTensorFormat::type format_id_;
COO:
std::shared_ptr<Tensor> coords_;
bool is_canonical_;
CSR/CSC:
std::shared_ptr<Tensor> indptr_;
std::shared_ptr<Tensor> indices_;
CSF:
std::vector<std::shared_ptr<Tensor>> indptr_;
std::vector<std::shared_ptr<Tensor>> indices_;
std::vector<int64_t> axis_order_;
For the tensor (has values):
SparseTensor:
std::shared_ptr<DataType> type_;
std::shared_ptr<Buffer> data_;
std::vector<int64_t> shape_;
std::shared_ptr<SparseIndex> sparse_index_;
// These names are optional
std::vector<std::string> dim_names_;
So it's pretty straightforward. The constructor signatures and other methods are more interesting. |
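To make the "is_canonical" flag and the CSR index arrays listed above concrete, here is a minimal Python sketch. It is not Arrow's API: the function names `is_canonical` and `coo_to_csr` are hypothetical, and plain Python lists stand in for Arrow tensors. It assumes the COO coordinates are stored as one sequence of (row, col) pairs, mirroring the single `coords_` array.

```python
from typing import List, Sequence, Tuple


def is_canonical(coords: Sequence[Tuple[int, ...]]) -> bool:
    """Return True if coords are strictly increasing in lexicographic order.

    Strictly increasing means sorted and free of duplicates, which is the
    property the is_canonical flag is described as recording.
    """
    return all(a < b for a, b in zip(coords, coords[1:]))


def coo_to_csr(
    coords: Sequence[Tuple[int, int]], n_rows: int
) -> Tuple[List[int], List[int]]:
    """Build CSR (indptr, indices) from canonical 2-D COO coordinates."""
    if not is_canonical(coords):
        raise ValueError("coords must be sorted and duplicate-free")
    indptr = [0] * (n_rows + 1)
    for row, _ in coords:
        indptr[row + 1] += 1          # count stored entries per row
    for i in range(n_rows):
        indptr[i + 1] += indptr[i]    # prefix sum gives row start offsets
    indices = [col for _, col in coords]  # column index of each stored entry
    return indptr, indices


# Hypothetical 3x4 matrix with entries at (0, 1), (0, 3) and (2, 0).
coords = [(0, 1), (0, 3), (2, 0)]
indptr, indices = coo_to_csr(coords, n_rows=3)
```

Because canonical COO is already in row-major order, the values buffer can be reused unchanged for CSR in this sketch; only the index arrays differ.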
I don't. I would note that the sparse tensor definitions here rely on 2D arrays, which also don't have great support. There also isn't a concept of chunked tensors here. |
As another point of data here, the tile-db single cell project is looking at arrow sparse tensors for their interchange interface: single-cell-data/SOMA#32 (comment) |
Another point of reference: the Python array API is using DLPack for in-memory interchange, including interchange between devices. I would like the array API to also account for sparse arrays, for which densifying is a terrible interchange mechanism. Ideally, I'd like the format we're defining here to be used for in-memory array interchange. |
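A rough back-of-the-envelope calculation shows why densifying is a poor interchange mechanism. The shape, number of stored values, and element sizes below are made-up illustrative numbers, and the helper names are hypothetical; the CSR byte count assumes one values array, one column-index array, and one row-pointer array.

```python
def dense_nbytes(shape, value_size):
    """Bytes needed to store every element of a dense array."""
    n = 1
    for dim in shape:
        n *= dim
    return n * value_size


def csr_nbytes(n_rows, nnz, value_size, index_size):
    """Bytes needed for CSR: values + column indices + row pointers."""
    # One value and one column index per stored entry,
    # plus n_rows + 1 row-pointer offsets.
    return nnz * (value_size + index_size) + (n_rows + 1) * index_size


shape = (100_000, 100_000)
nnz = 10_000_000  # 0.1% of the entries are stored

dense = dense_nbytes(shape, value_size=4)                       # float32 values
sparse = csr_nbytes(shape[0], nnz, value_size=4, index_size=8)  # 64-bit indices

print(f"dense: {dense / 1e9:.1f} GB, CSR: {sparse / 1e9:.2f} GB")
```

Under these assumptions the dense form needs 40 GB while the CSR arrays need about 0.12 GB, so a dense round trip costs a few hundred times more memory than exchanging the sparse arrays directly.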
Continuing the discussion here from data-apis/array-api#840 where I proposed exactly this. |
It seems to me like not a lot would be required to support in-memory sparse tensors. Binsparse describes how a sparse matrix is split up across one or more component binary arrays, so as long as you have some cross-platform array storage (e.g. Arrow or dlpack), I think it would work. There might be some related issues that are salient for in-memory formats: e.g., is JSON fast enough for storing metadata? (I really don't know here---it's possible parsing the JSON descriptor is not a bottleneck at all, but for in-memory interchange, which is meant to be zero-copy and thus faster than reading a file, it seems like it might affect performance. Perhaps something like BSON would offer faster parsing of the descriptor?) There are also potentially a wider variety of in-memory formats you might want to support (CSR4, BSR, etc.), but new formats are easy to add. 😃 |
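As a rough illustration of the point above, the following Python sketch pairs a JSON descriptor with a few in-memory component arrays and times how long parsing the descriptor takes. The descriptor keys are modelled on my reading of the Binsparse descriptor and should be treated as assumptions rather than a normative example; the standard-library `array` type stands in for Arrow or DLPack buffers, and the variable names are hypothetical.

```python
import json
import timeit
from array import array

# Descriptor for a small CSR matrix (key names are assumptions).
descriptor_text = json.dumps({
    "binsparse": {
        "version": "0.1",
        "format": "CSR",
        "shape": [3, 4],
        "number_of_stored_values": 3,
        "data_types": {
            "pointers_to_1": "uint64",
            "indices_1": "uint64",
            "values": "float32",
        },
    }
})

# Component arrays held in memory next to the descriptor.
pointers = array("Q", [0, 2, 2, 3])   # row pointers, n_rows + 1 entries
indices = array("Q", [1, 3, 0])       # column index of each stored value
values = array("f", [1.0, 2.0, 3.0])  # stored values

descriptor = json.loads(descriptor_text)["binsparse"]

# Time repeated parsing to get a feel for the per-exchange metadata cost.
n_runs = 10_000
seconds = timeit.timeit(lambda: json.loads(descriptor_text), number=n_runs)
per_parse_us = seconds / n_runs * 1e6
print(f"parsing the descriptor took about {per_parse_us:.2f} microseconds")
```

The measured time depends on the machine and the JSON library, so this only gives an order of magnitude to compare against the cost of whatever else happens during an exchange.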
Ben, could you add @hameerabbasi to the meetings? |
Sure thing—@hameerabbasi, could you send your email to [email protected]? |
A few thoughts on the metadata exchange:
|
Since the discussion of the in-memory format has progressed quite a bit, and the concept is accepted by both Python array library maintainers and binsparse maintainers, I'm reaching out to ask whether it's appropriate for the in-memory exchange format to live in the data-apis repo. As far as I can tell, the binsparse spec itself doesn't specify how the backends will store or exchange the specification or the arrays. In addition, for array libraries, the data-apis repo is more authoritative than this spec itself. There's an issue with the proposed signatures for in-memory exchange there: data-apis/array-api#840. |
Just realized that we discussed this in our weekly meetings but never added a response here. As we discussed, I think it makes sense for the concrete in-memory exchange format to live in the data-apis repo, since this is analogous to the way things currently work with backends. That is, we specify the Binsparse format here, while the actual specifics of how this is stored using a container (HDF5, Zarr, or an in-memory one) are left up to the backend. |
I agree that it's fine to move the in-memory spec to the python data-api repo, but I think we need to specify HDF5 and Zarr here in this repo since I don't know where else that would go. |
Good point; we should sit down and specify the HDF5/Zarr conventions in one of our next meetings. (I think we've discussed adding an additional note for each binary container.) |
Do we want to consider an in memory specification for sparse matrices?
In theory, arrow has defined these, but I think they are unsupported in practice. It may be useful to keep an eye towards a definition that could be integrated with arrow.