Description
What would you like to see added to PyNWB?
In a conversation with @satra and @oruebel at the BRAIN meeting, we discussed syntax differences in pynwb that are the result of using different backends. This has the potential to cause bugs and complications for downstream code. I am aware of three areas in PyNWB where differences appear.
- **IO for reading and writing files:** A user needs to decide between `NWBHDF5IO` and `NWBZarrIO` in order to read or write a file. The writing part doesn't seem that bad, since the user will be required to indicate the backend of choice somehow, and using a different class seems like a reasonable way to do that. For read, however, you might prefer that PyNWB determine the backend automatically; otherwise, we are going to need to cover the read classes for all of the different possible backends in tutorials. IMHO it wouldn't hurt to have an `NWBIO` class that can automatically determine the backend. This is a relatively simple solution and would make the pynwb library more user-friendly. Relevant issues: NWBIO #858
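To illustrate why read-time auto-detection is feasible, the on-disk formats are easy to tell apart without opening them through either IO class. A minimal sketch, assuming a hypothetical `detect_backend` helper (not part of PyNWB), using only the stdlib:

```python
import os
import tempfile

# The 8-byte signature that begins every HDF5 file (at offset 0; HDF5 also
# permits the signature at offsets 512, 1024, ... when a userblock is
# present, which this sketch ignores for brevity).
HDF5_MAGIC = b'\x89HDF\r\n\x1a\n'

def detect_backend(path):
    """Hypothetical helper: guess the NWB storage backend from the path alone."""
    if os.path.isfile(path):
        with open(path, 'rb') as f:
            if f.read(8) == HDF5_MAGIC:
                return 'hdf5'
    # A Zarr (v2) directory store is a directory containing .zgroup/.zattrs metadata
    if os.path.isdir(path):
        if any(os.path.exists(os.path.join(path, name)) for name in ('.zgroup', '.zattrs')):
            return 'zarr'
    raise ValueError(f'Could not determine NWB backend for {path!r}')

# Demo with stand-in files (neither h5py nor zarr is needed for detection itself)
with tempfile.TemporaryDirectory() as tmp:
    h5_path = os.path.join(tmp, 'file.nwb')
    with open(h5_path, 'wb') as f:
        f.write(HDF5_MAGIC + b'\x00' * 8)

    zarr_path = os.path.join(tmp, 'file.nwb.zarr')
    os.makedirs(zarr_path)
    open(os.path.join(zarr_path, '.zgroup'), 'w').close()

    print(detect_backend(h5_path))    # hdf5
    print(detect_backend(zarr_path))  # zarr
```

An `NWBIO` wrapper could then dispatch to `NWBHDF5IO` or `NWBZarrIO` based on the detected backend.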
- **Dataset configuration on write:** Chunking, compression, filters, shuffling, etc. are specified slightly differently between HDF5/h5py and Zarr. You might consider creating a unifying API layer that translates into the different specs for each backend (@satra was advocating for this); however, there are enough differences in the capabilities and logic of the two approaches that this would not be straightforward. If we do create a unifying language, it would be difficult not to restrict our use of the configuration capabilities of particular backends. For example, I'd like to use WavPack, but that is currently only available with Zarr.
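One way to picture such a unifying layer: a small, backend-neutral config object that translates into backend-specific keyword arguments. This is a hypothetical sketch (`DatasetConfig` and its method names are invented here); h5py's real `create_dataset` keywords are `chunks`, `compression`, `compression_opts`, and `shuffle`, while Zarr bundles compression into a single numcodecs compressor object, represented below as a plain spec dict to avoid the dependency:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class DatasetConfig:
    """Hypothetical backend-neutral dataset configuration."""
    chunks: Optional[Tuple[int, ...]] = None
    compression: Optional[str] = None   # e.g. 'gzip'
    level: int = 4
    shuffle: bool = False

    def to_h5py_kwargs(self) -> dict:
        # h5py: each setting is a separate create_dataset keyword
        kw = {}
        if self.chunks is not None:
            kw['chunks'] = self.chunks
        if self.compression is not None:
            kw['compression'] = self.compression
            kw['compression_opts'] = self.level
        if self.shuffle:
            kw['shuffle'] = True
        return kw

    def to_zarr_spec(self) -> dict:
        # Zarr: compression settings collapse into one codec object.
        # Note the HDF5 shuffle filter has no direct slot here, which is
        # exactly the kind of mismatch a shared vocabulary has to resolve.
        kw = {}
        if self.chunks is not None:
            kw['chunks'] = self.chunks
        if self.compression is not None:
            kw['compressor'] = {'id': self.compression, 'level': self.level}
        return kw

cfg = DatasetConfig(chunks=(100, 64), compression='gzip', level=5, shuffle=True)
print(cfg.to_h5py_kwargs())
print(cfg.to_zarr_spec())
```

A backend-only codec such as WavPack has no place in this shared vocabulary unless the API also offers backend-specific escape hatches, which is the trade-off described above.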
- **Dataset indexing when reading:** When you read datasets from an NWB file, you get an `h5py.Dataset` or a `zarr.Array`. These classes act similarly to an `np.ndarray`, but there are enough differences that they will likely cause bugs in any analysis script that aims to work across backends. For example, this code works in Zarr:
```python
import zarr
import numpy as np

# Create a new Zarr array
zarr_array = zarr.zeros((10, 10))  # Create a 10x10 array of zeros
zarr_array[:] = np.random.rand(10, 10)  # Fill the array with random numbers

# Assume zarr_array is a 2D array, create index arrays
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 3])

# Use multidimensional indexing
subset = zarr_array[rows, cols]

# Print the subset
print(subset)
# [0.43752435 0.57966441 0.86366265]
```

but the following analogous code in h5py does not work:
```python
import h5py
import numpy as np

# Create a new HDF5 file and dataset
file = h5py.File('filename.hdf5', 'w')
data = np.random.rand(10, 10)  # Create a 10x10 array of random numbers
dataset = file.create_dataset('dataset', data=data)

# Assume dataset is a 2D array, create index arrays
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 3])

# Use multidimensional indexing
subset = dataset[rows, cols]

# Print the subset
print(subset)

# Remember to close the file
file.close()
```

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [9], line 10
      7 dataset = file.create_dataset('dataset', data=data)
      9 # Use multidimensional indexing
---> 10 subset = dataset[rows, cols]
     12 # Print the subset
     13 print(subset)

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()
File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()
File ~/opt/miniconda3/lib/python3.9/site-packages/h5py/_hl/dataset.py:814, in Dataset.__getitem__(self, args, new_dtype)
    809     return arr
    811 # === Everything else ===================
    812
    813 # Perform the dataspace selection.
--> 814 selection = sel.select(self.shape, args, dataset=self)
    816 if selection.nselect == 0:
    817     return numpy.zeros(selection.array_shape, dtype=new_dtype)
File ~/opt/miniconda3/lib/python3.9/site-packages/h5py/_hl/selections.py:82, in select(shape, args, dataset)
     79 space = h5s.create_simple(shape)
     80 selector = _selector.Selector(space)
---> 82 return selector.make_selection(args)
File h5py/_selector.pyx:276, in h5py._selector.Selector.make_selection()
File h5py/_selector.pyx:189, in h5py._selector.Selector.apply_args()

TypeError: Only one indexing vector or array is currently allowed for fancy indexing
```

There are many other subtle differences between the indexing of these array-like classes. See the Zarr docs on fancy indexing.
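Until the array classes behave uniformly, downstream code has to normalize indexing itself. A minimal sketch of one workaround, assuming a hypothetical `pick_coords` helper: read coordinate pairs one at a time, which `h5py.Dataset`, `zarr.Array`, and `np.ndarray` all support, at the cost of one read per element. Demonstrated on a plain NumPy array standing in for either backend:

```python
import numpy as np

def pick_coords(dataset, rows, cols):
    """Hypothetical helper: element-wise coordinate selection that behaves the
    same on h5py.Dataset, zarr.Array, and np.ndarray (one read per element,
    so only suitable for small index sets)."""
    return np.array([dataset[r, c] for r, c in zip(rows, cols)])

# Stand-in for a dataset read from either backend
data = np.arange(100).reshape(10, 10)
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 3])

print(pick_coords(data, rows, cols))  # [ 1 12 23]
```

A real abstraction layer would presumably special-case each backend's native fancy indexing for performance; this only shows that a single portable behavior is definable.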
In NWB Widgets, we are starting to run into bugs that are the result of these differences, e.g. NeurodataWithoutBorders/nwbwidgets#283. Without a proper response, I anticipate that these types of issues will accumulate across the cross-section of analysis tools and backends.
Is your feature request related to a problem?
No response
What solution would you like?
There are some pretty big trade-offs here, including homogenization, backwards compatibility, and the implementation complexity of each solution. I think it deserves some discussion.
Do you have any interest in helping implement the feature?
Yes.
Code of Conduct
- I agree to follow this project's Code of Conduct
- Have you checked the Contributing document?
- Have you ensured this change was not already requested?