[Bug]: Reading units table is very slow when expanding electrodes table #237
Comments
@rly here's the issue we discussed!
@alejoe91 Is the file available to share via Dandi or elsewhere? If not, we can make a dummy file.
@alejoe91 I was able to reproduce with my own file:
@alejoe91 Found the source. Exploring ideas to make it quicker.
@alejoe91 So I found an issue with my toy example: the time was being inflated by the spike times, which reveals a separate optimization issue to tackle around resolving vector index info. With that removed, my timings do show that Zarr takes twice as long, but I don't see your 10-minute wait; rather it's 4 seconds (Zarr) vs 2 seconds (HDF5). I assume this is just down to the differences between our files, but Zarr still shouldn't take twice as long. I can continue working on this, but we'd need to test any fix against your file to make sure it's impactful.
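(A minimal sketch of separating the spike-times cost from the electrodes cost when timing `to_dataframe()`, using hdmf's `exclude` option; the file path and the HDF5 backend here are assumptions, not details from the thread:)

```python
import time

from pynwb import NWBHDF5IO

# Hypothetical local file path; substitute a real NWB file.
with NWBHDF5IO("units.nwb", mode="r") as io:
    units = io.read().units

    t0 = time.perf_counter()
    units.to_dataframe()  # resolves spike_times ragged arrays AND electrodes regions
    print(f"full table: {time.perf_counter() - t0:.2f} s")

    t0 = time.perf_counter()
    units.to_dataframe(exclude={"spike_times"})  # isolates the electrodes cost
    print(f"without spike_times: {time.perf_counter() - t0:.2f} s")
```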
@alejoe91 You mention that it "copies the entire dynamic table region into the dataframe." This raises the question of what you would want to see in the to_dataframe() view of the units table. Would you want the electrodes column to just be a column of indices into the electrodes table? Would you want this as an arg in to_dataframe to toggle?
Hi @mavaylon1 Thank you so much for working on this! The new times are perfectly acceptable :) There's already an option not to expand the electrodes, using index=True.
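(A minimal sketch of that option, assuming the `index` flag is the `to_dataframe` keyword mentioned in this thread and a Zarr-backed NWB file at a hypothetical local path; `index=True` leaves region columns such as `electrodes` as integer row indices instead of nested dataframes:)

```python
from hdmf_zarr import NWBZarrIO

with NWBZarrIO("units.nwb.zarr", mode="r") as io:
    units = io.read().units
    # Default (index=False): expands the `electrodes` DynamicTableRegion -- slow.
    # df = units.to_dataframe()
    # index=True: keeps `electrodes` as row indices into the electrodes table -- fast.
    df = units.to_dataframe(index=True)
```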
@alejoe91 Is it possible for me to get access to the nwbfile? Those times aren't new times (sadly). I was saying that the example I created to mimic your issue was being inflated in time by other issues, which I will make a ticket for. I created a new example focused on the electrodes, and those are the times I am getting (800 units and 400 electrodes).
Yep, here's a link to a test file where the issue is quite relevant because it has over 1000 units! It's still uploading, so give it a few minutes!
What happened?
We're mainly using NWB-Zarr and found that reading the units table and rendering it as a dataframe is prohibitively slow. This is mainly due to the electrodes column, which copies the entire dynamic table region into the dataframe. To give some numbers, reading a dataset with ~758 units takes over ~10 minutes. When index=True, reading time goes down to ~6 s.

To investigate this performance issue, we also ran the same tests with the same file saved as HDF5, and here are the results (see steps to reproduce). In general, Zarr is slower, but this could be because everything is compressed by default in Zarr, while no compression is applied in HDF5.
[Barplot: reading time for each column in the units table, HDF5 vs Zarr — attachment: load_times_hdf5_zarr.pdf. The snippet it was obtained with is not preserved in this capture; see the sketch below.]
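(A minimal sketch of how such per-column read times could be collected, with hypothetical file paths; it assumes that slicing `units[col][:]` forces the column, including index/region resolution, to load:)

```python
import time

from hdmf_zarr import NWBZarrIO
from pynwb import NWBHDF5IO


def time_columns(io_cls, path):
    """Time loading each units-table column fully into memory."""
    timings = {}
    with io_cls(path, mode="r") as io:
        units = io.read().units
        for col in units.colnames:
            t0 = time.perf_counter()
            _ = units[col][:]  # loads and resolves the column
            timings[col] = time.perf_counter() - t0
    return timings


print(time_columns(NWBHDF5IO, "units.nwb"))       # hypothetical HDF5 copy
print(time_columns(NWBZarrIO, "units.nwb.zarr"))  # hypothetical Zarr copy
```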
Steps to Reproduce
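(The original reproduction steps are not included in this capture. A minimal sketch, assuming the same file saved in both backends at hypothetical paths and the `index` keyword discussed above:)

```python
import time

from hdmf_zarr import NWBZarrIO
from pynwb import NWBHDF5IO

for io_cls, path in [(NWBHDF5IO, "units.nwb"), (NWBZarrIO, "units.nwb.zarr")]:
    with io_cls(path, mode="r") as io:
        units = io.read().units
        for index in (False, True):
            t0 = time.perf_counter()
            units.to_dataframe(index=index)
            print(f"{io_cls.__name__} index={index}: "
                  f"{time.perf_counter() - t0:.2f} s")
```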
Traceback
Operating System
Linux
Python Executable
Conda
Python Version
3.9
Package Versions
pynwb 2.8.2
hdmf 3.14.5
hdmf_zarr 0.9.0