Filtering Arrow.Primitive is slow #31
Comments
As far as I understand it the data is only being read from disk while you do the filtering, so I would certainly expect it to be much slower than filtering something that is already in memory.
I can't believe I've worked with feather files for six months and was still unaware of that 😦 Where is that documented? Is this what

Another datapoint:

```julia
julia> @btime collect(v)[keep]
  1.893 ms (6 allocations: 2.21 MiB)
```
That might be a feature of the new, recent release, though, I'm not sure.
@davidanthoff is right in that you should not in general expect operations on memory-mapped arrays to be as fast as on normal arrays stored in RAM. When you access a memory-mapped address, it still has to be read from disk (there may be some sort of buffering scheme depending on the OS, I'm fuzzy on the details).

That said, in this particular case I'm slightly puzzled as to why the number of reported allocations is so different. In both cases the resulting array is a copy. Certainly part of the problem is that in the Arrow case all of the pointers need to be computed, whereas this operation may be optimized somehow in the Julia

Anyway, there will always be an advantage to having your data completely loaded into memory; that'll be true no matter what you use. If you want to copy the entire dataframe into memory you can use

I do worry there is some unnecessary allocation happening in Arrow related to the pointers. I'll look into this when I get a chance...
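A minimal sketch of the comparison being discussed (not code from this thread): filtering a memory-mapped `Vector` versus an ordinary in-RAM copy. The file name `demo.bin` and the sizes are hypothetical; `Mmap` is part of the Julia standard library.

```julia
using Mmap

n = 10^6

# Write some data to back the memory map (hypothetical scratch file).
open("demo.bin", "w") do io
    write(io, rand(Float64, n))
end

# A Vector{Float64} whose storage lives on disk via mmap.
mmapped = open(io -> Mmap.mmap(io, Vector{Float64}, n), "demo.bin")

# The same data fully copied into RAM.
inram = collect(mmapped)

keep = rand(Bool, n)   # boolean filter mask

# Both expressions allocate a fresh Vector holding the kept elements;
# the mmapped version may additionally pay for page-ins from disk.
@assert mmapped[keep] == inram[keep]
```

Timing the two `[keep]` expressions (e.g. with `@btime` from BenchmarkTools.jl) is how the numbers earlier in the thread were produced; the in-RAM version avoids any disk access.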
Thank you,
While this package may not be super well optimized right now, I don't really see anything actionable from this issue. I'm sure there is significant room for improvement with strings, but it doesn't seem likely to me that

I'm going to close this issue unless anybody has any objections (I will leave #30 open).
Perhaps that's the same issue as #30... After loading my dataframe from a feather file, one of the columns is an `Arrow.Primitive`. Filtering it is 12 times slower than filtering the equivalent `Vector`. Is that another fundamental bottleneck? It makes me wonder if I shouldn't simply `map(collect, eachcol(df))` after loading.