
Filtering Arrow.Primitive is slow #31

Closed
cstjean opened this issue Jul 10, 2018 · 6 comments

Comments

@cstjean

cstjean commented Jul 10, 2018

Perhaps that's the same issue as #30... After loading my dataframe from a feather file, one of the columns is a

julia> typeof(v)
Arrow.Primitive{Float32}

julia> size(v)
(385638,)

Filtering it is 12 times slower than filtering the equivalent Vector.

julia> v2 = collect(v);

julia> keep = rand(Bool, length(v));

julia> @btime v[keep];
  19.494 ms (580386 allocations: 22.66 MiB)

julia> @btime v2[keep];
  1.686 ms (4 allocations: 755.81 KiB)

Is that another fundamental bottleneck? It makes me wonder if I shouldn't simply map(collect, eachcol(df)) after loading.
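The map(collect, eachcol(df)) workaround mentioned above can be sketched as follows. This is a minimal illustration, assuming DataFrames.jl and using a small stand-in frame in place of one loaded from Feather:

```julia
# Sketch of the workaround: eagerly copy every column into a plain Vector
# after loading, so later filtering hits RAM instead of the mmapped file.
# The DataFrame here is a stand-in for one returned by Feather.read.
using DataFrames

df = DataFrame(x = rand(Float32, 10), y = rand(Float32, 10))

# Replace each column with an in-memory copy; names and order are preserved.
for name in names(df)
    df[!, name] = collect(df[!, name])
end
```

After this loop every column is an ordinary Vector, at the cost of holding the full table in memory.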

@davidanthoff
Contributor

As far as I understand it, the data is only read from disk while you do the filtering, so I would certainly expect it to be much slower than filtering something that is already in memory.

@cstjean
Author

cstjean commented Jul 10, 2018

I can't believe I've worked with feather files for six months and was still unaware of that 😦 Where is that documented? Is this what use_mmap=true does?

Another datapoint:

julia> @btime collect(v)[keep]
  1.893 ms (6 allocations: 2.21 MiB)
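The use_mmap question above could be checked with a sketch like the one below. The keyword name is taken from the comment itself; treat the exact Feather.read signature as an assumption about this version of Feather.jl, and note that the file name and column name are placeholders:

```julia
# Hypothetical comparison: load once memory-mapped and once fully into RAM,
# then time the same logical-mask filter on both.
using Feather, BenchmarkTools

df_mmap = Feather.read("data.feather")                 # columns backed by the mmapped file
df_ram  = Feather.read("data.feather"; use_mmap=false) # assumed flag: force an in-memory copy

keep = rand(Bool, size(df_mmap, 1))
@btime $df_mmap.x[$keep]   # pages read from disk on first touch
@btime $df_ram.x[$keep]    # plain in-memory indexing
```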

@davidanthoff
Contributor

That might be a feature of the recent release, though; I'm not sure.

@ExpandingMan
Owner

ExpandingMan commented Jul 10, 2018

@davidanthoff is right in that you should not in general expect operations on memory mapped arrays to be as fast as on normal arrays stored in RAM. When you access a memory mapped address, it still has to be read from disk (there may be some sort of buffering scheme depending on the OS, I'm fuzzy on the details).

That said, in this particular case I'm slightly puzzled as to why the number of reported allocations is so different. In both cases the resulting array is a copy. Certainly part of the problem is that in the Arrow case all of the pointers need to be computed, whereas this operation may be optimized for Julia's built-in Array, but I'm not sure that fully explains it.

Anyway, there will always be an advantage to having your data completely loaded into memory, that'll be true no matter what you use. If you want to copy the entire dataframe into memory you can use Feather.materialize the same way you would use Feather.read.
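The contrast between the two loading modes described above can be sketched as follows. The file name and column name are placeholders, and the exact element types shown in comments are illustrative:

```julia
# Lazy vs. eager loading of a feather file.
using Feather

df_lazy  = Feather.read("data.feather")        # columns are Arrow views over the mmapped file
df_eager = Feather.materialize("data.feather") # everything copied into ordinary in-RAM arrays

typeof(df_lazy.x)   # e.g. Arrow.Primitive{Float32}
typeof(df_eager.x)  # e.g. Vector{Float32}
```

Feather.materialize trades extra load time and memory for fast repeated downstream operations, which is exactly the trade-off discussed in this thread.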

I do worry there is some unnecessary allocation happening in Arrow related to the pointers. I'll look into this when I get a chance...

@cstjean
Author

cstjean commented Jul 10, 2018

Thank you. Feather.materialize only increases load time by about 2x, but hopefully it will make downstream operations faster.

@ExpandingMan
Owner

While this package may not be super well optimized right now, I don't really see anything actionable from this issue. I'm sure there is significant room for improvement with strings, but it doesn't seem likely to me that Primitive will suddenly become an order of magnitude faster.

I'm going to close this issue unless anybody has any objections (I will leave #30 open).
