
Filtering Arrow.Primitive is slow #31

Closed
cstjean opened this issue Jul 10, 2018 · 6 comments

Comments

@cstjean

cstjean commented Jul 10, 2018

Perhaps that's the same issue as #30... After loading my dataframe from a feather file, one of the columns is a

julia> typeof(v)
Arrow.Primitive{Float32}

julia> size(v)
(385638,)

Filtering it is 12 times slower than filtering the equivalent Vector.

julia> v2 = collect(v);

julia> keep = rand(Bool, length(v));

julia> @btime v[keep];
  19.494 ms (580386 allocations: 22.66 MiB)

julia> @btime v2[keep];
  1.686 ms (4 allocations: 755.81 KiB)

Is that another fundamental bottleneck? It makes me wonder if I shouldn't simply map(collect, eachcol(df)) after loading.
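The map(collect, eachcol(df)) workaround mentioned above can be sketched as follows. This is a minimal illustration, assuming DataFrames.jl and using a small stand-in frame in place of one loaded from Feather:

```julia
# Sketch of the workaround: eagerly copy every column into a plain Vector
# after loading, so later filtering hits RAM instead of the mmapped file.
# The DataFrame here is a stand-in for one returned by Feather.read.
using DataFrames

df = DataFrame(x = rand(Float32, 10), y = rand(Float32, 10))

# Replace each column with an in-memory copy; names and order are preserved.
for name in names(df)
    df[!, name] = collect(df[!, name])
end
```

After this loop every column is an ordinary Vector, at the cost of holding the full table in memory.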

@davidanthoff
Contributor

As far as I understand it, the data is only read from disk while you do the filtering, so I would certainly expect it to be much slower than filtering something that is already in memory.

@cstjean
Author

cstjean commented Jul 10, 2018

I can't believe I've worked with feather files for six months and was still unaware of that 😦 Where is that documented? Is this what use_mmap=true does?

Another datapoint:

julia> @btime collect(v)[keep]
  1.893 ms (6 allocations: 2.21 MiB)
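The use_mmap question above could be checked with a sketch like the one below. The keyword name is taken from the comment itself; treat the exact Feather.read signature as an assumption about this version of Feather.jl, and note that the file name and column name are placeholders:

```julia
# Hypothetical comparison: load once memory-mapped and once fully into RAM,
# then time the same logical-mask filter on both.
using Feather, BenchmarkTools

df_mmap = Feather.read("data.feather")                 # columns backed by the mmapped file
df_ram  = Feather.read("data.feather"; use_mmap=false) # assumed flag: force an in-memory copy

keep = rand(Bool, size(df_mmap, 1))
@btime $df_mmap.x[$keep]   # pages read from disk on first touch
@btime $df_ram.x[$keep]    # plain in-memory indexing
```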

@davidanthoff
Contributor

That might be a feature of the recent release, though; I'm not sure.

@ExpandingMan
Owner

ExpandingMan commented Jul 10, 2018

@davidanthoff is right in that you should not in general expect operations on memory mapped arrays to be as fast as on normal arrays stored in RAM. When you access a memory mapped address, it still has to be read from disk (there may be some sort of buffering scheme depending on the OS, I'm fuzzy on the details).

That said, in this particular case I'm slightly puzzled as to why the number of reported allocations is so different. In both cases the resulting array is a copy. Certainly part of the problem is that in the Arrow case all of the pointers need to be computed, whereas this operation may be optimized for Julia's built-in Array, but I'm not sure that fully explains it.

Anyway, there will always be an advantage to having your data completely loaded into memory, that'll be true no matter what you use. If you want to copy the entire dataframe into memory you can use Feather.materialize the same way you would use Feather.read.
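The contrast between the two loading modes described above can be sketched as follows. The file name and column name are placeholders, and the exact element types shown in comments are illustrative:

```julia
# Lazy vs. eager loading of a feather file.
using Feather

df_lazy  = Feather.read("data.feather")        # columns are Arrow views over the mmapped file
df_eager = Feather.materialize("data.feather") # everything copied into ordinary in-RAM arrays

typeof(df_lazy.x)   # e.g. Arrow.Primitive{Float32}
typeof(df_eager.x)  # e.g. Vector{Float32}
```

Feather.materialize trades extra load time and memory for fast repeated downstream operations, which is exactly the trade-off discussed in this thread.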

I do worry there is some unnecessary allocation happening in Arrow related to the pointers. I'll look into this when I get a chance...

@cstjean
Author

cstjean commented Jul 10, 2018

Thank you. Feather.materialize only increases load time by about 2x, but hopefully it will make downstream operations faster.

@ExpandingMan
Owner

While this package may not be super well optimized right now, I don't really see anything actionable from this issue. I'm sure there is significant room for improvement with strings, but it doesn't seem likely to me that Primitive will suddenly become an order of magnitude faster.

I'm going to close this issue unless anybody has any objections (I will leave #30 open).
