Accessing Arrow.List is rather slow #30

cstjean · 2018-06-29T17:08:19Z

On Julia 0.6:

julia> using BenchmarkTools

julia> t1 = ["A", "A", "A"];

julia> t2 = arrowformat(t1);

julia> @btime $t1[2]
  2.025 ns (0 allocations: 0 bytes)
"A"

julia> @btime $t2[2]
  17.286 ns (1 allocation: 32 bytes)
"A"

I get that it's fundamentally performing more operations, but an 8X factor is pretty steep, and the allocation should not be necessary, right?

cstjean · 2018-06-29T18:25:33Z

For context, this is what I'm trying to do: JuliaData/CategoricalArrays.jl#150. I think Arrow is suffering from the same issue with DictEncoding.

ExpandingMan · 2018-06-29T18:26:02Z

Yup, it really sucks, but as far as I can tell it's pretty close to optimal.

The problem with strings is that a lot needs to happen. The allocation is probably coming from the computation you need to do before calling unsafe_string on the pointer data stored in the arrow format: the offset data tells you where each string ends, whereas unsafe_string needs the start location and the length. As far as I know, there's no way around subtracting to get the length and allocating for that value.
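To make the offset arithmetic concrete, here is a minimal sketch of the lookup described above. It is not the actual Arrow.jl code: `values`, `offsets` and `getstring` are hypothetical names, the layout (one contiguous byte buffer plus `n + 1` zero-based offsets) is assumed from the Arrow format, and `GC.@preserve` assumes Julia 0.7 or later rather than 0.6.

```julia
# Hypothetical stand-in for an Arrow string column holding ["A", "BC", "DEF"]:
# all bytes stored back to back, plus n + 1 zero-based offsets.
values  = collect(codeunits("ABCDEF"))   # Vector{UInt8}
offsets = Int32[0, 1, 3, 6]

# Element i spans bytes offsets[i]+1 through offsets[i+1] (1-based indexing).
function getstring(values::Vector{UInt8}, offsets::Vector{Int32}, i::Integer)
    start = offsets[i]                   # zero-based start of element i
    len   = offsets[i + 1] - start       # the subtraction mentioned above
    # No validation here: a bad offset reads out of bounds.
    GC.@preserve values unsafe_string(pointer(values, start + 1), len)
end

getstring(values, offsets, 2)
```

Benchmarking this with `@btime getstring($values, $offsets, 2)` next to a plain `Vector{String}` lookup would be one way to see where the allocation comes from.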

Actually, one thing I'm really uncomfortable about is that currently no checks are done at all on the string pointers stored in the arrow data. So, if you get a corrupted arrow file, it's possible to get a bogus pointer that reaches past the end of the string data and segfaults. I experimented a fair amount with putting in checks, but they were all so slow I couldn't bear to leave them in.
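For illustration only, a check of the kind discussed here might look like the sketch below. It is an assumption about what "putting in checks" could mean, not the code that was actually tried: `getstring_checked` is a hypothetical name, the column layout is the same assumed one (byte buffer plus zero-based offsets), and `GC.@preserve` assumes Julia 0.7 or later.

```julia
# Hypothetical checked lookup: validate both offsets before touching memory.
function getstring_checked(values::Vector{UInt8}, offsets::Vector{Int32}, i::Integer)
    start = Int(offsets[i])
    stop  = Int(offsets[i + 1])
    # Offsets must be ordered and must stay inside the byte buffer.
    if !(0 <= start <= stop <= length(values))
        throw(ArgumentError("corrupt offsets for element $i: $start to $stop"))
    end
    GC.@preserve values unsafe_string(pointer(values, start + 1), stop - start)
end

values  = collect(codeunits("ABCDEF"))   # bytes of "A", "BC", "DEF"
offsets = Int32[0, 1, 3, 6]
corrupt = Int32[0, 1, 99, 6]             # second element claims to end at byte 99

getstring_checked(values, offsets, 2)    # valid offsets: returns the string
# getstring_checked(values, corrupt, 2)  # throws ArgumentError instead of reading out of bounds
```

Whether the extra comparisons cost as much as the checks described above would need to be measured, for example with `@btime`.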

Help with this is extremely welcome if you know of any ways of improving it (even if that help comes in the form of you telling me what to do rather than doing it yourself). Last time I looked into this, I just plain didn't know how to do any better.

As far as your CategoricalArray issue goes, I think the only way to do it is to explicitly add some methods for the broadcast comparison.

cstjean mentioned this issue Jul 10, 2018

Filtering Arrow.Primitive is slow #31

Closed
