Accessing Arrow.List is rather slow #30

Open
cstjean opened this issue Jun 29, 2018 · 2 comments

cstjean commented Jun 29, 2018

On Julia 0.6:

julia> using Arrow, BenchmarkTools

julia> t1 = ["A", "A", "A"];

julia> t2 = arrowformat(t1);

julia> @btime $t1[2]
  2.025 ns (0 allocations: 0 bytes)
"A"

julia> @btime $t2[2]
  17.286 ns (1 allocation: 32 bytes)
"A"

I get that it's fundamentally performing more operations, but an 8X factor is pretty steep, and the allocation should not be necessary, right?

Author

cstjean commented Jun 29, 2018

For context, this is what I'm trying to do: JuliaData/CategoricalArrays.jl#150. I think Arrow is suffering from the same issue with DictEncoding.

Owner

ExpandingMan commented Jun 29, 2018

Yup, it really sucks but it really does seem to be pretty close to optimal, as far as I can tell.

The problem with strings is that a lot needs to happen. The allocation probably comes from the computation needed to set up the unsafe_string call from the pointer data stored in the Arrow format. The offset data tells you where each string ends, while unsafe_string needs the start location and the length. As far as I know, there's no way around subtracting to get the length and allocating for that value.
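
To make the steps described above concrete, here is a minimal sketch in current Julia (not the 0.6 from the report, and not Arrow.jl's actual implementation). `ToyStringList` and `getstring` are hypothetical names. The layout assumed is a flat byte buffer plus n+1 zero-based offsets, so string i starts at `offsets[i]` and ends at `offsets[i + 1]`. In this sketch the subtraction is plain integer arithmetic; the allocation comes from `unsafe_string`, which copies the bytes into a new `String`.

```julia
# Hypothetical toy type, not Arrow.jl's List: a flat byte buffer plus
# n+1 zero-based offsets into it.
struct ToyStringList
    data::Vector{UInt8}
    offsets::Vector{Int32}
end

# Pack a vector of strings into the buffer/offsets layout.
function ToyStringList(strs::Vector{String})
    data = UInt8[]
    offsets = Int32[0]
    for s in strs
        append!(data, codeunits(s))
        push!(offsets, Int32(length(data)))
    end
    return ToyStringList(data, offsets)
end

# Read string `i`. The offsets are trusted as-is: no validation is done.
function getstring(l::ToyStringList, i::Integer)
    data = l.data
    start = l.offsets[i]
    len = l.offsets[i + 1] - start      # length = end offset - start offset
    # unsafe_string copies `len` bytes into a freshly allocated String.
    GC.@preserve data unsafe_string(pointer(data) + start, len)
end

toy = ToyStringList(["A", "bc", "déf"])
getstring(toy, 2)
```

Running `@btime getstring($toy, 2)` with BenchmarkTools would be one way to see whether the same single allocation shows up for this toy version.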

Actually, one thing that I'm really uncomfortable about is that currently no checks are done at all on the string pointers stored in the Arrow data. So, if you get a corrupted Arrow file, it's possible to get a bogus pointer that points past the end of the string data and causes a segfault. I experimented a fair amount with putting in checks, but they were all so slow I couldn't bear to leave them in.
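
For illustration, a check of the kind described above might look like the following sketch in current Julia. `checked_getstring` is a hypothetical helper, not part of Arrow.jl, and it assumes the same layout of a flat byte buffer with n+1 zero-based offsets. It says nothing about cost: whether the extra comparisons are affordable would have to be measured.

```julia
# Hypothetical helper: validate the two offsets before touching memory.
function checked_getstring(data::Vector{UInt8}, offsets::Vector{Int32}, i::Integer)
    start = offsets[i]
    stop = offsets[i + 1]
    # Reject offsets that are negative, decreasing, or past the end of the buffer.
    if !(0 <= start <= stop <= length(data))
        throw(BoundsError(data, (start + 1):stop))
    end
    GC.@preserve data unsafe_string(pointer(data) + start, stop - start)
end

data = Vector{UInt8}("Abcdef")
good = Int32[0, 1, 3, 6]
bad = Int32[0, 1, 3, 60]    # simulated corruption: last offset overruns the buffer

checked_getstring(data, good, 3)    # reads bytes 4..6 of the buffer

# With the corrupted offsets the call throws instead of reading out of bounds.
threw = try
    checked_getstring(data, bad, 3)
    false
catch err
    err isa BoundsError
end
```

An alternative design would validate all offsets once when the buffers are loaded, so that per-element access could stay unchecked.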

Help with this is extremely welcome if you know of any ways of improving it (even if that help comes in the form of you telling me what to do rather than doing it yourself). Last time I looked into this, I just plain didn't know how to do any better.

As far as your CategoricalArray issue goes, I think the only way to do it is to explicitly add some methods for the broadcast comparison.
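
As a rough idea of what such a method could look like, here is a sketch in current Julia (the broadcast machinery on 0.6 was different). `ToyDictVector` is a hypothetical dictionary-encoded vector, not the Arrow.jl or CategoricalArrays.jl type, and hooking `Base.Broadcast.broadcasted` for `==` is only one possible approach. It assumes the pool holds unique values and ignores missing data.

```julia
# Hypothetical dictionary-encoded vector: `refs[i]` is an index into `pool`.
struct ToyDictVector <: AbstractVector{String}
    refs::Vector{Int32}
    pool::Vector{String}
end

Base.size(v::ToyDictVector) = size(v.refs)
Base.getindex(v::ToyDictVector, i::Int) = v.pool[v.refs[i]]

# `v .== x` lowers to a call to `broadcasted(==, v, x)`. This method looks `x`
# up in the pool once, then compares integer codes instead of decoding strings.
function Base.Broadcast.broadcasted(::typeof(==), v::ToyDictVector, x::AbstractString)
    code = findfirst(isequal(x), v.pool)
    code === nothing && return falses(length(v))   # value not in the pool at all
    return v.refs .== code
end

v = ToyDictVector(Int32[1, 2, 1], ["A", "B"])
v .== "A"
```

The point of the design is that the comparison touches each string in the pool at most once, rather than materializing a `String` per element.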
