Metadata and IPC part of the spec #4

Open
davidanthoff opened this issue Feb 17, 2018 · 35 comments

Comments

@davidanthoff
Contributor

I might just have missed it, but right now this package doesn't cover https://arrow.apache.org/docs/metadata.html and https://arrow.apache.org/docs/ipc.html, right?

Aren't those two parts the stuff you described as lacking in the arrow spec?

For example, to fully interop with the javascript side as described in https://github.com/apache/arrow/tree/master/js, wouldn't we require those parts as well?

@ExpandingMan
Owner

ExpandingMan commented Feb 17, 2018

You're right that I don't yet cover the first part of this, about schemas, nor have I worked on any inter-process communication stuff. I do, however, support the Arrow date-time data types (there is no section on them in the README yet).

I have to admit to being somewhat confused by this page. What I was referring to in the other thread was that these schemas don't seem to specify an explicit way of communicating all of the pointers to data; they mainly seem concerned with giving some sort of summary metadata (am I wrong?). For an example of what I'm talking about, the Feather metadata format seems to have been completely pulled out of thin air and doesn't seem related to these pages much at all, but what it describes is exactly the sort of thing you'd need to actually pull data out of a buffer.

I'm pretty sure that the stuff that appears in the documents you linked can simply be appended, and that there's nothing about the existing structure of Arrow.jl that would preclude this type of use case.

@davidanthoff
Contributor Author

I think the RecordBatch and then the Buffer are essentially the pointers to the data? But yes, that whole write-up is not very clear...

I think the Feather format is essentially a different metadata format relative to the Arrow metadata format?

@ExpandingMan
Owner

Perhaps, but I don't find the metadata documentation clear at all.

Yes, it does seem that the Feather metadata is just its own thing. I remain confused as to whether this is because Arrow doesn't specify it or Feather is just being difficult.

@davidanthoff
Contributor Author

Maybe one can figure it out by looking at existing implementations? The TypeScript one is probably easy to digest.

@randyzwitch

I'm going to try and figure this out, as I need the Schema type for a work project.

@randyzwitch

For the schema, from my reading it seems like you receive a pointer/size as your response, with the schema taking the shape of:

table Schema {

  /// endianness of the buffer
  /// it is Little Endian by default
  /// if endianness doesn't match the underlying system then the vectors need to be converted
  endianness: Endianness=Little;

  fields: [Field];
  // User-defined metadata
  custom_metadata: [ KeyValue ];
}

Unfortunately, right now I don't quite see how to use FlatBuffers.jl to read this, though I'm waiting for some help from internal (C++) folks who can confirm that the schema they are passing me is in fact the above.
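
To make the "pointer/size" idea concrete, here is a minimal sketch of how a message is framed in the Arrow stream format, as far as the IPC documentation describes it: a 4-byte little-endian length, followed by that many bytes of flatbuffer metadata. Newer Arrow versions additionally place a 0xFFFFFFFF continuation marker in front of the length. The function name `read_metadata_bytes` is made up for illustration and is not part of Arrow.jl or FlatBuffers.jl; it only extracts the raw metadata bytes and does not decode the Schema table itself.

```julia
# Hypothetical helper (an assumption for illustration, not a library function):
# read one length-prefixed metadata block from `io`.
function read_metadata_bytes(io::IO)
    len = ltoh(read(io, Int32))        # lengths are stored little-endian
    if len == -1                       # 0xFFFFFFFF continuation marker (newer streams)
        len = ltoh(read(io, Int32))    # the real length follows the marker
    end
    len == 0 && return UInt8[]         # a zero length signals end of stream
    return read(io, Int(len))          # raw flatbuffer bytes, still undecoded
end

# Build a tiny fake message: length 3, then three placeholder metadata bytes.
io = IOBuffer()
write(io, htol(Int32(3)))
write(io, UInt8[0x01, 0x02, 0x03])
seekstart(io)
meta = read_metadata_bytes(io)
```

The returned bytes are what a flatbuffer reader would then interpret as a Message containing the Schema shown above.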

@ExpandingMan
Owner

For an example of how to get FlatBuffers.jl to read it, look at Feather.jl.

@randyzwitch

@ExpandingMan Unfortunately, Feather.jl doesn't quite help. I was hoping I could get away with this:

mutable struct Schema
    endianess::String
    fields::AbstractVector
    custom_metadata::String
end

julia> FlatBuffers.readbuffer(schema, 1, Schema)
ERROR: MethodError: Cannot `convert` an object of type Type{Schema} to an object of type Array{UInt8,1}
This may have arisen from a call to the constructor Array{UInt8,1}(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] read(::Base.AbstractIOBuffer{SubArray{UInt8,1,Array{UInt8,1},Tuple{UnitRange{Int64}},true}}, ::Type{T} where T) at ./io.jl:528
 [2] readbuffer(::Array{UInt8,1}, ::Int64, ::Type{Schema}) at /Users/randyzwitch/.julia/v0.6/FlatBuffers/src/internals.jl:9
 [3] macro expansion at /Users/randyzwitch/.julia/v0.6/Atom/src/repl.jl:118 [inlined]
 [4] anonymous at ./<missing>:?

I couldn't.

@ExpandingMan
Owner

I don't think that's the right function to call; you are probably looking for FlatBuffers.read. You'll probably need to check out the FlatBuffers.jl documentation and the original flatbuffers documentation linked within. It would probably take a bit of time to look through Feather.jl and really understand what it's doing, but it's a pretty complete example of what you could do with Arrow.jl and fortunately isn't very much code.

@gabomgp

gabomgp commented Sep 10, 2018

@randyzwitch I'm using a custom format based on the Arrow Stream IPC format as the payload in some REST services. The backend is Python and the frontend is Angular. We need to integrate Julia into some microservices, and I really would like to use Arrow as the IPC format between Julia and Python too. So, maybe you can point me in the right direction for reading and writing tables using the Arrow IPC format...?

@randyzwitch

@gabomgp I haven't had time to build this in Julia. Most likely, I will be wrapping Arrow c_glib in the near term, then figuring out if it's worth the time for me to do it in native Julia.

The only references I've been working from are the Arrow documentation site and the Arrow developer email list.

@ExpandingMan
Owner

Unfortunately, when I wrote this package I had a perspective which was rather skewed by the goal of using the Feather format which, for some reason, seemingly makes a bare-minimum effort to provide Arrow-compatible data (it uses a metadata format that seemingly has nothing to do with Arrow).

A better name for this package likely would have been ArrowArrays.jl.

@randyzwitch

I don't think it's the case that this package isn't really "Arrow", just that there has been some drift. I do hope to get to the point where I can provide the IPC code back into this package, just that I really need to show some progress soon and wrapping the 3 C++ functions that I need is probably the quickest way forward!

@zhouyan

zhouyan commented Jun 1, 2019

Is there any plan or interest in making this package fully compatible with the main Arrow project in the IPC area, in particular reaching at least feature parity with pyarrow? @ExpandingMan If you think it is a worthy goal, I would like to take a shot at it unless some work is already under way.

It will take me some time, even months, before I can contribute back a workable PR, but in the end this is what I would like to achieve: support for Message/RecordBatch and a bare-minimum Table type that facilitates such IPC transfer, which may not be feature rich but is easily convertible to other data types such as a DataFrame for more user-friendly use cases.

The Feather format has its limitations; most of all, it doesn't really support all the data types Arrow supports. I have had some good experience using Arrow as an exchange format for data between R/Julia/Python/C++. However, for now I am limited to using only the data types supported by Feather.

@zhouyan

zhouyan commented Jun 1, 2019

https://github.com/zhouyan/Arrow.jl/tree/feature/ipc

I created a simple proof of concept for reading a record batch stream last night. Datetime types can be added easily, and List can be supported with only a little effort as well. The test case binary file RecordBatchStream.out is generated from C++. A lot of work needs to be done before it is at least a proper prototype, though.

@ExpandingMan
Owner

I had started work on this in this branch. I made good progress, but I got derailed from it, for various reasons. I still intend to work on it, but I can't say when that might be.

@zhouyan

zhouyan commented Jun 3, 2019

How about I work on it and you review the results when you have time?

@ExpandingMan
Owner

I'm not really interested in taking responsibility for maintenance of a full-blown C++ wrapper. However, if you create a full working C++ wrapper package, I would be glad to de-register this Arrow.jl so that you can have the name and be a "standard" Julia arrow package if you so wish. I'd even help with moving Feather.jl over to that package.

@zhouyan

zhouyan commented Jun 3, 2019

I am not interested in a C++ wrapper; I am interested in a pure Julia implementation. The reason I generated the test file with C++ is that I haven't implemented the write part yet. Besides, I think it is important to test cross-language compatibility: the Julia implementation should be able to read record batches written by other implementations, and vice versa.

@ExpandingMan
Owner

The reason I'm hesitating to embrace this is that I sort of feel that this has to be rebuilt from the ground up, like you see in my new branch. If you were to do that, it'd make more sense for it to be your package. That'd be great, I'd be very happy if you or anyone else came along with a full working package, I'm just not sure what my involvement is at that point.

I'd say just go ahead and do what you wanted to do. Again, if you have a full working version, we can make sure that Arrow.jl is registered to that. I can probably help out to some extent, but being busy with other things, combined with waning interest in the arrow spec, has made it hard for me to get motivated about this lately. That could change if I suddenly found myself with lots of use for it.

@zhouyan

zhouyan commented Jun 3, 2019 via email

@ExpandingMan
Owner

ExpandingMan commented Jun 3, 2019

Sure, take whatever you want. That's the reason it's open source with a permissive license.

By the way, I suggest you check out my new branch and try to stick closer to that than the old stuff. The new branch is a lot more well thought out.

@zhouyan

zhouyan commented Jun 3, 2019

Sure, thanks. I will probably get around to it sometime next month, after some work stuff is sorted out. Meanwhile I will study the new branch and the existing code to plan the structure and design.

@davidanthoff
Contributor Author

de-register this Arrow.jl

if you have a full working version, we can make sure that Arrow.jl is registered to that

I don't think these are options, even with the new package manager. I think there are two potential ways forward: a) a new package with a new name, or b) a PR that just changes the content of this package here.

@ExpandingMan
Owner

I'm not sure whether it's possible, but please don't panic. I did not mean to imply that this was imminent by any means. If this does happen, I'm sure it would be a good while before it does, and I will make sure it is done carefully.

That said, this has piqued my interest a bit and I may go back to this this week.

If I can at least get a subset of it up and running, I think it would be far easier for anyone else to complete it than if I left it to its current state.

@zhouyan

zhouyan commented Jun 3, 2019 via email

@randyzwitch

randyzwitch commented Jun 3, 2019 via email

@ExpandingMan
Owner

I have to say that all this interest is piquing my interest. I'm at least going to get a working minimal prototype out of my new branch. I don't know if I'll finish it, but once I get the metadata reading in place in a robust way, there'll at least be a minimal working version. Stay tuned.

@zhouyan

zhouyan commented Jun 3, 2019 via email

@ExpandingMan
Owner

Ideally the ArrowVectors should be views into data which are created by reading the metadata. This way, the overhead of creating them should be the same as the overhead of creating the metadata plus a few small allocations, and accessing them should be basically free (there's usually very little to allocate in the way of metadata). That way you could also do no-copy memory mapping. Again, my new branch is much cleaner.
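
As a rough illustration of the "views into data" idea, the sketch below builds a typed, no-copy view over a byte buffer from an offset and a length of the kind the metadata would supply. The helper name `primitive_view` and the example numbers are assumptions made for illustration; this is not how Arrow.jl or the new branch is actually implemented. It also assumes a little-endian host, since Arrow data is little-endian by default.

```julia
# Hypothetical helper, not part of Arrow.jl: expose `len` values of type `T`
# that start `offset` bytes into `buffer`, without copying them.
function primitive_view(::Type{T}, buffer::AbstractVector{UInt8},
                        offset::Integer, len::Integer) where {T}
    first_byte = offset + 1                 # metadata offsets are zero-based
    last_byte = offset + len * sizeof(T)
    return reinterpret(T, view(buffer, first_byte:last_byte))
end

buffer = zeros(UInt8, 24)
# Pretend the metadata said: three Int32 values start at byte offset 8.
col = primitive_view(Int32, buffer, 8, 3)
col[2] = Int32(7)   # writes through to `buffer`, so nothing was copied
```

Because the view only needs an `AbstractVector{UInt8}`, the same call should work on a memory-mapped file (for example the byte array returned by `Mmap.mmap`), which is what makes no-copy reading possible.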

@zhouyan

zhouyan commented Jun 4, 2019 via email

@ExpandingMan
Owner

I found that avoiding the distinction between e.g. Primitive and NullablePrimitive is more effort than it's worth. I also found that one really should avoid conflating the "logical" memory types with the "base" data types. This was one of my major mistakes the first time around that caused me to decide to re-implement the whole thing. You'll notice, for example, that my new List objects would now have element types of Vector{UInt8} rather than String. This is quite deliberate, even though it will require another layer of wrappers.
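
A minimal sketch of that separation, assuming the usual Arrow list layout (an offsets array with one more entry than there are elements, plus a flat values buffer): the base layer only knows about bytes, and a thin wrapper adds the String interpretation. The type names `ByteList` and `Utf8List` are invented for this example and are not the types in the new branch. For brevity the base layer copies each element rather than returning a view.

```julia
# Base layer: a list of raw byte runs described by Arrow-style offsets.
# `offsets` has one more entry than the list has elements.
struct ByteList <: AbstractVector{Vector{UInt8}}
    offsets::Vector{Int32}
    values::Vector{UInt8}
end
Base.size(l::ByteList) = (length(l.offsets) - 1,)
# Offsets are zero-based, so element i spans offsets[i]+1 through offsets[i+1].
Base.getindex(l::ByteList, i::Int) = l.values[(l.offsets[i] + 1):l.offsets[i + 1]]

# Logical layer: the same elements interpreted as UTF-8 strings.
struct Utf8List <: AbstractVector{String}
    bytes::ByteList
end
Base.size(s::Utf8List) = size(s.bytes)
Base.getindex(s::Utf8List, i::Int) = String(s.bytes[i])

# Three elements: "hi", an empty string, and "abc".
raw = ByteList(Int32[0, 2, 2, 5], Vector{UInt8}("hiabc"))
strs = Utf8List(raw)
```

Keeping the two layers apart means the byte-level list can be reused for binary data or nested lists, while only the wrapper needs to know about text.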

Of course, how you do it is entirely up to you, but I did learn a few lessons the first time around on this.

Anyway, are you on the Julia slack? It would probably be better to discuss these things on there.

@zhouyan

zhouyan commented Jun 4, 2019

Not yet, but I can join the Julia slack if someone sends an invite or link.
Sure, I will probably learn the same lesson after a few tries. I come from working on the C++ implementation. At first I hated how it handled the type system, but over time I started to see some of the reasons behind it.

@ExpandingMan
Owner

See here for slack invites if interested.

@zhouyan

zhouyan commented Jun 4, 2019

thanks, just joined.
