Metadata and IPC part of the spec #4
Comments
You're right that I don't yet cover the first part of this, about schemas, nor have I worked on any inter-process communication stuff. I do, however, support the Arrow date-time data types (I don't have a section in the README on that yet). I have to admit to being somewhat confused by this page. What I was referring to in the other thread was that these schemas don't seem to specify an explicit way of communicating all of the pointers to data; they mainly seem concerned with giving some sort of summary metadata (am I wrong?). For an example of what I'm talking about, the Feather metadata format seems to have been pulled completely out of thin air and doesn't seem related to these pages much at all, but what it describes is exactly the sort of thing you'd need to actually pull data out of a buffer. I'm pretty sure that the stuff that appears in the documents you linked can simply be appended, and that there's nothing about the existing structure of Arrow.jl that would preclude this type of use case. |
I think the Feather format essentially uses a different metadata format from the Arrow metadata format? |
Perhaps, but I don't find the metadata documentation clear at all. Yes, it does seem that the Feather metadata is just its own thing. I remain confused as to whether this is because Arrow doesn't specify it or Feather is just being difficult. |
Maybe one can figure it out by looking at existing implementations? The TypeScript one is probably easy to digest. |
I'm going to try and figure this out, as I need the |
For the schema, from my reading it seems like you receive a pointer/size as your response, with the schema taking the shape of:
Unfortunately right now, I don't quite see how to use FlatBuffers.jl to read this, though I'm waiting for some help from internal (C++) folks who can confirm that the schema they are passing me is in fact the above |
For an example of how to get FlatBuffers.jl to read it, look at Feather.jl. |
@ExpandingMan Unfortunately, Feather.jl doesn't quite help. I was hoping I could get away with this:
I couldn't. |
I don't think that's the right function to call, you are probably looking for |
@randyzwitch I'm using a custom format based on the Arrow stream IPC format as the payload in some REST services. The backend is Python and the frontend is Angular. We need to integrate Julia into some microservices, and I would really like to use Arrow as the IPC format between Julia and Python too. So, maybe you can point me in the right direction for reading and writing tables using the Arrow IPC format...? |
@gabomgp I haven't had time to build this in Julia. Most likely, I will be wrapping Arrow c_glib in the near term, then figuring out if it's worth the time for me to do it in native Julia. I've been working only from the Arrow documentation site and from reading the Arrow developer mailing list. |
Unfortunately, when I wrote this package I had a perspective that was rather skewed by the goal of using the Feather format, which, for some reason, seemingly makes a bare-minimum effort to provide Arrow-compatible data (it uses a metadata format that seemingly has nothing to do with Arrow). A better name for this package likely would have been |
I don't think it's the case that this package isn't really "Arrow", just that there has been some drift. I do hope to get to the point where I can provide the IPC code back into this package, just that I really need to show some progress soon and wrapping the 3 C++ functions that I need is probably the quickest way forward! |
Is there any plan or interest in making this package fully compatible with the main Arrow project in the IPC area, in particular reaching at least feature parity with pyarrow? @ExpandingMan If you think it is a worthy goal, I would like to take a shot at it unless some work is already under way. It will take me some time, maybe even months, before I can contribute back a workable PR, but in the end this is what I would like to achieve: support for Message/RecordBatch and a bare-minimum Table type that facilitates such IPC transfer, which may not be feature-rich but is easily convertible to other data types such as a DataFrame for more user-friendly use cases. The Feather format has its limitations; most of all, it doesn't support all the data types Arrow supports. I have had some good experience using Arrow as an exchange format for data between R/Julia/Python/C++. However, for now I am limited to using only the data types supported by Feather. |
https://github.com/zhouyan/Arrow.jl/tree/feature/ipc I created a simple proof of concept for reading a record batch stream last night. Datetime types can easily be added with little effort, and List can be supported with only a little effort as well. The test case binary file RecordBatchStream.out is generated from C++. A lot of work needs to be done before it is at least a proper prototype, though. |
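For readers unfamiliar with what reading a record batch stream involves at the byte level, here is a minimal sketch of the message framing only. It assumes the encapsulated message layout described in the Arrow columnar format documentation: an optional 32-bit continuation marker `0xFFFFFFFF`, a little-endian 32-bit metadata length, then the Flatbuffers-encoded metadata (padding included in the length). The helper name `read_message_metadata` is hypothetical and is not taken from Arrow.jl or from the branch linked above. Older streams omit the continuation marker, so the sketch accepts both forms.

```julia
# Hypothetical helper, not part of Arrow.jl: return the metadata bytes of the
# next encapsulated message in `io`, or `nothing` at the end of the stream.
const CONTINUATION = 0xffffffff   # UInt32 marker used by newer streams

function read_message_metadata(io::IO)
    eof(io) && return nothing               # stream closed without an end marker
    prefix = ltoh(read(io, UInt32))         # prefixes are little-endian
    # Newer streams: marker, then length. Older streams: length only.
    len = prefix == CONTINUATION ? ltoh(read(io, Int32)) : reinterpret(Int32, prefix)
    len == 0 && return nothing              # a zero length marks end of stream
    return read(io, Int(len))               # Flatbuffers Message bytes, padding included
end
```

The message body follows the metadata, and its length is recorded inside the Flatbuffers `Message` itself, so stepping through a whole stream additionally needs a Flatbuffers reader, which this sketch does not attempt.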
I had started work on this in this branch. I made good progress, but I got derailed from it, for various reasons. I still intend to work on it, but I can't say when that might be. |
How about I work on it and you can review the results when you have time? |
I'm not really interested in taking responsibility for maintenance of a full-blown C++ wrapper. However, if you create a full working C++ wrapper package, I would be glad to de-register this Arrow.jl so that you can have the name and be a "standard" Julia arrow package if you so wish. I'd even help with moving Feather.jl over to that package. |
I am not interested in a C++ wrapper, I am interested in a pure Julia implementation. The reason I generated the test file with C++ is because I haven't implemented the |
The reason I'm hesitating to embrace this is that I sort of feel that this has to be rebuilt from the ground up, like you see in my new branch. If you were to do that, it'd make more sense for it to be your package. That'd be great, I'd be very happy if you or anyone else came along with a full working package, I'm just not sure what my involvement is at that point. I'd say just go ahead and do what you wanted to do. Again, if you have a full working version, we can make sure that Arrow.jl is registered to that. I can probably help out to some extent, but being busy with other things, combined with waning interest in the Arrow spec, has made it hard for me to get that motivated with this lately. That could change if I suddenly found myself with lots of use for it. |
Sure, no problem. But I will likely need to borrow a lot of logic already in this package, maybe even some code, if I am going to start a new package from the ground up. Of course, credit will be given where it’s due, but I don’t want to reinvent the wheel either. If that’s OK with you, then I will be glad to start a new package with a full implementation of Arrow in mind and take responsibility for it once it’s done.
|
Sure, take whatever you want. That's the reason it's open source with a permissive license. By the way, I suggest you check out my new branch and try to stick closer to that than the old stuff. The new branch is a lot more well thought out. |
Sure, thanks. I will probably get around to it sometime next month, after some work stuff is sorted out. Meanwhile I will study the new branch and the existing code to plan the structure and design. |
I don't think these are options, even with the new package manager. I think there are two potential ways forward: a) a new package with a new name, or b) a PR that just changes the content of this package here. |
I'm not sure whether it's possible, but please don't panic. I did not mean to imply that this was imminent by any means, if this does happen I'm sure it would be a good while before it does and I will make sure it is done carefully. That said, this has piqued my interest a bit and I may go back to this this week. If I can at least get a subset of it up and running, I think it would be far easier for anyone else to complete it than if I left it to its current state. |
Let’s not worry much about the registry for now. It will be quite a while before I can even start working on it at full speed. When it’s done, we can then assess whether it is suitable as a replacement or a separate package, or maybe it is worthless junk.
|
When you get your code back @ExpandingMan, I’ve got a real-world use case to test IPC on and would be happy to do so
|
I have to say that all this interest is piquing my interest. I'm at least going to get a working minimal prototype out of my new branch. I don't know if I'll finish it, but once I get the metadata read in in a robust way, there'll at least be a minimal working version. Stay tuned. |
You can have a look at my branch, which reads the Metadata and the Message body into Primitive:
https://github.com/zhouyan/ArrowFork.jl/tree/feature/ipc
I haven’t worked on it since the first couple of commits a few nights ago. I got stuck with some work stuff and haven’t gotten around to the rest of it yet. After the talk of the new package, I am playing around with an alternative implementation design. It makes a small sacrifice in performance if one wants to access the ArrowVector directly, but it is easier to serialize and deserialize. The goal is to get Arrow data in and out of native Julia structures fast (e.g., often from one of the IPC formats). My main reasoning is that working with Arrow data may incur a cost on the scale of a `memcpy` to get it into native structures, but the overall performance should be better because of all the optimized code that has been written around the native structures. Essentially, this makes ArrowVector a midway structure between Julia and the IPC format.
|
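The trade-off described in the comment above, between reading values through a view of the message body and paying a one-off copy into a native vector, can be sketched roughly as follows. This is only an illustration using Base functions; the `body` buffer is a made-up stand-in for a message body, and the layout (8 unrelated bytes, then three Int32 values) is an assumption chosen for the example. It also assumes a little-endian host, which matches Arrow's default byte order.

```julia
# Made-up stand-in for a message body: 8 unrelated bytes followed by three
# Int32 values. In practice the bytes might come from `read` or a memory map.
body = vcat(zeros(UInt8, 8), collect(reinterpret(UInt8, Int32[10, 20, 30])))

# Zero-copy: look at bytes 9 to 20 as Int32 values without moving any data.
as_view = reinterpret(Int32, view(body, 9:20))

# One-off copy (roughly the cost of a memcpy) into an ordinary Julia vector.
as_vector = collect(as_view)

# Changing the underlying buffer is visible through the view, not the copy.
body[9] = 0x0b
```

After the last line, `as_view[1]` reflects the changed byte while `as_vector[1]` keeps the value it was copied with, which is the difference between the two access styles being weighed here.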
Ideally the ArrowVectors should be views into data which are created by reading the metadata. This way, the overhead of creating them should be the same as the overhead of creating the metadata plus a few small allocations, and accessing them should be basically free (there's usually very little to allocate in the way of metadata). That way you could also do no-copy memory mapping. Again, my new branch is much cleaner. |
Yeah, that’s the idea.
Buffers are only allocated when an ArrowVector is created from Julia types to be serialized (and not always even then). When reading the IPC format, it holds the metadata and a reference to the message body as a view. The `memcpy` cost is incurred when it is accessed to be converted to, say, `Vector{T}`. That cost can technically be avoided, but sometimes it is better to pay the price once instead of in a lot of small places. A simple example: for a nullable array, indexing into the array means bitwise operations on the null_bitmap, while working with Vector{Union{Missing,T}} is more convenient and sometimes yields better performance. Of course, for a non-nullable array, accessing the raw buffer is as fast as accessing a Vector{T}, minus the cost of the copy. However, I would like to avoid the distinction between, say, NullablePrimitive and Primitive, or List, etc., and have just one unified parametric type ArrowVector{T}, where the parameter T (= Int32, List{Int32}, …) affects how the buffers are accessed via method dispatch, while the member data are the same regardless of the type (similar to `ArrayData` in the reference C++ implementation). So we pay a small space cost on the stack for, say, Primitive types, but get a much cleaner structure for IPC.
|
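To make the design under discussion concrete, here is a rough sketch of a single parametric container whose fields are the same for every element type, with a validity bitmap consulted on each access and a one-off conversion to `Vector{Union{Missing,T}}`. Everything in it is an assumption for illustration: the field names, `isvalidat`, and `materialize` are hypothetical and are not taken from Arrow.jl or either branch. It relies only on the Arrow convention that validity bits are stored least-significant bit first, with a set bit meaning the value is present, and it covers primitive element types only.

```julia
# Hypothetical unified container: identical fields for every element type;
# only the type parameter T steers how the bytes are interpreted.
struct ArrowVector{T} <: AbstractVector{Union{Missing,T}}
    len::Int
    validity::Vector{UInt8}   # one bit per element, LSB first; empty means no nulls
    values::Vector{UInt8}     # raw value bytes
end

Base.size(v::ArrowVector) = (v.len,)

# Element i is valid when bit (i - 1) of the bitmap is set.
isvalidat(v::ArrowVector, i::Integer) =
    isempty(v.validity) ||
    ((v.validity[((i - 1) >> 3) + 1] >> ((i - 1) & 7)) & 0x01) == 0x01

# Access for primitive element types: every lookup pays for the bit test.
function Base.getindex(v::ArrowVector{T}, i::Int) where {T}
    @boundscheck checkbounds(v, i)
    isvalidat(v, i) || return missing
    return reinterpret(T, v.values)[i]
end

# Pay the copy once and get a native Julia vector back.
materialize(v::ArrowVector{T}) where {T} = collect(Union{Missing,T}, v)

# Three Int32 slots; bitmap 0b00000101 marks elements 1 and 3 as valid.
v = ArrowVector{Int32}(3, UInt8[0b00000101], collect(reinterpret(UInt8, Int32[7, 0, 9])))
native = materialize(v)
```

A nested type such as a list would need its own `getindex` method dispatching on `T` (for example to follow an offsets buffer), which is where the single-struct design and the separate-types design start to differ in practice.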
I found that avoiding the distinction between e.g. Of course, how you do it is entirely up to you, but I did learn a few lessons the first time around on this. Anyway, are you on the Julia slack? It would probably be better to discuss these things on there. |
Not yet, but I can join the Julia Slack if someone sends an invite or link. |
See here for slack invites if interested. |
thanks, just joined. |
I might just have missed it, but right now this package doesn't cover https://arrow.apache.org/docs/metadata.html and https://arrow.apache.org/docs/ipc.html, right?
Aren't those two parts the stuff you described as lacking in the arrow spec?
For example, to fully interoperate with the JavaScript side as described in https://github.com/apache/arrow/tree/master/js, wouldn't we require those parts as well?