Expose the ability to have zero allocation sends. #4802
@bluca Thoughts?
include/zmq.h
Outdated
#else
    unsigned char _[64];
#endif
} zmq_msg_content_t;
Why are you adding a new type that is identical to the existing one? Just reuse that?
Hey @bluca, thanks a bunch for looking at this.
I just changed zmq_msg_content_t to be a typedef alias of zmq_msg_t.
I wanted them to be different types because conceptually they are: they happen to share the same size and alignment requirements, but they are used differently internally. One is a block for a ZeroMQ message to be built in, while the other is for a control block. I did not want to get confused about which types go where or what they are used for.
Hence the desire for at least differently named types (even if they are just aliases now).
(I tested the change on our code base, and there are no issues, as one would expect, though I am not sure what is going on with the macOS CI here...)
The Objective:
At a high level, I want to use ZeroMQ messages to send data as fast as possible from a sending thread. However, as I have started going through the optimization process with perf on x64 Linux, about 25% of my steady-state time is spent in malloc calls allocating the control blocks for the reference-counted long messages.
This pull request is a request to expose a public API to the zero-copy long message type type_zclmsg, which is used internally on the receive side.
References:
As part of this process I have looked at previous issues that reference this topic, trying not to stomp on things and to get a better understanding of the concerns (if I have missed some, it would be great to know):
Most of the discussion seems to be in:
#2795
Though this PR also solves the issue here:
#4343
Changes:
The primary change is to expose the internal function init_external_storage and allow users to pass in a pointer to a preallocated memory block in which the init function constructs the content_t control block. The method/struct we want to expose is below:
In order to not expose private implementation details, and to allow for future modification of the internals, we round up the control block size from ~40 to 64 bytes and expose the larger structure to users in the draft API.
Thus, users just allocate a 64-byte control block and pass it to zmq. They are then responsible for the lifetime of the block (in most use cases, it will probably be handled in the zmq_free_fn *).
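A minimal sketch of the sizing idea above, not the actual libzmq code: the names example_content_storage_t and example_control_block_t are hypothetical, and the fields of the stand-in control block are assumptions chosen only to land near the ~40 bytes mentioned. The compile-time checks show how an opaque 64-byte block can be guaranteed to hold the private structure while leaving headroom for internal changes.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical public type: 64 opaque bytes that the caller allocates.
struct alignas(8) example_content_storage_t {
    unsigned char _[64];
};

// Hypothetical stand-in for the library's private control block.
// The real content_t layout is an internal detail and may differ.
struct example_control_block_t {
    void *data;
    std::size_t size;
    void (*free_fn)(void *data, void *hint);
    void *hint;
    std::atomic<int> refcount;
};

// The library side can verify at compile time that the private
// structure fits inside the storage the user was told to allocate.
static_assert(sizeof(example_control_block_t) <= sizeof(example_content_storage_t),
              "control block must fit in the public storage");
static_assert(alignof(example_control_block_t) <= alignof(example_content_storage_t),
              "public storage must be at least as aligned as the control block");

// Bytes left over for future growth of the private structure.
constexpr std::size_t example_spare_bytes() {
    return sizeof(example_content_storage_t) - sizeof(example_control_block_t);
}
```

Because user code only ever sees the 64-byte type, the private structure can grow into the spare bytes without changing the public ABI.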
Internally, all objects in the control block are manually destructed as part of zmq_msg_close (any nontrivial types are constructed with placement new):
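The general pattern of constructing into caller-owned storage and destructing by hand can be sketched as follows. This is an illustration under assumptions, not the PR's code: example_storage_t, example_control_block_t, example_construct_in and example_destroy_in are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <new>

// Hypothetical caller-owned storage, mirroring the 64-byte block above.
struct alignas(8) example_storage_t {
    unsigned char _[64];
};

// Hypothetical control block with a non-trivially-constructed member.
struct example_control_block_t {
    explicit example_control_block_t(void *data_) : data(data_), refcount(1) {}
    void *data;
    std::atomic<int> refcount;
};

static_assert(sizeof(example_control_block_t) <= sizeof(example_storage_t),
              "control block must fit in the caller's storage");

// Construct the control block inside memory the caller already owns.
// Placement new runs the constructor without allocating.
inline example_control_block_t *example_construct_in(example_storage_t *storage,
                                                     void *data) {
    return new (storage) example_control_block_t(data);
}

// Run the destructor explicitly. There is no delete here: the storage
// still belongs to the caller, who releases it in their own way.
inline void example_destroy_in(example_control_block_t *block) {
    block->~example_control_block_t();
}
```

The key property is that neither function touches the heap, so the only allocation cost is whatever the caller paid for the storage itself.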
My Use-case:
So for my sending thread, I have a slab allocator which pulls memory from a lock-free object pool. Under the hood this uses a freelist that batches memory allocations. This is fast, though to reduce thread contention in the CAS loop, I generally pull larger chunks and construct multiple smaller messages within the allocated buffer.
The messages I am using are not huge, but not tiny, at around 1 KB, so I am currently using zmq_msg_init_data. This works, and there is an inline reference-counted control block at the beginning of the allocation that is decremented in zmq's free_fn.
At this point, the malloc time from control block allocation starts to add up.
I realize I could send larger messages, and I will be doing that, but it involves rewriting a lot of the code and message logic on both the send and receive sides. Just allowing me to point to a preallocated control block at the beginning of the message gives me an easy ~25% speedup.
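For readers unfamiliar with the setup, here is a rough sketch of a slab with an inline reference count that is decremented from a free function. It is a simplified stand-in, not my actual allocator: all example_* names are hypothetical, the pool is replaced by plain malloc/free, and the counter exists only so the release is observable. The free function has the same shape as zmq_free_fn, i.e. void fn (void *data, void *hint).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical slab header placed at the start of each allocation.
struct example_slab_t {
    std::atomic<int> refcount;              // one reference per message in the slab
    void (*release)(example_slab_t *slab);  // hands the slab back to its owner
};

// Demonstration-only counter so the release can be observed.
static int example_release_count = 0;

// Stand-in for "return the slab to the pool": here it just frees it.
inline void example_release_to_heap(example_slab_t *slab) {
    ++example_release_count;
    slab->~example_slab_t();
    std::free(slab);
}

// Allocate one slab holding message_count messages of message_size bytes.
inline example_slab_t *example_slab_create(std::size_t message_count,
                                           std::size_t message_size,
                                           void (*release)(example_slab_t *)) {
    void *raw = std::malloc(sizeof(example_slab_t) + message_count * message_size);
    if (raw == nullptr)
        return nullptr;
    example_slab_t *slab = new (raw) example_slab_t;
    slab->refcount.store(static_cast<int>(message_count), std::memory_order_relaxed);
    slab->release = release;
    return slab;
}

// Address of the index-th message payload inside the slab.
inline void *example_slab_message(example_slab_t *slab, std::size_t index,
                                  std::size_t message_size) {
    return reinterpret_cast<unsigned char *>(slab) + sizeof(example_slab_t)
           + index * message_size;
}

// Free function in the shape of zmq_free_fn: the hint carries the slab.
// The last message to be released hands the whole slab back.
inline void example_slab_free_fn(void *data, void *hint) {
    (void) data;
    example_slab_t *slab = static_cast<example_slab_t *>(hint);
    if (slab->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
        slab->release(slab);
}
```

With zmq_msg_init_data, each payload pointer would be passed as data, example_slab_free_fn as the free function, and the slab as the hint. The slab itself is allocated once for many messages, which is why the per-message control block malloc inside the library becomes the dominant remaining allocation.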
Additionally, I am generally wary of the non-deterministic nature of new/delete, especially in the hot path loop.
Gotchas? / random thoughts?
Trying to think of issues with this, it seems pretty safe. Obviously it involves properly managing and releasing the control block memory, but that seems pretty easy for people already using a free function. It is not a solution for all issues, but the requirement that zmq use new/delete for any message above ~33 bytes seems less than ideal.
It also seems like the content object, as an internal API, is very stable (last touched 9 years ago?).
One question I was thinking of is the alignment: 8 bytes on x64 is the minimum, and it was easy to copy-paste the msg_t alignment, but it may be better to raise that to 16 bytes if there is the potential of needing 128-bit CAS-type instructions... That being said, people who use the library should probably respect the alignment of the types they are given and not assume.
Finally, as a gripe, I personally am not a fan of the name content_t, since it is not actually the message content but the control block for the message content. It keeps confusing me (which is why I say "control block" when I refer to it in this PR).