Avoid heap allocation for function calls with a small number of args #5824
I didn't fully realize that pybind11 still supports C++11 and C++14, hence the failing CI jobs. I can remove std::variant, which I think will also save memory and allow a larger inline size. It would be nice to get directional review feedback if folks see this before I get that done, though.
Awesome.
That'd be great. Could you please move WDYT about a more generic name, e.g. You'll have to update the top-level
This was feasible mostly because we can take a bunch of implementation shortcuts knowing that the value type is py::handle, which is trivially copyable and trivially destructible, and knowing that we only have to implement a limited subset of the vector interface. I think it would be best to be more specific, not less; I've thought about diverging from the vector interface entirely by requiring that the maximum size be specified up front (mandatory reserve) and not being able to grow in push_back at all, but I haven't done the legwork necessary to make sure that would actually be workable.
Got it, thanks for the explanation.
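To make the "mandatory reserve, no growth in push_back" idea above concrete, here is a minimal sketch. It is not the container from this PR: `fake_handle` is a hypothetical stand-in for py::handle (so the example compiles without CPython), and `small_handle_vector`, `is_inline`, and the rest of the interface are illustrative names. It assumes the element type is trivially copyable and trivially destructible, which is what allows the shortcuts (no destructor calls, no element-wise move on construction).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for py::handle: a trivially copyable pointer wrapper.
struct fake_handle {
    void *ptr;
};

// Illustrative only: capacity is fixed by a single up-front reserve() call,
// and push_back() never reallocates.
template <std::size_t InlineSize>
class small_handle_vector {
    static_assert(InlineSize > 0, "inline storage must hold at least one element");
    static_assert(std::is_trivially_copyable<fake_handle>::value,
                  "shortcuts below rely on trivial copies");
    static_assert(std::is_trivially_destructible<fake_handle>::value,
                  "no destructors are ever run on elements");

public:
    small_handle_vector() = default;
    small_handle_vector(const small_handle_vector &) = delete;
    small_handle_vector &operator=(const small_handle_vector &) = delete;

    // Must be called while the container is still empty.
    // Only requests larger than InlineSize touch the heap.
    void reserve(std::size_t n) {
        assert(size_ == 0 && "reserve-then-fill: reserve before pushing");
        if (n > InlineSize) {
            heap_.reset(new fake_handle[n]);
            capacity_ = n;
        }
    }

    // Appends without growing; exceeding the reserved capacity is a bug.
    void push_back(fake_handle h) {
        assert(size_ < capacity_ && "push_back cannot grow the container");
        data()[size_++] = h;
    }

    fake_handle &operator[](std::size_t i) { return data()[i]; }
    std::size_t size() const { return size_; }
    bool is_inline() const { return heap_ == nullptr; }

private:
    fake_handle *data() { return heap_ ? heap_.get() : inline_; }

    fake_handle inline_[InlineSize] = {};
    std::unique_ptr<fake_handle[]> heap_;
    std::size_t size_ = 0;
    std::size_t capacity_ = InlineSize;
};
```

With an inline size of 2, reserving and pushing two elements stays entirely in the inline buffer, while reserving three performs exactly one heap allocation up front. Whether such a restricted interface is workable for every caller is the open question raised in the comment above.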
…ents
We don't have access to llvm::SmallVector or similar, but given the limited subset of the `std::vector` API that `function_call::args{,_convert}` need and the "reserve-then-fill" usage pattern, it is relatively straightforward to implement custom containers that get the job done. This seems to significantly improve the time to call the collatz function in pybind/pybind11_benchmark; the numbers are a little noisy, but there's a clear improvement from "about 60 ns per call" to "about 45 ns per call" on my machine (M4 Max Mac), as measured with `timeit.repeat('collatz(4)', 'from pybind11_benchmark import collatz')`.
7c09a6b to 9415686
(Please don't force push, because that makes it more difficult for me to follow along. Just keep merging, that works really well for me.)
Hmm, looks like I managed to somehow break exactly IO redirection on exactly windows-latest but not windows-2022. Breakage includes clang-latest, so it's not an MSVC issue. Nothing jumps out at me as to the cause yet.
Looks like windows-latest might have changed to windows-2025 very recently. Example job from yesterday where windows-latest meant windows-2022: https://github.com/pybind/pybind11/actions/runs/17443220310/job/49540880643 . GitHub does seem to say that the rollout started 2 days ago, so this is plausible: actions/runner-images#12677 I guess we should check whether the windows-latest workflows are broken on master without any changes?
Sent #5825
Some super-quick comments, I can only spend 5 minutes right now. I'll probably get to it over the weekend.
include/pybind11/cast.h
Outdated
@@ -2045,10 +2046,12 @@ struct function_call {
    const function_record &func;

    /// Arguments passed to the function:
    std::vector<handle> args;
    /// (Inline size chosen mostly arbitrarily; 5 should pad function_call out to two cache lines
`6 should`?
Maybe make this a `constexpr unsigned argument_vector_small_size = 6;` or similar, because you're using the constant in at least three places?
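As a rough illustration of the suggestion above, the sketch below routes every use of the inline size through one named constant. Only the constant's name and value come from the review comment; `argument_slots_sketch`, `convert_flags_sketch`, and `function_call_sketch` are hypothetical placeholders, not pybind11 types.

```cpp
#include <cstddef>

// Single source of truth for the inline size (name and value as suggested in the review).
constexpr unsigned argument_vector_small_size = 6;

// Hypothetical placeholder for the container holding argument handles.
template <unsigned InlineSize>
struct argument_slots_sketch {
    static constexpr unsigned inline_size() { return InlineSize; }
    void *slots[InlineSize];
};

// Hypothetical placeholder for the container holding per-argument convert flags.
template <unsigned InlineSize>
struct convert_flags_sketch {
    static constexpr unsigned inline_size() { return InlineSize; }
    bool flags[InlineSize];
};

// Both members pick up the same constant, so changing it in one place
// keeps the two containers (and any other user) in sync.
struct function_call_sketch {
    argument_slots_sketch<argument_vector_small_size> args;
    convert_flags_sketch<argument_vector_small_size> args_convert;
};

static_assert(decltype(function_call_sketch::args)::inline_size()
                  == decltype(function_call_sketch::args_convert)::inline_size(),
              "both containers must share one inline size");
```

The `static_assert` is optional; it only documents that the two members are meant to stay in step.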
include/pybind11/pybind11.h
Outdated
if (overloaded) {
    // We're in the first no-convert pass, so swap out the conversion flags for a
    // set of all-false flags. If the call fails, we'll swap the flags back in for
    // the conversion-allowed call below.
    second_pass_convert.resize(func.nargs, false);
    second_pass_convert = args_convert_vector<6>(func.nargs, false);
Could you add a terse comment to hint that creating a new object is better than some sort of resize semantics?
the reason I chose not to implement resize is that it's more general than we need -- you can resize containers to be smaller or larger and cross (or not cross) the inline size limit in either direction. we can just rewrite this in a way that more or less removes the question, though:
second_pass_convert = std::move(call.args_convert);
call.args_convert = args_convert_vector<argument_vector_small_size>(func.nargs, false);
(call.args_convert is moved-from, so we're not necessarily sure about its state.)
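The move-then-replace pattern described above can be sketched in isolation as follows. This is an assumption-laden illustration: `std::vector<bool>` stands in for the PR's `args_convert_vector`, and `call_sketch`, `begin_no_convert_pass`, and `begin_convert_pass` are hypothetical names, not pybind11 functions.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for the PR's flag container; the real type is a custom small vector.
using convert_flags = std::vector<bool>;

// Hypothetical, heavily reduced version of the call record.
struct call_sketch {
    convert_flags args_convert;
};

// First (no-convert) pass: stash the real flags and install a freshly
// constructed all-false set. Assigning a new object means we never depend
// on the moved-from state of call.args_convert, and no resize() is needed.
inline convert_flags begin_no_convert_pass(call_sketch &call, std::size_t nargs) {
    convert_flags saved = std::move(call.args_convert);
    call.args_convert = convert_flags(nargs, false);
    return saved;
}

// Second (conversion-allowed) pass: put the stashed flags back.
inline void begin_convert_pass(call_sketch &call, convert_flags &saved) {
    assert(saved.size() == call.args_convert.size());
    call.args_convert = std::move(saved);
}
```

The container only needs a "construct with n copies of a value" constructor and move assignment for this to work, which is a smaller surface than general `resize`.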
@@ -0,0 +1,84 @@
#include "pybind11/pybind11.h"
Maybe create a new directory, test_low_level or similar? Putting this in test_embed seems misleading.
@henryiii for opinion
This doesn't qualify for test/pure_cpp because it needs CPython around to compile py::handle, so I put it in test_embed because I was hoping not to have to duplicate the CMake configuration for C++ tests that need CPython around. I'll see if there's another way to do that, like creating a utility CMake file.
what if instead we renamed test_embed?
> This doesn't qualify for test/pure_cpp

Agreed/realized.

> what if instead we renamed test_embed?

I'd be OK with that.
> renamed test_embed

done
🐍 (macos-latest, 3.14t, -DCMAKE_CXX_STANDARD=20) / 🧪 It was hanging here:
I've seen this once before, a few days ago. Rerun triggered.
…y clarify second_pass_convert
Small heads-up: I'll get to fully reviewing this only between Thu-Sun. How about "Low level" doesn't fit too well for the actual embedding tests. I think this accurately reflects that we're grouping tests together for a technical reason (I checked, all three existing
Description
We don't have access to `llvm::SmallVector` or similar, but given the limited subset of the `std::vector` API that `function_call::args{,_convert}` need and the "reserve-then-fill" usage pattern, it is relatively straightforward to implement custom containers that get the job done.

This seems to significantly improve the time to call the collatz function in pybind/pybind11_benchmark; the numbers are a little noisy, but there's a clear improvement from "about 60 ns per call" to "about 45 ns per call" on my machine (M4 Max Mac), as measured with `timeit.repeat('collatz(4)', 'from pybind11_benchmark import collatz')`.

Suggested changelog entry: