Add per-instance state lifecycle to VTableOverlay#1019
Conversation
ac58fe9 to
c4c7c27
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1019 +/- ##
==========================================
+ Coverage 85.99% 86.12% +0.12%
==========================================
Files 16 17 +1
Lines 5077 5116 +39
==========================================
+ Hits 4366 4406 +40
+ Misses 711 710 -1
🚀 New features to boost your workflow:
|
c4c7c27 to
3701275
Compare
3701275 to
f107487
Compare
Vipul-Cariappa
left a comment
There was a problem hiding this comment.
My understanding of this PR is as follows:
We introduce a mechanism to store some additional data on the vtable of the instance. These data are n void*. We also add an optional destructor overload. A custom destructor that the user wants to call after the call to the actual C++ destructor (I believe the custom dtor should be called before the actual dtor). We also add an argument to the overloaded destructor: void* for any additional data.
Please correct me if my understanding of things is wrong.
The commit message is horrible. Does not tell what is being done or why. Please make sure the commit message is firstly human readable. I also believe the comments in the source file are too verbose. But maybe it is required if someone needs to understand this code. But there are repetitive comments; these can be removed.
I cannot guarantee that the pointer arithmetic magic being does is correct; I expect there to be sufficient tests to catch any errors in it. FYI, I could not think of any edge cases that the current tests don't already cover.
| static void BM_XTU_OverlayWithDtorHook(benchmark::State& state) { | ||
| Impl impl; | ||
| auto* base = ReflectOverlayBase(); | ||
| void* fn = TestUtils::BitCastFn<void*>(&xtu_replacement); | ||
| Cpp::TCppConstFunction_t method = Frob(base); | ||
| auto* ov = Cpp::MakeVTableOverlay( | ||
| &impl, base, &method, &fn, /*n=*/1, | ||
| /*n_extra_prefix_slots=*/0, | ||
| /*on_destroy=*/&xtu_noop_cleanup, | ||
| /*cleanup_data=*/nullptr); | ||
| for (auto _ : state) | ||
| benchmark::DoNotOptimize(OverlayDispatchOnce(&impl, 7)); | ||
| Cpp::DestroyVTableOverlay(ov); | ||
| } | ||
| BENCHMARK(BM_XTU_OverlayWithDtorHook); |
There was a problem hiding this comment.
You are benchmarking this, but don't EXPECT_AT_LEAST_N_TIMES_FASTER, why?
f107487 to
7f605ef
Compare
7f605ef to
dd81118
Compare
Bindings layering on top of MakeVTableOverlay -- cppyy's Python handlers, Lua closures, RooFit's per-object trampolines -- need two things the initial primitive does not provide: 1. A place to stash per-instance state adjacent to the vtable, so a thunk reads it with a single fixed-offset load. Without this the binding falls back on a process-global registry keyed by `self` pointer -- hash + mutex per dispatch. 2. A tear-down moment tied to the C++ object's lifetime so the state can be released. Without this the binding either leaks (host refcounts on the wrapper pin it open and deadlock the GC) or threads cleanup through every potential destruction path: temporaries expiring at semicolon, container element drop, exception unwind, shared_ptr cycle break. MakeVTableOverlay grows three new optional parameters. \c n_extra_prefix_slots reserves nullptr-initialised \c void* cells before the ABI prefix in the owned block; thunks reach them through the inline \c VTableOverlayExtraSlot(self, i) helper. \c on_destroy and \c cleanup_data install a per-instance trampoline at the deleting-dtor slot (D0 on Itanium, the single deleting dtor on MSVC) that runs the callback first -- while \c inst is still alive and its state inspectable -- and then the original destructor. After the original returns, the C++ memory has been freed. The callback must not throw (running inside a C++ destructor) and must not destroy the overlay (the original destructor still has to run after it returns). Routing from `self` to the owning VTableOverlay goes through a hidden self-pointer slot interposed between the user's extra prefix slots and the ABI prefix. The wrapper reads it via a fixed-offset load from `self`'s vptr -- no per-interpreter map, no mutex, no walk. The public detail::kVTableOverlayPrefixSize accounts for the hidden slot so the extra-slot offset formula in the inline helper is unchanged. DestroyVTableOverlay is idempotent against both directions of the race: a \c dtor_fired flag suppresses the vptr restore when the wrapper has started, and caller-driven teardown restores the original vptr so the wrapper is no longer reachable from \c inst. Tests cover the four observable behaviours (DestructorHookFires, DestructorHookIsPerInstance, DestructorHookOptInZero, DestructorHookSkippedOnCallerTeardown) plus SetsExtraPrefixSlots; the cross-TU bench adds BM_XTU_OverlayExtraSlot vs BM_XTU_OverlayGlobalMap with an EXPECT_AT_LEAST_N_TIMES_FASTER assertion pinning the extra-slot vs registry lookup gap, plus a BM_XTU_OverlayWithDtorHook trend cell (no assertion -- the dtor wrapper patches a slot the bench does not dispatch through, so the per-call cost is structurally unchanged).
Bindings layering a precompiled thunk catalog on top of VTableOverlay need one dispatcher function per (signature, slot) cell in the table. The body is mechanical -- look up the handler from a per-instance location, marshal Args..., call the bound callable, marshal the result. The marshaling is binding-specific (cppyy uses its converter framework; a future Lua / R binding would use its own); the lookup is also binding-specific (cppyy points at the extra prefix slot reserved via MakeVTableOverlay's n_extra_prefix_slots). CppInterOpThunks.h ships the variadic template the binding instantiates per (R, Args..., Slot) cell. The binding contributes a HandlerTraits struct with From(self) and Call<R, Args...>(handler, slot, args...); CppInterOp owns only the static dispatcher glue. Header-only and self-contained -- no dependency on CppInterOp's runtime entry points. Identical instantiations fold to the same .text, so a catalog of N cells over M signatures costs N*M function pointers and at most M distinct bodies. The variadic shape lets the binding extend the catalog with new (R, Args...) tuples by listing more rows, not by hand-writing more thunks. CppInterOpThunksTest.cpp documents the contract end-to-end: a stand-in HandlerTraits exercises three instantiations through dispatch and verifies (a) the call forwards through From + Call with the expected arguments, (b) distinct (Slot, R, Args...) tuples produce distinct function-pointer addresses, and (c) the void-return / no-args shape compiles and runs.
dd81118 to
0bb5693
Compare
Vipul-Cariappa
left a comment
There was a problem hiding this comment.
I don't completely understand why CppImpl::Thunks::dispatch is required. But maybe I will understand it better once I see the associated cppyy PR.
Other than that, this looks good to me!
Bindings layering on top of MakeVTableOverlay -- cppyy's Python handlers, a Lua binding's closures, RooFit's per-object trampolines -- need two things from the overlay primitive: a place to stash per-instance state adjacent to the vtable so a thunk reads it with a fixed-offset load, and a tear-down moment tied to the C++ object's lifetime so the state can be released. Without the first they pay a hash + mutex per dispatch through a process-global handler map (BM_XTU_OverlayGlobalMap baseline at ~17 ns/call); without the second they either leak the state (refcounts on the host wrapper pin it open and deadlock the GC) or thread cleanup through every potential destruction path -- a temporary expiring at semicolon, container element drop, exception unwind, shared_ptr cycle break.
MakeVTableOverlay grows three new optional parameters. n_extra_prefix_slots reserves nullptr-initialized void* cells before the ABI prefix in the owned block; thunks reach them through the inline VTableOverlayExtraSlot(self, i) helper -- one fixed-offset load, no registry. on_destroy and cleanup_data install a per-instance trampoline at the deleting-dtor slot (D0 on Itanium, the single deleting dtor on MSVC) that runs the original destructor first, then the cleanup callback. Original-first ordering keeps virtuals dispatched from inside ~T()'s body resolving through the still- installed overlay; the callback runs only when no further virtuals can fire. The callback must not throw -- it runs inside a C++ destructor, so an escaping exception is undefined. Per-instance dtor-hook metadata lives on the active CppImpl::Interpreter (mirroring #900 DLM move) and the wrapper walks the interpreter list at fire time, so the orig_dtor JIT pointer never outlives its emitter and overlay blocks carry no routing slot. DestroyVTableOverlay is idempotent against both directions of the race: an internal dtor_fired flag suppresses the vptr restore on a dead instance, and caller-driven teardown drops the matching hook entry across all interpreters before restoring.
Five new unit tests pin the contract. SetsExtraPrefixSlots verifies the slots are nullptr-initialized + writable and that overlaid dispatch still flows through the new layout. DestructorHookFires (callback runs once after operator delete with the original inst), DestructorHookIsPerInterpreter (two interpreters, no cross-map bleed), DestructorHookOptInZero (with on_destroy=nullptr the deleting-dtor slot is the verbatim copy from the original vtable), and DestructorHookSkippedOnCallerTeardown (race where teardown beats the wrapper, callback never fires) pin the four corners of the dtor-hook contract. Two new bench cells track dispatch-path cost. BM_XTU_OverlayExtraSlot at ~3.9 ns/call asserts >= 2x faster than BM_XTU_OverlayGlobalMap, the registry-lookup alternative. BM_XTU_OverlayWithDtorHook prints its per-call cost for trend visibility -- the wrapper patches a different vtable slot from the dispatched method's, so cost is structurally unchanged, but at sub-10 ns precision macOS CI jitter dominates any relative threshold. Both features are opt-in: n_extra_prefix_slots = 0 and on_destroy = nullptr (the defaults) preserve the prior behavior bit for bit.