Skip to content

Add per-instance state lifecycle to VTableOverlay#1019

Merged
vgvassilev merged 2 commits into
compiler-research:mainfrom
vgvassilev:vtable-overlay
Jun 3, 2026
Merged

Add per-instance state lifecycle to VTableOverlay#1019
vgvassilev merged 2 commits into
compiler-research:mainfrom
vgvassilev:vtable-overlay

Conversation

@vgvassilev
Copy link
Copy Markdown
Contributor

@vgvassilev vgvassilev commented May 31, 2026

Bindings layering on top of MakeVTableOverlay -- cppyy's Python handlers, a Lua binding's closures, RooFit's per-object trampolines -- need two things from the overlay primitive: a place to stash per-instance state adjacent to the vtable so a thunk reads it with a fixed-offset load, and a tear-down moment tied to the C++ object's lifetime so the state can be released. Without the first they pay a hash + mutex per dispatch through a process-global handler map (BM_XTU_OverlayGlobalMap baseline at ~17 ns/call); without the second they either leak the state (refcounts on the host wrapper pin it open and deadlock the GC) or thread cleanup through every potential destruction path -- a temporary expiring at semicolon, container element drop, exception unwind, shared_ptr cycle break.

MakeVTableOverlay grows three new optional parameters. n_extra_prefix_slots reserves nullptr-initialized void* cells before the ABI prefix in the owned block; thunks reach them through the inline VTableOverlayExtraSlot(self, i) helper -- one fixed-offset load, no registry. on_destroy and cleanup_data install a per-instance trampoline at the deleting-dtor slot (D0 on Itanium, the single deleting dtor on MSVC) that runs the original destructor first, then the cleanup callback. Original-first ordering keeps virtuals dispatched from inside ~T()'s body resolving through the still- installed overlay; the callback runs only when no further virtuals can fire. The callback must not throw -- it runs inside a C++ destructor, so an escaping exception is undefined. Per-instance dtor-hook metadata lives on the active CppImpl::Interpreter (mirroring #900 DLM move) and the wrapper walks the interpreter list at fire time, so the orig_dtor JIT pointer never outlives its emitter and overlay blocks carry no routing slot. DestroyVTableOverlay is idempotent against both directions of the race: an internal dtor_fired flag suppresses the vptr restore on a dead instance, and caller-driven teardown drops the matching hook entry across all interpreters before restoring.

Five new unit tests pin the contract. SetsExtraPrefixSlots verifies the slots are nullptr-initialized + writable and that overlaid dispatch still flows through the new layout. DestructorHookFires (callback runs once after operator delete with the original inst), DestructorHookIsPerInterpreter (two interpreters, no cross-map bleed), DestructorHookOptInZero (with on_destroy=nullptr the deleting-dtor slot is the verbatim copy from the original vtable), and DestructorHookSkippedOnCallerTeardown (race where teardown beats the wrapper, callback never fires) pin the four corners of the dtor-hook contract. Two new bench cells track dispatch-path cost. BM_XTU_OverlayExtraSlot at ~3.9 ns/call asserts >= 2x faster than BM_XTU_OverlayGlobalMap, the registry-lookup alternative. BM_XTU_OverlayWithDtorHook prints its per-call cost for trend visibility -- the wrapper patches a different vtable slot from the dispatched method's, so cost is structurally unchanged, but at sub-10 ns precision macOS CI jitter dominates any relative threshold. Both features are opt-in: n_extra_prefix_slots = 0 and on_destroy = nullptr (the defaults) preserve the prior behavior bit for bit.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.12%. Comparing base (6530e99) to head (0bb5693).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1019      +/-   ##
==========================================
+ Coverage   85.99%   86.12%   +0.12%     
==========================================
  Files          16       17       +1     
  Lines        5077     5116      +39     
==========================================
+ Hits         4366     4406      +40     
+ Misses        711      710       -1     
Files with missing lines Coverage Δ
include/CppInterOp/CppInterOp.h 100.00% <100.00%> (ø)
include/CppInterOp/CppInterOpThunks.h 100.00% <100.00%> (ø)
include/CppInterOp/CppInterOpTypes.h 93.87% <ø> (ø)
lib/CppInterOp/CppInterOp.cpp 89.50% <100.00%> (+0.16%) ⬆️
Files with missing lines Coverage Δ
include/CppInterOp/CppInterOp.h 100.00% <100.00%> (ø)
include/CppInterOp/CppInterOpThunks.h 100.00% <100.00%> (ø)
include/CppInterOp/CppInterOpTypes.h 93.87% <ø> (ø)
lib/CppInterOp/CppInterOp.cpp 89.50% <100.00%> (+0.16%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Collaborator

@Vipul-Cariappa Vipul-Cariappa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of this PR is as follows:
We introduce a mechanism to store some additional data on the vtable of the instance. These data are n void*. We also add an optional destructor overload. A custom destructor that the user wants to call after the call to the actual C++ destructor (I believe the custom dtor should be called before the actual dtor). We also add an argument to the overloaded destructor: void* for any additional data.
Please correct me if my understanding of things is wrong.

The commit message is horrible. Does not tell what is being done or why. Please make sure the commit message is firstly human readable. I also believe the comments in the source file are too verbose. But maybe it is required if someone needs to understand this code. But there are repetitive comments; these can be removed.

I cannot guarantee that the pointer arithmetic magic being does is correct; I expect there to be sufficient tests to catch any errors in it. FYI, I could not think of any edge cases that the current tests don't already cover.

Comment thread lib/CppInterOp/CppInterOp.cpp Outdated
Comment thread lib/CppInterOp/CppInterOp.td Outdated
Comment thread unittests/CppInterOp/VTableOverlayCrossTUBench.cpp Outdated
Comment thread unittests/CppInterOp/VTableOverlayCrossTUBench.cpp Outdated
Comment thread unittests/CppInterOp/VTableOverlayCrossTUBench.cpp
Comment on lines +196 to +210
static void BM_XTU_OverlayWithDtorHook(benchmark::State& state) {
Impl impl;
auto* base = ReflectOverlayBase();
void* fn = TestUtils::BitCastFn<void*>(&xtu_replacement);
Cpp::TCppConstFunction_t method = Frob(base);
auto* ov = Cpp::MakeVTableOverlay(
&impl, base, &method, &fn, /*n=*/1,
/*n_extra_prefix_slots=*/0,
/*on_destroy=*/&xtu_noop_cleanup,
/*cleanup_data=*/nullptr);
for (auto _ : state)
benchmark::DoNotOptimize(OverlayDispatchOnce(&impl, 7));
Cpp::DestroyVTableOverlay(ov);
}
BENCHMARK(BM_XTU_OverlayWithDtorHook);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are benchmarking this, but don't EXPECT_AT_LEAST_N_TIMES_FASTER, why?

Comment thread unittests/CppInterOp/VTableOverlayTest.cpp Outdated
Comment thread unittests/CppInterOp/VTableOverlayTest.cpp Outdated
Comment thread lib/CppInterOp/CppInterOp.cpp Outdated
Comment thread lib/CppInterOp/CppInterOp.cpp Outdated
Comment thread include/CppInterOp/CppInterOpThunks.h
Comment thread lib/CppInterOp/CppInterOp.cpp Outdated
Comment thread lib/CppInterOp/CppInterOp.cpp Outdated
Comment thread lib/CppInterOp/CppInterOp.cpp Outdated
Comment thread lib/CppInterOp/CppInterOp.cpp Outdated
Comment thread lib/CppInterOp/CppInterOp.cpp Outdated
Bindings layering on top of MakeVTableOverlay -- cppyy's Python
handlers, Lua closures, RooFit's per-object trampolines -- need two
things the initial primitive does not provide:

1. A place to stash per-instance state adjacent to the vtable, so a
   thunk reads it with a single fixed-offset load. Without this the
   binding falls back on a process-global registry keyed by `self`
   pointer -- hash + mutex per dispatch.

2. A tear-down moment tied to the C++ object's lifetime so the state
   can be released. Without this the binding either leaks (host
   refcounts on the wrapper pin it open and deadlock the GC) or
   threads cleanup through every potential destruction path:
   temporaries expiring at semicolon, container element drop,
   exception unwind, shared_ptr cycle break.

MakeVTableOverlay grows three new optional parameters. \c
n_extra_prefix_slots reserves nullptr-initialised \c void* cells
before the ABI prefix in the owned block; thunks reach them through
the inline \c VTableOverlayExtraSlot(self, i) helper. \c on_destroy
and \c cleanup_data install a per-instance trampoline at the
deleting-dtor slot (D0 on Itanium, the single deleting dtor on MSVC)
that runs the callback first -- while \c inst is still alive and its
state inspectable -- and then the original destructor. After the
original returns, the C++ memory has been freed. The callback must
not throw (running inside a C++ destructor) and must not destroy the
overlay (the original destructor still has to run after it returns).

Routing from `self` to the owning VTableOverlay goes through a hidden
self-pointer slot interposed between the user's extra prefix slots
and the ABI prefix. The wrapper reads it via a fixed-offset load from
`self`'s vptr -- no per-interpreter map, no mutex, no walk. The
public detail::kVTableOverlayPrefixSize accounts for the hidden slot
so the extra-slot offset formula in the inline helper is unchanged.

DestroyVTableOverlay is idempotent against both directions of the
race: a \c dtor_fired flag suppresses the vptr restore when the
wrapper has started, and caller-driven teardown restores the
original vptr so the wrapper is no longer reachable from \c inst.

Tests cover the four observable behaviours
(DestructorHookFires, DestructorHookIsPerInstance,
DestructorHookOptInZero, DestructorHookSkippedOnCallerTeardown)
plus SetsExtraPrefixSlots; the cross-TU bench adds
BM_XTU_OverlayExtraSlot vs BM_XTU_OverlayGlobalMap with an
EXPECT_AT_LEAST_N_TIMES_FASTER assertion pinning the
extra-slot vs registry lookup gap, plus a BM_XTU_OverlayWithDtorHook
trend cell (no assertion -- the dtor wrapper patches a slot the bench
does not dispatch through, so the per-call cost is structurally
unchanged).
Bindings layering a precompiled thunk catalog on top of VTableOverlay
need one dispatcher function per (signature, slot) cell in the table.
The body is mechanical -- look up the handler from a per-instance
location, marshal Args..., call the bound callable, marshal the
result. The marshaling is binding-specific (cppyy uses its converter
framework; a future Lua / R binding would use its own); the lookup is
also binding-specific (cppyy points at the extra prefix slot reserved
via MakeVTableOverlay's n_extra_prefix_slots).

CppInterOpThunks.h ships the variadic template the binding
instantiates per (R, Args..., Slot) cell. The binding contributes a
HandlerTraits struct with From(self) and
Call<R, Args...>(handler, slot, args...); CppInterOp owns only the
static dispatcher glue. Header-only and self-contained -- no
dependency on CppInterOp's runtime entry points.

Identical instantiations fold to the same .text, so a catalog of N
cells over M signatures costs N*M function pointers and at most M
distinct bodies. The variadic shape lets the binding extend the
catalog with new (R, Args...) tuples by listing more rows, not by
hand-writing more thunks.

CppInterOpThunksTest.cpp documents the contract end-to-end: a stand-in
HandlerTraits exercises three instantiations through dispatch and
verifies (a) the call forwards through From + Call with the expected
arguments, (b) distinct (Slot, R, Args...) tuples produce distinct
function-pointer addresses, and (c) the void-return / no-args shape
compiles and runs.
Copy link
Copy Markdown
Collaborator

@Vipul-Cariappa Vipul-Cariappa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't completely understand why CppImpl::Thunks::dispatch is required. But maybe I will understand it better once I see the associated cppyy PR.
Other than that, this looks good to me!

@vgvassilev vgvassilev merged commit f2fd9a9 into compiler-research:main Jun 3, 2026
42 of 43 checks passed
@vgvassilev vgvassilev deleted the vtable-overlay branch June 3, 2026 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants