Commit f5d38d7

Add allocator-OS residency control design document
1 parent dabf11c commit f5d38d7

1 file changed

Lines changed: 242 additions & 0 deletions

@@ -0,0 +1,242 @@
# Allocator–OS Memory Residency Control Plan

This document defines a low-level memory residency strategy for `ssh-chatter` without changing object layouts, message/feed limits, or allocator determinism.

## 1) Span Layer at Allocator–OS Boundary

Add a **span manager** below the existing slab/arena logic.

- **span**: page-aligned contiguous memory chunk (multiple pages, fixed span class sizes).
- **slabs** are carved from spans for fixed-size allocations.
- allocator metadata tracks span state, not object layout.

### Span metadata (out-of-band)

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>   // off_t

typedef enum {
    SPAN_HOT = 0,
    SPAN_COLD = 1,
    SPAN_FREE = 2,
    SPAN_MAPPED_FILE = 3
} span_state_t;

typedef struct span {
    void *base;              // page-aligned start
    size_t length;           // bytes, multiple of page size
    uint32_t page_count;     // length / page size
    uint32_t inuse_objects;  // live allocs in span
    uint32_t free_objects;   // free slots in span
    uint64_t last_touch_ns;  // monotonic timestamp
    uint32_t reclaim_epoch;
    bool no_hugepage_set;
    bool file_backed;
    int backing_fd;          // -1 for anonymous
    off_t backing_off;       // offset into the backing file
    span_state_t state;
    struct span *next;
} span_t;
```

Rules:

1. `base` must be aligned to the system page size.
2. `length` must be a multiple of the page size.
3. Metadata is allocated separately so existing app structs remain unchanged.
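
The three rules can be sketched as a small creation routine. This is a minimal illustration, assuming a Linux/POSIX target; `demo_span_t` and every `demo_*` function are hypothetical names introduced here and are not part of `ssh-chatter`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* exposes MAP_ANONYMOUS under strict modes */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical, trimmed-down span record: only the fields this sketch needs. */
typedef struct demo_span {
    void    *base;        /* page-aligned start */
    size_t   length;      /* bytes, multiple of page size */
    uint32_t page_count;  /* length / page size */
} demo_span_t;

size_t demo_page_size(void) {
    return (size_t)sysconf(_SC_PAGESIZE);
}

/* Rule 1: base is aligned to the system page size. */
bool demo_is_page_aligned(const void *p) {
    return ((uintptr_t)p % demo_page_size()) == 0;
}

/* Rule 2: length is a non-zero multiple of the page size. */
bool demo_is_page_multiple(size_t len) {
    return len != 0 && (len % demo_page_size()) == 0;
}

/* Rule 3: the span is mapped anonymously, its metadata is malloc'ed apart. */
demo_span_t *demo_span_create(size_t length) {
    if (!demo_is_page_multiple(length)) return NULL;

    void *base = mmap(NULL, length, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;
    assert(demo_is_page_aligned(base));   /* mmap returns page-aligned memory */

    demo_span_t *s = malloc(sizeof *s);   /* metadata lives outside the span */
    if (s == NULL) {
        munmap(base, length);
        return NULL;
    }
    s->base = base;
    s->length = length;
    s->page_count = (uint32_t)(length / demo_page_size());
    return s;
}

void demo_span_destroy(demo_span_t *s) {
    if (s == NULL) return;
    munmap(s->base, s->length);
    free(s);
}
```

Keeping the record in ordinary heap memory means a later `madvise` or `munmap` over `[base, base + length)` cannot touch the metadata itself.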

## 2) Page-Span Tracking + Reclamation

Track per-span occupancy and temperature.

- On each alloc/free from a slab: update `inuse_objects`, `free_objects`, `last_touch_ns`.
- Reclamation only happens at span granularity (never partial object compaction).

### Reclamation primitives

- **Anonymous cold/free spans:** `madvise(base, length, MADV_DONTNEED)`. On private anonymous memory this discards the page contents (the next access reads zero-filled pages), so apply it only when the span holds no live objects; a cold span with live data has to stay mapped or move to a file-backed segment (section 4).
- **Large dedicated allocations:** `munmap(base, length)` when fully free and not pooled.
- **Optional lazy reclaim mode:** `MADV_FREE` for non-critical latency profiles; the kernel reclaims the pages only under memory pressure, so RSS drops later than with `MADV_DONTNEED`.

### Core reclaim logic (pseudocode)

```c
// Helpers such as is_page_aligned() and is_cold_enough() are assumed to be defined elsewhere.
void maybe_release_span(span_t *s, int reason);

void span_on_object_free(span_t *s) {
    if (s->inuse_objects == 0) return;   // guard against counter underflow
    s->inuse_objects--;
    s->free_objects++;
    s->last_touch_ns = now_monotonic_ns();

    if (s->inuse_objects == 0) {
        s->state = SPAN_FREE;
        maybe_release_span(s, REASON_FULLY_FREE);
    }
}

void maybe_release_span(span_t *s, int reason) {
    if (!is_page_aligned(s->base) || !is_page_multiple(s->length)) return;

    if (s->file_backed) {
        // persistent content: keep mapping, let page cache reclaim naturally
        if (reason == REASON_COLD || reason == REASON_FULLY_FREE) {
            // MAP_SHARED: drops resident pages from this process; dirty pages are still written back
            madvise(s->base, s->length, MADV_DONTNEED);
        }
        return;
    }

    if (is_large_dedicated_span(s) && s->inuse_objects == 0) {
        if (munmap(s->base, s->length) != 0) return;   // leave span state untouched on failure
        s->state = SPAN_FREE;
        remove_from_active_span_sets(s);
        enqueue_virtual_hole_for_remap(s->length);
        return;
    }

    if (s->inuse_objects == 0) {
        // anonymous and empty: safe to discard, pages refault as zero-fill
        madvise(s->base, s->length, MADV_DONTNEED);
        s->state = SPAN_FREE;
    } else if (is_cold_enough(s)) {
        // anonymous with live objects: MADV_DONTNEED would zero them, so only mark cold
        s->state = SPAN_COLD;
    }
}
```
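
Why the reclaim path has to care about live objects can be shown with a short experiment. The sketch below assumes Linux semantics for `MADV_DONTNEED` on a private anonymous mapping; `demo_dontneed_discards` is a hypothetical name used only for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a private anonymous mapping reads back as zero after
 * MADV_DONTNEED, 0 if the old bytes survived, -1 on error.
 */
int demo_dontneed_discards(void) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;
    size_t len = (size_t)page * 4;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    memset(p, 0xAB, len);                  /* touch: pages become resident */
    assert(p[0] == 0xAB);

    if (madvise(p, len, MADV_DONTNEED) != 0) {   /* release physical pages */
        munmap(p, len);
        return -1;
    }

    /* The next access faults in fresh zero pages; the 0xAB bytes are gone. */
    int discarded = (p[0] == 0 && p[len - 1] == 0);
    munmap(p, len);
    return discarded;
}
```

On Linux the function is expected to return 1, which is the behavior that makes `MADV_DONTNEED` a good fit for empty spans and a data-loss hazard for anonymous spans that still hold objects.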

## 3) Hot/Cold Arena Segmentation

For large text/message arenas, split the arena into fixed segments (each segment is one or more spans).

- **hot segment set**: currently active window (recent chat, active sessions).
- **cold segment set**: old history / infrequently read blocks.

Segment policy:

1. New writes go to the current hot segment.
2. A segment becomes cold once the idle timeout has elapsed and the write cursor has rotated away from it.
3. Cold segments are reclaimed with `MADV_DONTNEED` when their contents are disposable (anonymous) or when they sit on a file-backed mapping (preferred for persistence).
4. Accessing a cold segment is legal; page faults repopulate from zero-fill (anonymous, so the previous contents are gone) or from the file/page cache (mapped).
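
Policy rules 1 and 2 can be sketched as pure bookkeeping, without any memory mapping. Everything here is an assumption made for illustration: the `demo_*` names, the fixed count of four segments, and the choice to reuse segments in ring order when the cursor wraps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SEGMENTS 4            /* hypothetical segment count */

typedef struct demo_segment {
    size_t   used;                 /* bytes written so far */
    size_t   capacity;             /* fixed segment size */
    uint64_t last_touch_ns;        /* time of the last write */
    bool     cold;
} demo_segment_t;

typedef struct demo_arena {
    demo_segment_t seg[DEMO_SEGMENTS];
    size_t         cursor;         /* index of the current hot segment */
} demo_arena_t;

void demo_arena_init(demo_arena_t *a, size_t seg_capacity) {
    assert(a != NULL);
    for (size_t i = 0; i < DEMO_SEGMENTS; i++) {
        a->seg[i] = (demo_segment_t){ .capacity = seg_capacity };
    }
    a->cursor = 0;
}

/* Rule 1: writes go to the hot segment; rotate the cursor when it is full.
 * Returns the segment index written to, or -1 if nbytes can never fit. */
int demo_arena_write(demo_arena_t *a, size_t nbytes, uint64_t now_ns) {
    if (nbytes > a->seg[a->cursor].capacity) return -1;
    if (a->seg[a->cursor].used + nbytes > a->seg[a->cursor].capacity) {
        a->cursor = (a->cursor + 1) % DEMO_SEGMENTS;   /* write cursor rotation */
        a->seg[a->cursor].used = 0;      /* assumption: ring reuse of old segment */
        a->seg[a->cursor].cold = false;
    }
    demo_segment_t *s = &a->seg[a->cursor];
    s->used += nbytes;
    s->last_touch_ns = now_ns;
    return (int)a->cursor;
}

/* Rule 2: cold = idle past the timeout AND no longer the write target.
 * Returns how many segments were newly marked cold. */
size_t demo_arena_mark_cold(demo_arena_t *a, uint64_t now_ns, uint64_t idle_ns) {
    size_t marked = 0;
    for (size_t i = 0; i < DEMO_SEGMENTS; i++) {
        demo_segment_t *s = &a->seg[i];
        if (i == a->cursor || s->cold || s->used == 0) continue;
        if (now_ns > s->last_touch_ns && now_ns - s->last_touch_ns > idle_ns) {
            s->cold = true;
            marked++;
        }
    }
    return marked;
}
```

Passing the clock in as `now_ns` keeps the policy testable without sleeping; the real allocator would feed it the monotonic time it already records in `last_touch_ns`.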

### Temperature scan loop (pseudocode)

```c
void residency_maintenance_tick(void) {
    uint64_t now = now_monotonic_ns();

    for_each_span(s) {
        // clamp to 0 if another thread touched the span after `now` was read
        uint64_t idle_ns = (now > s->last_touch_ns) ? now - s->last_touch_ns : 0;

        // HOT -> COLD once the span has been idle long enough
        if (s->state == SPAN_HOT && idle_ns > COLD_IDLE_NS) {
            s->state = SPAN_COLD;
            maybe_release_span(s, REASON_COLD);
        }

        // COLD -> released once it is also empty and past the longer timeout
        if (s->state == SPAN_COLD && idle_ns > FREE_IDLE_NS && s->inuse_objects == 0) {
            maybe_release_span(s, REASON_FULLY_FREE);
        }
    }
}
```
133+
134+
Determinism is preserved because state transitions are threshold-based and run from a fixed periodic maintenance tick.
135+
136+
## 4) File-Backed Mapping for Persistent Cold Data
137+
138+
Move persistent, rarely written bulk storage (BBS posts, archived messages, feed history) to mmap-backed segment files.
139+
140+
- `fd = open(..., O_RDWR|O_CREAT)`
141+
- `ftruncate(fd, segment_size)`
142+
- `ptr = mmap(NULL, segment_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, off)`
143+
144+
Keep RAM-resident:
145+
146+
- indexes
147+
- small recent-write window
148+
- segment directory metadata
149+
150+
When cold:
151+
152+
- `madvise(ptr, segment_size, MADV_DONTNEED)` to drop clean/unused resident pages.
153+
- data remains durable in file, and reload is page-cache driven.
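
The map, write, evict, reload cycle can be sketched end to end. This is an illustration under stated assumptions, not the project's storage code: `demo_file_segment_roundtrip` is a hypothetical name, the segment is four pages, and the backing file is a throwaway temp file under `/tmp` rather than a real segment file.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Writes msg into a file-backed segment, evicts the segment's pages, and
 * checks that the bytes fault back in from the file.
 * Returns 1 if the data survived, 0 if not, -1 on error.
 */
int demo_file_segment_roundtrip(const char *msg) {
    assert(msg != NULL);
    char path[] = "/tmp/demo_segment_XXXXXX";      /* hypothetical location */
    size_t seg_size = (size_t)sysconf(_SC_PAGESIZE) * 4;
    size_t n = strlen(msg) + 1;
    if (n > seg_size) return -1;

    int fd = mkstemp(path);                        /* opens read/write, mode 0600 */
    if (fd < 0) return -1;
    unlink(path);                                  /* demo only: leave no file behind */

    if (ftruncate(fd, (off_t)seg_size) != 0) { close(fd); return -1; }

    char *ptr = mmap(NULL, seg_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) { close(fd); return -1; }

    memcpy(ptr, msg, n);                           /* write through the mapping */

    int ok = -1;
    if (msync(ptr, seg_size, MS_SYNC) == 0 &&      /* flush before eviction */
        madvise(ptr, seg_size, MADV_DONTNEED) == 0) {
        ok = (memcmp(ptr, msg, n) == 0);           /* refault reads the file data */
    }

    munmap(ptr, seg_size);
    close(fd);
    return ok;
}
```

Unlike the anonymous case, the read after `MADV_DONTNEED` returns the original bytes, because a `MAP_SHARED` mapping is repopulated from the file rather than from zero pages.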

## 5) Prevent False Retention

To avoid spans staying resident due to allocator internals:

1. **Bound per-size-class free lists** by span count, not object count.
2. Prefer freeing objects back into the same span until it becomes fully free.
3. Disable cross-span stealing when a span is near-empty (to create reclaimable empty spans).
4. Add optional **arena rotation**: new allocations avoid nearly-empty old spans once a fresh hot span exists.
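
One way to realize rules 3 and 4 is to serve allocations from the fullest span that still has room, so sparsely used spans drain toward empty. The text above does not prescribe this heuristic; it is a swapped-in illustration, and `demo_slab_span_t` and `demo_pick_alloc_span` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct demo_slab_span {
    uint32_t inuse_objects;   /* live objects in the span */
    uint32_t capacity;        /* total object slots in the span */
} demo_slab_span_t;

/*
 * Returns the index of the fullest span that still has a free slot,
 * or -1 if every span is full (the caller would then map a fresh span).
 */
int demo_pick_alloc_span(const demo_slab_span_t *spans, size_t count) {
    assert(spans != NULL || count == 0);
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (spans[i].inuse_objects >= spans[i].capacity) continue;   /* full */
        if (best < 0 ||
            spans[i].inuse_objects > spans[(size_t)best].inuse_objects) {
            best = (int)i;   /* fuller than the current candidate */
        }
    }
    return best;
}
```

With occupancies of 1, 8, 6, and 0 out of 8 slots, the function picks the span holding 6, leaving the nearly empty spans alone so that a few more frees make them reclaimable.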

### Free-list trimming rule

```c
if (size_class_cache.spans_idle > IDLE_SPAN_LIMIT) {
    span_t *victim = pick_oldest_idle_span(size_class);
    if (victim != NULL && victim->inuse_objects == 0) {
        // unlink first: releasing may unmap the span and drop it from the active sets
        unlink_from_size_class_cache(victim);
        maybe_release_span(victim, REASON_FULLY_FREE);
    }
}
```

## 6) Kernel Interaction Settings

Apply mapping hints immediately after span creation:

```c
void span_post_map_init(span_t *s) {
    // avoid THP 2MB pinning; failure is non-fatal (e.g. kernel built without THP)
    if (madvise(s->base, s->length, MADV_NOHUGEPAGE) == 0) {
        s->no_hugepage_set = true;
    }
}
```
185+
186+
Reclaim mode selection:
187+
188+
- Default: `MADV_DONTNEED` for predictable immediate RSS drop.
189+
- Optional mode (`CHATTER_RESIDENCY_LAZY=1`): `MADV_FREE` for lower immediate CPU cost.
190+
191+
Large-span unmap threshold:
192+
193+
- if span is dedicated and `length >= 1 MiB` and fully free → `munmap`.
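
The mode switch and the unmap threshold can be sketched as two small decision helpers. The environment variable name and the 1 MiB value come from the text above; the function names and the exact-match test on `"1"` are assumptions made for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define DEMO_UNMAP_THRESHOLD ((size_t)1 << 20)   /* 1 MiB */

/* Returns the madvise() advice selected by CHATTER_RESIDENCY_LAZY. */
int demo_reclaim_advice(void) {
    const char *v = getenv("CHATTER_RESIDENCY_LAZY");
#ifdef MADV_FREE
    if (v != NULL && strcmp(v, "1") == 0) return MADV_FREE;   /* lazy mode */
#else
    (void)v;   /* MADV_FREE not available here: always use the default */
#endif
    return MADV_DONTNEED;                                      /* default */
}

/* True when a span should be returned with munmap() instead of madvise(). */
bool demo_should_unmap(bool dedicated, size_t length, unsigned inuse_objects) {
    assert(length > 0);   /* a span is never zero-length */
    return dedicated && inuse_objects == 0 && length >= DEMO_UNMAP_THRESHOLD;
}
```

Reading the variable once at startup and caching the result would avoid a `getenv` call on every reclaim; the sketch re-reads it only to stay short.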

## 7) Trigger Matrix (Exact Rules)

1. **On free path**
   - If `inuse_objects == 0`:
     - dedicated large span: `munmap`
     - pooled span: `madvise(DONTNEED)` + keep metadata for deterministic remap
2. **On maintenance tick (e.g., every 1s)**
   - if `state == HOT` and idle > `COLD_IDLE_NS` → mark COLD; `madvise(DONTNEED)` only if the span is empty or file-backed
   - if `state == COLD`, idle > `FREE_IDLE_NS`, and empty → release/unmap
3. **On memory pressure signal (optional self-trigger)**
   - immediate pass over oldest cold spans until RSS budget reached.
4. **On access fault/hit to cold span**
   - mark HOT, update `last_touch_ns`; no layout changes.

Suggested initial thresholds:

- `COLD_IDLE_NS = 30s`
- `FREE_IDLE_NS = 300s`
- maintenance interval = `1000ms`
- dedicated unmap threshold = `1 MiB`
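
The tick and access rules, together with the suggested thresholds, can be written as one pure decision function. This is a sketch under the thresholds listed above; the `demo_*` and `DEMO_*` names are hypothetical and the real allocator would act on `span_t` directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NS_PER_SEC   UINT64_C(1000000000)
#define DEMO_COLD_IDLE_NS (30  * DEMO_NS_PER_SEC)   /* COLD_IDLE_NS = 30s  */
#define DEMO_FREE_IDLE_NS (300 * DEMO_NS_PER_SEC)   /* FREE_IDLE_NS = 300s */

static_assert(DEMO_COLD_IDLE_NS < DEMO_FREE_IDLE_NS,
              "a span must be able to turn cold before it is released");

typedef enum { DEMO_HOT, DEMO_COLD, DEMO_FREE } demo_state_t;
typedef enum {
    DEMO_NONE, DEMO_MARK_COLD, DEMO_RELEASE, DEMO_MARK_HOT
} demo_action_t;

/*
 * Decides what one maintenance tick does to one span. The result depends
 * only on the arguments, so identical inputs always give the same action.
 */
demo_action_t demo_tick_action(demo_state_t state, uint64_t idle_ns,
                               uint32_t inuse_objects, bool accessed) {
    if (state == DEMO_COLD && accessed)
        return DEMO_MARK_HOT;                       /* rule 4: access to cold span */
    if (state == DEMO_HOT && idle_ns > DEMO_COLD_IDLE_NS)
        return DEMO_MARK_COLD;                      /* rule 2, first bullet */
    if (state == DEMO_COLD && idle_ns > DEMO_FREE_IDLE_NS && inuse_objects == 0)
        return DEMO_RELEASE;                        /* rule 2, second bullet */
    return DEMO_NONE;
}
```

Separating the decision from the `madvise`/`munmap` side effects makes the trigger matrix checkable with plain unit tests, for example that a hot span idle for 31 seconds yields `DEMO_MARK_COLD`.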
215+
216+
## 8) Observability / Verification
217+
218+
Use `/proc/<pid>/smaps_rollup` and allocator counters.
219+
220+
Track:
221+
222+
- `Rss`
223+
- `Anonymous`
224+
- `File`
225+
- `Private_Dirty`
226+
227+
Expected behavior after rollout:
228+
229+
- virtual size near current baseline
230+
- RSS converges toward active set (~120–220MB under moderate load)
231+
- flatter `Private_Dirty` growth due to aggressive cold-span eviction
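
Reading the tracked fields can be sketched with a small line scanner. It assumes the usual `Key:   <n> kB` line format of `smaps_rollup`; `demo_smaps_field_kb` is a hypothetical helper, not an existing `ssh-chatter` function.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Scans "Key:   <n> kB" lines from an smaps_rollup-style stream and returns
 * the value for `key` in kB, or -1 if the key is not present.
 */
int64_t demo_smaps_field_kb(FILE *f, const char *key) {
    assert(f != NULL && key != NULL);
    char line[256];
    size_t klen = strlen(key);

    rewind(f);                                   /* allow repeated lookups */
    while (fgets(line, sizeof line, f) != NULL) {
        /* exact key match: the key must be followed directly by ':' */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            uint64_t kb;
            if (sscanf(line + klen + 1, "%" SCNu64, &kb) == 1) {
                return (int64_t)kb;
            }
        }
    }
    return -1;
}
```

A typical call opens `/proc/self/smaps_rollup` with `fopen(..., "r")` and queries `"Rss"`, `"Anonymous"`, and `"Private_Dirty"` before and after a maintenance tick; a return of -1 signals that the running kernel does not report that field.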

## 9) Integration Notes for Existing `ssh-chatter` Memory Stack

Existing context/epoch ownership can remain intact. This design only adds a lower span residency layer:

- `ttak_fastalloc`/existing slab path unchanged at API level.
- New hooks on alloc/free update span counters.
- Maintenance thread/tick (already present in runtime loops) invokes `residency_maintenance_tick()`.
- No struct ABI changes for application-visible objects.

This keeps deterministic allocation semantics while enabling active RSS control against long-lived idle memory.
