| 1 | +# Allocator–OS Memory Residency Control Plan |
| 2 | + |
| 3 | +This document defines a low-level memory residency strategy for `ssh-chatter` without changing object layouts, message/feed limits, or allocator determinism. |
| 4 | + |
| 5 | +## 1) Span Layer at Allocator–OS Boundary |
| 6 | + |
| 7 | +Add a **span manager** below existing slab/arena logic. |
| 8 | + |
| 9 | +- **span**: page-aligned contiguous memory chunk (multiple pages, fixed span class sizes). |
| 10 | +- **slabs** are carved from spans for fixed-size allocations. |
| 11 | +- allocator metadata tracks span state, not object layout. |
| 12 | + |
| 13 | +### Span metadata (out-of-band) |
| 14 | + |
| 15 | +```c |
| 16 | +typedef enum { |
| 17 | + SPAN_HOT = 0, |
| 18 | + SPAN_COLD = 1, |
| 19 | + SPAN_FREE = 2, |
| 20 | + SPAN_MAPPED_FILE = 3 |
| 21 | +} span_state_t; |
| 22 | + |
| 23 | +typedef struct span { |
| 24 | + void *base; // page-aligned start |
| 25 | + size_t length; // bytes, multiple of page size |
| 26 | + uint32_t page_count; |
| 27 | + uint32_t inuse_objects; // live allocs in span |
| 28 | + uint32_t free_objects; |
| 29 | + uint64_t last_touch_ns; // monotonic timestamp |
| 30 | + uint32_t reclaim_epoch; |
| 31 | + bool no_hugepage_set; |
| 32 | + bool file_backed; |
| 33 | + int backing_fd; // -1 for anonymous |
| 34 | + off_t backing_off; |
| 35 | + span_state_t state; |
| 36 | + struct span *next; |
| 37 | +} span_t; |
| 38 | +``` |
| 39 | + |
| 40 | +Rules: |
| 41 | + |
| 42 | +1. `base` must be aligned to system page size. |
| 43 | +2. `length` must be a multiple of page size. |
| 44 | +3. metadata is allocated separately so existing app structs remain unchanged. |
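
The rules above can be checked mechanically. The following is a minimal sketch, assuming a Linux/POSIX target; `sys_page_size`, `round_up_to_page`, and `span_map_anonymous` are hypothetical helper names, and `is_page_aligned` / `is_page_multiple` are one possible definition of the predicates the pseudocode in this plan refers to, not code taken from `ssh-chatter`.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Query the system page size once per call; fall back to 4096 if sysconf fails. */
static size_t sys_page_size(void) {
    long ps = sysconf(_SC_PAGESIZE);
    return ps > 0 ? (size_t)ps : 4096;
}

/* Rule 1: the span base must sit on a page boundary. */
static bool is_page_aligned(const void *p) {
    return ((uintptr_t)p % sys_page_size()) == 0;
}

/* Rule 2: the span length must be a non-zero multiple of the page size. */
static bool is_page_multiple(size_t len) {
    return len != 0 && (len % sys_page_size()) == 0;
}

/* Round a requested byte count up to the next page multiple. */
static size_t round_up_to_page(size_t len) {
    size_t ps = sys_page_size();
    return (len + ps - 1) / ps * ps;
}

/* Map an anonymous span. mmap returns page-aligned memory, so only the
 * length needs rounding. Returns NULL on failure. */
static void *span_map_anonymous(size_t requested, size_t *out_length) {
    assert(out_length != NULL);
    size_t length = round_up_to_page(requested);
    void *base = mmap(NULL, length, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;
    *out_length = length;
    return base;
}
```

A caller would store the returned pointer and rounded length in out-of-band metadata (`span_t.base` and `span_t.length`), which keeps rule 3 intact because nothing is written into the span itself.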
| 45 | + |
| 46 | +## 2) Page-Span Tracking + Reclamation |
| 47 | + |
| 48 | +Track per-span occupancy and temperature. |
| 49 | + |
| 50 | +- On each alloc/free from slab: update `inuse_objects`, `free_objects`, `last_touch_ns`. |
| 51 | +- Reclamation only happens at span granularity (never partial object compaction). |
| 52 | + |
| 53 | +### Reclamation primitives |
| 54 | + |
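| 55 | +- **Anonymous fully free spans:** `madvise(base, length, MADV_DONTNEED)`. The kernel discards the contents, so this must not be applied to an anonymous span that still holds live objects. |
| 56 | +- **Large dedicated allocations:** `munmap(base, length)` when fully free and not pooled. |
| 57 | +- **Optional lazy reclaim mode:** `MADV_FREE` (Linux 4.5+), which defers reclamation until memory pressure; cheaper per call, but RSS may not drop immediately. |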
| 58 | + |
| 59 | +### Core reclaim logic (pseudocode) |
| 60 | + |
| 61 | +```c |
| 62 | +void span_on_object_free(span_t *s) { |
| 63 | + s->inuse_objects--; |
| 64 | + s->free_objects++; |
| 65 | + s->last_touch_ns = now_monotonic_ns(); |
| 66 | + |
| 67 | + if (s->inuse_objects == 0) { |
| 68 | + s->state = SPAN_FREE; |
| 69 | + maybe_release_span(s, REASON_FULLY_FREE); |
| 70 | + } |
| 71 | +} |
| 72 | + |
| 73 | +void maybe_release_span(span_t *s, int reason) { |
| 74 | + if (!is_page_aligned(s->base) || !is_page_multiple(s->length)) return; |
| 75 | + |
| 76 | + if (s->file_backed) { |
| 77 | + // persistent content: keep mapping, let page cache reclaim naturally |
| 78 | + if (reason == REASON_COLD || reason == REASON_FULLY_FREE) { |
| 79 | + madvise(s->base, s->length, MADV_DONTNEED); // MAP_SHARED: unmaps resident pages from this process; data stays in the file/page cache |
| 80 | + } |
| 81 | + return; |
| 82 | + } |
| 83 | + |
| 84 | + if (is_large_dedicated_span(s) && s->inuse_objects == 0) { |
| 85 | + munmap(s->base, s->length); |
| 86 | + s->state = SPAN_FREE; |
| 87 | + remove_from_active_span_sets(s); |
| 88 | + enqueue_virtual_hole_for_remap(s->length); |
| 89 | + return; |
| 90 | + } |
| 91 | + |
| 92 | + if (s->inuse_objects == 0) { |
| 93 | + madvise(s->base, s->length, MADV_DONTNEED); // safe: no live objects to lose |
| 94 | + s->state = SPAN_FREE; |
| 95 | + } // anonymous spans with live objects must not get MADV_DONTNEED: their pages would refault as zeros |
| 96 | +} |
| 97 | +``` |
| 98 | + |
| 99 | +## 3) Hot/Cold Arena Segmentation |
| 100 | + |
| 101 | +For large text/message arenas, split into fixed segments (each segment == one or more spans). |
| 102 | + |
| 103 | +- **hot segment set**: currently active window (recent chat, active sessions). |
| 104 | +- **cold segment set**: old history / infrequently read blocks. |
| 105 | + |
| 106 | +Segment policy: |
| 107 | + |
| 108 | +1. New writes go to current hot segment. |
| 109 | +2. Segment becomes cold after idle timeout and write cursor rotation. |
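| 110 | +3. Cold segments that still hold live data should be file-backed (preferred for persistence). `MADV_DONTNEED` on an anonymous segment is safe only once the segment holds no live data, because the kernel discards its contents. |
| 111 | +4. Accessing a cold segment is legal; page faults repopulate file-backed segments from the file cache. A discarded anonymous segment refaults as zero-filled pages, so nothing may still reference its old contents. |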
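
The zero-fill behaviour in item 4 is easy to observe directly. The sketch below assumes Linux semantics for `MADV_DONTNEED` on a private anonymous mapping; the function name `anon_dontneed_discards_contents` is hypothetical and exists only for this illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page reads back as zeros after MADV_DONTNEED,
 * 0 if the old contents survived, -1 on a setup error. */
static int anon_dontneed_discards_contents(void) {
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t len = (size_t)ps;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    memset(p, 0xAB, len);                      /* dirty the page */
    if (madvise(p, len, MADV_DONTNEED) != 0) { /* ask the kernel to drop it */
        munmap(p, len);
        return -1;
    }

    /* The next access faults in a fresh zero page; 0xAB is gone. */
    int zeroed = (p[0] == 0 && p[len - 1] == 0);
    munmap(p, len);
    return zeroed;
}
```

This is why live cold data belongs in a file-backed segment. On Linux 5.4 and later, `MADV_COLD` and `MADV_PAGEOUT` are non-destructive hints that may also be worth evaluating, although reclaiming anonymous pages that way depends on swap being available.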
| 112 | + |
| 113 | +### Temperature scan loop (pseudocode) |
| 114 | + |
| 115 | +```c |
| 116 | +void residency_maintenance_tick(void) { |
| 117 | + uint64_t now = now_monotonic_ns(); |
| 118 | + |
| 119 | + for_each_span(s) { |
| 120 | + uint64_t idle_ns = (now > s->last_touch_ns) ? now - s->last_touch_ns : 0; // clamp: a touch after `now` was sampled must not underflow |
| 121 | + |
| 122 | + if (s->state == SPAN_HOT && idle_ns > COLD_IDLE_NS) { |
| 123 | + s->state = SPAN_COLD; |
| 124 | + maybe_release_span(s, REASON_COLD); |
| 125 | + } |
| 126 | + |
| 127 | + if (s->state == SPAN_COLD && idle_ns > FREE_IDLE_NS && s->inuse_objects == 0) { |
| 128 | + maybe_release_span(s, REASON_FULLY_FREE); |
| 129 | + } |
| 130 | + } |
| 131 | +} |
| 132 | +``` |
| 133 | + |
| 134 | +Determinism is preserved because state transitions are threshold-based and run from a fixed periodic maintenance tick. |
| 135 | + |
| 136 | +## 4) File-Backed Mapping for Persistent Cold Data |
| 137 | + |
| 138 | +Move persistent, rarely written bulk storage (BBS posts, archived messages, feed history) to mmap-backed segment files. |
| 139 | + |
| 140 | +- `fd = open(..., O_RDWR|O_CREAT, mode)` (`O_CREAT` requires the third `mode` argument) |
| 141 | +- `ftruncate(fd, segment_size)` |
| 142 | +- `ptr = mmap(NULL, segment_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, off)` |
| 143 | + |
| 144 | +Keep RAM-resident: |
| 145 | + |
| 146 | +- indexes |
| 147 | +- small recent-write window |
| 148 | +- segment directory metadata |
| 149 | + |
| 150 | +When cold: |
| 151 | + |
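| 152 | +- `madvise(ptr, segment_size, MADV_DONTNEED)` to drop the segment's resident pages from this process. |
| 153 | +- data remains in the file: dirty `MAP_SHARED` pages stay in the page cache until the kernel writes them back (call `msync` first if durability is required at that point), and reload is page-cache driven. |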
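
The map, write, and drop sequence can be exercised end to end. This is a minimal sketch under stated assumptions: it uses `mkstemp` with a hypothetical `/tmp` path instead of the `open(..., O_RDWR|O_CREAT)` call above so that it is self-contained, and the function name `file_segment_survives_dontneed` is invented for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if data written through a MAP_SHARED mapping is still readable
 * after MADV_DONTNEED, 0 if it was lost, -1 on a setup error. */
static int file_segment_survives_dontneed(void) {
    char path[] = "/tmp/chatter-seg-XXXXXX";   /* hypothetical segment file */
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);                               /* file lives until fd closes */

    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t segment_size = 4 * (size_t)ps;
    int result = -1;

    if (ftruncate(fd, (off_t)segment_size) == 0) {
        char *ptr = mmap(NULL, segment_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (ptr != MAP_FAILED) {
            static const char msg[] = "archived post";
            memcpy(ptr, msg, sizeof msg);

            /* msync forces writeback; MADV_DONTNEED then drops the resident
             * pages. The next read faults the data back in from the file. */
            if (msync(ptr, segment_size, MS_SYNC) == 0 &&
                madvise(ptr, segment_size, MADV_DONTNEED) == 0) {
                result = (memcmp(ptr, msg, sizeof msg) == 0);
            }
            munmap(ptr, segment_size);
        }
    }
    close(fd);
    return result;
}
```

Unlike the anonymous case, the contents survive because the file, not the process's page tables, is the source of truth.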
| 154 | + |
| 155 | +## 5) Prevent False Retention |
| 156 | + |
| 157 | +To avoid spans staying resident due to allocator internals: |
| 158 | + |
| 159 | +1. **Bound per-size-class free lists** by span count, not object count. |
| 160 | +2. Prefer freeing objects back into the same span until it becomes fully free. |
| 161 | +3. Disable cross-span stealing when a span is near-empty (to create reclaimable empty spans). |
| 162 | +4. Add optional **arena rotation**: new allocations avoid nearly-empty old spans once a fresh hot span exists. |
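
One way to realize rules 3 and 4 is to steer new allocations toward the fullest span that still has room, so sparse spans drain instead of being refilled. The sketch below is an assumption about how such a policy could look; `span_occupancy_t` and `pick_allocation_span` are hypothetical names, and a real allocator would keep this ordering incrementally rather than scanning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical occupancy view of a span: just the two counters from span_t. */
typedef struct {
    uint32_t inuse_objects;
    uint32_t free_objects;
} span_occupancy_t;

/* Return the index of the span with the most live objects that still has a
 * free slot, or -1 if every span is full. Ties go to the lowest index, so
 * the choice is deterministic for a given set of counters. */
static int pick_allocation_span(const span_occupancy_t *spans, int count) {
    assert(spans != NULL || count == 0);
    int best = -1;
    for (int i = 0; i < count; i++) {
        if (spans[i].free_objects == 0) continue;   /* no room here */
        if (best < 0 || spans[i].inuse_objects > spans[best].inuse_objects) {
            best = i;
        }
    }
    return best;
}
```

With spans holding 1, 40, and 64 live objects (the last one full), the function selects the span with 40, leaving the nearly empty span free to drain to zero and become reclaimable.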
| 163 | + |
| 164 | +### Free-list trimming rule |
| 165 | + |
| 166 | +```c |
| 167 | +if (size_class_cache.spans_idle > IDLE_SPAN_LIMIT) { |
| 168 | + span_t *victim = pick_oldest_idle_span(size_class); |
| 169 | + if (victim != NULL && victim->inuse_objects == 0) { |
| 170 | + maybe_release_span(victim, REASON_FULLY_FREE); |
| 171 | + unlink_from_size_class_cache(victim); |
| 172 | + } |
| 173 | +} |
| 174 | +``` |
| 175 | + |
| 176 | +## 6) Kernel Interaction Settings |
| 177 | + |
| 178 | +Apply mapping hints immediately after span creation: |
| 179 | + |
| 180 | +```c |
| 181 | +void span_post_map_init(span_t *s) { |
| 182 | + if (madvise(s->base, s->length, MADV_NOHUGEPAGE) == 0) s->no_hugepage_set = true; // opt out of THP so a 2 MiB huge page cannot keep cold 4 KiB pages resident |
| 183 | +} |
| 184 | +``` |
| 185 | + |
| 186 | +Reclaim mode selection: |
| 187 | + |
| 188 | +- Default: `MADV_DONTNEED` for predictable immediate RSS drop. |
| 189 | +- Optional mode (`CHATTER_RESIDENCY_LAZY=1`): `MADV_FREE` for lower immediate CPU cost. |
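
The mode switch could be resolved once at startup. This sketch assumes the environment variable is read with `getenv` and that only the exact value `1` enables lazy mode; `residency_reclaim_advice` is a hypothetical helper name.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Pick the madvise advice used for fully free anonymous spans.
 * Falls back to MADV_DONTNEED when lazy mode is off or MADV_FREE is not
 * available in the headers being compiled against. */
static int residency_reclaim_advice(void) {
    const char *lazy = getenv("CHATTER_RESIDENCY_LAZY");
#ifdef MADV_FREE
    if (lazy != NULL && strcmp(lazy, "1") == 0) {
        return MADV_FREE;       /* lazy: pages reclaimed under pressure */
    }
#else
    (void)lazy;                 /* lazy mode unsupported on this platform */
#endif
    return MADV_DONTNEED;       /* default: immediate RSS drop */
}
```

Caching the result in a static at startup would keep the free path from calling `getenv` repeatedly and keeps the mode fixed for the life of the process.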
| 190 | + |
| 191 | +Large-span unmap threshold: |
| 192 | + |
| 193 | +- if span is dedicated and `length >= 1 MiB` and fully free → `munmap`. |
| 194 | + |
| 195 | +## 7) Trigger Matrix (Exact Rules) |
| 196 | + |
| 197 | +1. **On free path** |
| 198 | + - If `inuse_objects == 0`: |
| 199 | + - dedicated large span: `munmap` |
| 200 | + - pooled span: `madvise(DONTNEED)` + keep metadata for deterministic remap |
| 201 | +2. **On maintenance tick (e.g., every 1s)** |
| 202 | + - if `state == HOT` and idle > `COLD_IDLE_NS` → mark COLD; `madvise(DONTNEED)` only if the span is file-backed or holds no live objects (anonymous contents would be discarded) |
| 203 | + - if `state == COLD`, idle > `FREE_IDLE_NS`, and empty → release/unmap |
| 204 | +3. **On memory pressure signal (optional self-trigger)** |
| 205 | + - immediate pass over oldest cold spans until RSS budget reached. |
| 206 | +4. **On access fault/hit to cold span** |
| 207 | + - mark HOT, update `last_touch_ns`; no layout changes. |
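
The free-path row of the matrix reduces to a small pure decision function, which makes it straightforward to unit test separately from the system calls. The enum `free_action_t`, the function `free_path_action`, and the constant name `DEDICATED_UNMAP_BYTES` are hypothetical; only the 1 MiB value and the rule itself come from this plan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEDICATED_UNMAP_BYTES ((size_t)1 << 20)   /* 1 MiB threshold */

typedef enum {
    FREE_ACT_NONE,              /* span still has live objects */
    FREE_ACT_MADVISE_DONTNEED,  /* pooled span: drop pages, keep mapping */
    FREE_ACT_MUNMAP             /* dedicated large span: return the range */
} free_action_t;

/* Decide what the free path should do once the counters are updated. */
static free_action_t free_path_action(uint32_t inuse_objects,
                                      bool dedicated, size_t length) {
    if (inuse_objects != 0) return FREE_ACT_NONE;
    if (dedicated && length >= DEDICATED_UNMAP_BYTES) return FREE_ACT_MUNMAP;
    return FREE_ACT_MADVISE_DONTNEED;
}
```

Separating the decision from the action keeps the trigger rules auditable: the caller performs `munmap` or `madvise` based on the returned value and nothing else.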
| 208 | + |
| 209 | +Suggested initial thresholds: |
| 210 | + |
| 211 | +- `COLD_IDLE_NS = 30s` |
| 212 | +- `FREE_IDLE_NS = 300s` |
| 213 | +- maintenance interval = `1000ms` |
| 214 | +- dedicated unmap threshold = `1 MiB` |
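
Expressed in code, the time thresholds could look like the sketch below. It assumes `CLOCK_MONOTONIC` as the time source behind `now_monotonic_ns()`; `NS_PER_SEC` and `idle_exceeds` are hypothetical names added for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NS_PER_SEC   1000000000ULL
#define COLD_IDLE_NS (30ULL * NS_PER_SEC)    /* 30s  */
#define FREE_IDLE_NS (300ULL * NS_PER_SEC)   /* 300s */

/* Monotonic clock in nanoseconds; unaffected by wall-clock adjustments. */
static uint64_t now_monotonic_ns(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * NS_PER_SEC + (uint64_t)ts.tv_nsec;
}

/* True when the span has been idle for strictly longer than limit_ns.
 * A last_touch_ns at or after now_ns counts as zero idle time, which
 * avoids unsigned underflow if the span was touched after now_ns was read. */
static bool idle_exceeds(uint64_t now_ns, uint64_t last_touch_ns,
                         uint64_t limit_ns) {
    return now_ns > last_touch_ns && (now_ns - last_touch_ns) > limit_ns;
}
```

Keeping the comparison in one helper means the hot-to-cold and cold-to-free checks share the same underflow handling.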
| 215 | + |
| 216 | +## 8) Observability / Verification |
| 217 | + |
| 218 | +Use `/proc/<pid>/smaps_rollup` and allocator counters. |
| 219 | + |
| 220 | +Track: |
| 221 | + |
| 222 | +- `Rss` |
| 223 | +- `Anonymous` |
| 224 | +- `Pss_File` |
| 225 | +- `Private_Dirty` |
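
These counters can be sampled from inside the process for logging or tests. The sketch below assumes the usual `Key:   N kB` line format of `smaps_rollup`; `smaps_field_kb` and `read_smaps_rollup` are hypothetical helper names, and the parser is kept separate from the file read so it can be tested on canned text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the kB value of the line "key:   N kB" in text, or -1 if the key
 * is absent. The key must match a whole field name at the start of a line. */
static long smaps_field_kb(const char *text, const char *key) {
    assert(text != NULL && key != NULL);
    size_t klen = strlen(key);
    const char *line = text;
    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;
            return sscanf(line + klen + 1, "%ld", &kb) == 1 ? kb : -1;
        }
        line = strchr(line, '\n');      /* advance to the next line */
        if (line != NULL) line++;
    }
    return -1;
}

/* Read /proc/self/smaps_rollup into buf (NUL-terminated).
 * Returns the number of bytes read, or -1 if the file is unavailable. */
static long read_smaps_rollup(char *buf, size_t cap) {
    assert(buf != NULL && cap > 0);
    FILE *f = fopen("/proc/self/smaps_rollup", "r");
    if (f == NULL) return -1;
    size_t n = fread(buf, 1, cap - 1, f);
    fclose(f);
    buf[n] = '\0';
    return (long)n;
}
```

A maintenance tick could call `read_smaps_rollup` into a stack buffer and log `smaps_field_kb(buf, "Rss")` next to the allocator's own span counters to confirm that reclaimed spans actually leave RSS.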
| 226 | + |
| 227 | +Expected behavior after rollout: |
| 228 | + |
| 229 | +- virtual size near current baseline |
| 230 | +- RSS converges toward active set (~120–220MB under moderate load) |
| 231 | +- flatter `Private_Dirty` growth due to aggressive cold-span eviction |
| 232 | + |
| 233 | +## 9) Integration Notes for Existing `ssh-chatter` Memory Stack |
| 234 | + |
| 235 | +Existing context/epoch ownership can remain intact. This design only adds a lower span residency layer: |
| 236 | + |
| 237 | +- `ttak_fastalloc`/existing slab path unchanged at API level. |
| 238 | +- New hooks on alloc/free update span counters. |
| 239 | +- Maintenance thread/tick (already present in runtime loops) invokes `residency_maintenance_tick()`. |
| 240 | +- No struct ABI changes for application-visible objects. |
| 241 | + |
| 242 | +This keeps deterministic allocation semantics while enabling active RSS control against long-lived idle memory. |