feat: V3 chunk-splitter alignment + burst-load resilience #49

Open
jagajaga wants to merge 5 commits into vrtmrz:main from jagajaga:feat/v3-chunk-splitter

jagajaga commented May 31, 2026

Summary

Makes livesync-bridge work cleanly as a CouchDB↔filesystem relay against
LiveSync V3 (Fine Deduplication) vaults, and hardens it for the burst /
high-volume workloads that exposed several failure modes.

This supersedes #48 (adds the V3 write-path alignment + trystero removal on top
of those hardening commits).

What I found about "V3 decryption failures"

Many users (myself included) saw a storm of OperationError: Decryption failed
from SubtleCrypto.decrypt after enabling V3 on a client. I investigated against
a live ~78k-doc vault and found:

  • V3 reading already works with the pinned commonlib. V3 chunks use the same
    %= HKDF (E2EE V2) envelope as before; the pinned lib decrypts them fine.
    Verified three ways: 30/30 sampled chunks decrypt, note reassembly succeeds, and
    the bridge itself wrote 1326 files (markdown + binary, incl. post-V3 docs) with
    zero decrypt/corruption errors.
  • The decryption storm appears to be a transient salt-negotiation / stale-state
    issue during the V3 migration window, not a missing-format bug. Once the
    _local/...sync_parameters salt is settled, decryption is clean.

So this PR does not claim a decryption fix — reading already works.

What actually needed fixing for V3: the write path

When the bridge writes chunks back to CouchDB it used the old Rabin-Karp
splitter, producing boundaries that don't deduplicate against chunks written by
V3 clients. This PR bumps the commonlib submodule to a single-function backport
of splitPiecesRabinKarp from vrtmrz/livesync-commonlib@6abcea69 ("fixed: fine
deduplication now correctly fine") onto the pinned 0.25.25.
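
To illustrate why splitter alignment matters for deduplication, here is a minimal sketch of a generic rolling-hash (Rabin-Karp style) content-defined splitter. It is not the commonlib `splitPiecesRabinKarp` implementation: the function name, the option names, and every constant are assumptions chosen for illustration only.

```typescript
// Illustrative only: NOT commonlib's splitPiecesRabinKarp.
// All names and constants here are hypothetical.
interface SplitOptions {
  windowSize: number; // code units covered by the rolling hash
  minSize: number; // never cut a chunk shorter than this
  maxSize: number; // always cut once a chunk reaches this length
  mask: number; // cut when (hash & mask) === 0; controls average size
}

const BASE = 31; // hypothetical odd multiplier for the polynomial hash

function splitContentDefined(text: string, opts: SplitOptions): string[] {
  const { windowSize, minSize, maxSize, mask } = opts;

  // BASE^(windowSize - 1) mod 2^32, used to remove the outgoing code unit.
  let highPow = 1;
  for (let i = 1; i < windowSize; i++) highPow = Math.imul(highPow, BASE);

  const chunks: string[] = [];
  let hash = 0;
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    if (i >= windowSize) {
      // Slide the window: subtract the code unit that just left it.
      hash -= Math.imul(text.charCodeAt(i - windowSize), highPow);
    }
    // Multiply by BASE and add the incoming code unit, all modulo 2^32.
    hash = (Math.imul(hash, BASE) + text.charCodeAt(i)) >>> 0;

    const length = i - start + 1;
    const atBoundary = length >= minSize && (hash & mask) === 0;
    if (atBoundary || length >= maxSize) {
      chunks.push(text.slice(start, i + 1));
      start = i + 1;
    }
  }
  if (start < text.length) chunks.push(text.slice(start));
  return chunks;
}
```

Because a cut depends on the content of the trailing window rather than on a fixed offset, two writers only produce identical chunks (and therefore identical chunk IDs) when they agree on the hash, the window, and the size limits. That is the property the backport is meant to restore between the bridge and V3 clients.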

Validated against live data: the backported splitter reproduces stored chunk
IDs exactly (in order) for 8/8 real post-V3 notes, and correctly does not
match older V2 notes.

Changes

  • main.ts: survive the "expected AsyncWrap" rejections that
    pouchdb-adapter-http's node-fetch raises under burst load; a sliding-window
    circuit breaker exits for a clean Docker restart if the changes-watch
    becomes permanently wedged.
  • PeerCouchDB.ts: data integrity, covering chunk hash verification, dedup, a
    persisted since checkpoint, and compareDate conflict handling.
  • deno.jsonc: drop the trystero P2P dependency (the relay never uses it;
    previously stripped via a Dockerfile sed).
  • .gitmodules / lib: bump commonlib to the V3 Rabin-Karp backport.

Submodule note for maintainers

The splitter lives in livesync-commonlib. Upstream commonlib has moved ~191
commits ahead (service-locator refactor) since the bridge's pinned 0.25.25, so a
straight submodule bump breaks the bridge's API. I backported just the one
function and temporarily repointed the submodule URL to a personal fork
(jagajaga/livesync-commonlib@v3-rabin-karp-backport). You will likely prefer to
land the equivalent splitPiecesRabinKarp change in vrtmrz/livesync-commonlib
and repoint here. The upstream bridge has been dormant since 2025-09-17 while
commonlib moved on, which is why this is a targeted backport rather than a full
bump.

Test plan

  • deno install + deno run -A main.ts boots with no import errors
  • Reads a live V3 vault: 1326 files written, 0 Decryption failed / 0
    Corrupted document
  • Backported splitter reproduces stored chunk IDs 8/8 on real post-V3 notes
  • deno check lib/src/string_and_binary/chunks.ts adds no new type errors
  • Fresh git clone --recursive resolves the submodule + V3 splitter

jagajaga and others added 5 commits May 26, 2026 04:25
The bridge crashed with "Uncaught (in promise) TypeError: expected AsyncWrap"
when handling many filesystem events in rapid succession (e.g. bulk imports,
folder renames, or vault wipes via scanOfflineChanges).

Root cause: pouchdb-adapter-http uses node-fetch internally. Deno's
Node-compatibility shim for node-fetch occasionally produces socket-handle
state that fails AsyncWrap validation when concurrent HTTP requests hit
CouchDB. This is an upstream bug in the Deno/Node interop layer.

Without a global unhandled-rejection handler the error terminates the Deno
process; Docker's restart policy then brings the bridge back up, but it
begins watching "from now" and loses all in-flight filesystem events. The
result is CouchDB and disk drifting out of sync — docs present on one side
but missing from the other.

Two-part fix:

1. main.ts: install a globalThis.unhandledrejection handler that swallows
   the known AsyncWrap rejection (and logs any others) without exiting. A
   surviving process is strictly better than a restarted one for sync
   convergence — chokidar/scanOfflineChanges will re-deliver missed events.

2. Hub.ts: wrap peer.put/peer.delete in dispatch with bounded exponential
   backoff (100/200/400/800 ms, up to 4 attempts) for the transient HTTP
   errors that surface this bug (expected AsyncWrap, socket hang up,
   ECONNRESET, ETIMEDOUT). Most retries succeed on attempt 2 because the
   underlying socket is fresh. Non-transient errors still throw immediately.
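
The retry policy in step 2 can be sketched as follows. This is a minimal illustration, not the code in Hub.ts: `isTransientError`, `backoffDelayMs`, and `withRetry` are hypothetical names, and the sketch assumes that "up to 4 attempts" means four retries after the first try, which is what the four listed delays suggest.

```typescript
// Error-message substrings treated as transient, taken from the list above.
const TRANSIENT_MARKERS = [
  "expected AsyncWrap",
  "socket hang up",
  "ECONNRESET",
  "ETIMEDOUT",
];

function isTransientError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return TRANSIENT_MARKERS.some((marker) => message.includes(marker));
}

// Delay before retry number `retry` (0-based): 100, 200, 400, 800 ms.
function backoffDelayMs(retry: number, baseMs = 100): number {
  return baseMs * 2 ** retry;
}

async function withRetry<T>(
  op: () => Promise<T>,
  maxRetries = 4, // assumption: four retries after the initial attempt
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let retry = 0; ; retry++) {
    try {
      return await op();
    } catch (err) {
      // Non-transient errors and an exhausted budget propagate immediately.
      if (!isTransientError(err) || retry >= maxRetries) throw err;
      await sleep(backoffDelayMs(retry));
    }
  }
}
```

A call site would look like `await withRetry(() => peer.put(path, data))`, where `peer.put` stands for whatever operation dispatch wraps. Injecting `sleep` keeps the policy testable without real timers.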

Reproducible against the unpatched bridge by importing ~1200 small markdown
files in under one second via filesystem writes. The patched bridge processes
the same burst without restarting.

Four related fixes prompted by investigating an "A's body got overwritten with
B's content" report. None of them is by itself a smoking gun for the reported
symptom, but together they harden the bridge against the most plausible
remaining mechanism (HTTP-level response cross-wire from the node-fetch
AsyncWrap bug under burst) and fix three latent issues that surfaced during the
audit.

PeerCouchDB: verify chunk integrity before dispatching.
  After getByMeta materializes an entry, recompute each chunk's hash and
  confirm it matches the ID stored in meta.children. Catches the case where a
  chunk read got cross-wired to another chunk's response payload — without
  this, the bridge would write the wrong bytes to disk under doc A's path and
  then push A's metadata pointing to chunks whose content no longer hashes to
  their ID. On mismatch we log NOTICE and drop the change; the next change for
  the same doc will retry after the chunk cache is refreshed.
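
The check described above amounts to "recompute and compare". A minimal sketch follows, with loud assumptions: `standInHash` is an FNV-1a placeholder and is not LiveSync's chunk-ID hash, and `findMismatchedChunks` is a hypothetical helper, not the function added to PeerCouchDB.

```typescript
// Placeholder hash (FNV-1a, 32-bit). LiveSync derives chunk IDs differently;
// this stand-in only makes the sketch runnable.
function standInHash(data: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < data.length; i++) {
    h ^= data.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16);
}

// Returns the child IDs whose loaded content is missing or does not hash back
// to the ID recorded in the metadata's `children` list.
function findMismatchedChunks(
  children: string[],
  loaded: Map<string, string>, // chunk ID -> chunk content as read from the DB
  hash: (data: string) => string,
): string[] {
  return children.filter((id) => {
    const data = loaded.get(id);
    return data === undefined || hash(data) !== id;
  });
}
```

A caller would drop the change and log when the returned list is non-empty, and dispatch to disk only when it is empty.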

PeerCouchDB: persist a real `since` checkpoint.
  beginWatch's checkIsInterested side-effected setSetting("since", this.man.since),
  but nothing ever advanced this.man.since past the value set at constructor
  time. On a watch error the .on("error") handler reconnects after 10 s using
  the same stale since, causing a full replay from process start each time. Now
  we update this.man.since (and persist it) from change.seq after each entry is
  processed, so reconnects resume where we left off.
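
The essence of the fix is that the checkpoint advances only after an entry has been handled, and that reconnects read the advanced value. A minimal sketch under assumptions: `SinceCursor` and its methods are hypothetical names, and `persist` stands in for whatever storage call the bridge uses.

```typescript
// CouchDB sequence values are opaque; treat them as string or number.
type Seq = string | number;

class SinceCursor {
  private since: Seq;
  private readonly persist: (seq: Seq) => void;

  constructor(initial: Seq, persist: (seq: Seq) => void) {
    this.since = initial;
    this.persist = persist;
  }

  // Call only after the change carrying `seq` has been fully processed.
  markProcessed(seq: Seq): void {
    this.since = seq;
    this.persist(seq);
  }

  // Value to pass as `since` when opening or reopening the changes feed.
  resumeFrom(): Seq {
    return this.since;
  }
}
```

With the old behavior, the equivalent of `resumeFrom()` kept returning the constructor-time value, which is why every reconnect replayed from process start.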

PeerStorage: fix dedup-cache key mismatch.
  put()/delete() keyed isRepeating() by `lp` (baseDir-prefixed) while
  dispatch()/dispatchDeleted() used the relative path. With baseDir != "" the
  keys never collided, so the LRU never short-circuited the echo of our own
  writes. The echo was still caught by isChanged()/the CouchDB content check,
  but the cache was effectively dead. Use the global pathSrc on both sides.
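
The mismatch is easy to reproduce with a toy cache. The sketch below is an illustration under assumptions: `RecentWrites` is a hypothetical stand-in for the bridge's dedup cache, and the paths are invented examples.

```typescript
// Tiny LRU-style cache: remembers the last signature seen per key.
class RecentWrites {
  private readonly seen = new Map<string, string>();
  private readonly capacity: number;

  constructor(capacity = 100) {
    this.capacity = capacity;
  }

  // Returns true when `signature` matches the last one recorded for `key`,
  // then records it as the most recent entry.
  isRepeating(key: string, signature: string): boolean {
    const repeated = this.seen.get(key) === signature;
    this.seen.delete(key);
    this.seen.set(key, signature);
    if (this.seen.size > this.capacity) {
      // Map preserves insertion order, so the first key is the oldest.
      const oldest = this.seen.keys().next().value;
      if (oldest !== undefined) this.seen.delete(oldest);
    }
    return repeated;
  }
}
```

If the write side records `"vault/notes/a.md"` while the event side asks about `"notes/a.md"`, `isRepeating` never returns true and the echo is not short-circuited. Using one key form on both sides makes the second call return true.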

Peer: fix compareDate int32 truncation.
  `~~(a?.mtime ?? 0 / 1000)` parses as `~~(a?.mtime ?? (0 / 1000))` because `/`
  binds tighter than `??`, so mtime is never divided and `~~` coerces the raw
  millisecond value to int32. 2026-era ms timestamps far exceed 2^31, so the
  value wraps and the comparison becomes effectively random, breaking the
  "same-content within an hour, skip the write" optimization in
  PeerCouchDB.put. Return the delta in whole seconds via
  Math.floor((mtime ?? 0) / 1000), matching the caller's 3600-second threshold.
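
The two expressions can be compared side by side. This is a sketch: `buggySeconds`, `fixedSeconds`, and `compareDateSeconds` are hypothetical names used for illustration, not the actual functions in Peer.

```typescript
// Original expression: `/` binds tighter than `??`, so only the fallback 0 is
// divided, and `~~` then coerces the millisecond value to a 32-bit integer.
function buggySeconds(mtime?: number): number {
  return ~~(mtime ?? 0 / 1000);
}

// Fixed expression: apply the fallback first, then convert ms to whole seconds.
function fixedSeconds(mtime?: number): number {
  return Math.floor((mtime ?? 0) / 1000);
}

// Hypothetical comparator returning the delta in whole seconds.
function compareDateSeconds(
  a?: { mtime?: number },
  b?: { mtime?: number },
): number {
  return fixedSeconds(a?.mtime) - fixedSeconds(b?.mtime);
}
```

For a small input such as 5000 ms, the buggy form returns 5000 rather than 5, which shows that the division is never applied to mtime. For a present-day millisecond timestamp the int32 coercion additionally discards the high bits, so the result is no longer a usable time value.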

Verified against a local couchdb: chunk poisoning is detected and the bridge
refuses the write rather than corrupting /vault. Normal sync (storage→couchdb
and couchdb→storage) is unaffected.

Two production-only fixes that the previous commit's `since` advance depends on
to actually work in the user's Docker setup.

main.ts: circuit breaker for the persistent AsyncWrap watch loop.

Observed in production: pouchdb-adapter-http's changes-feed retry chain (the
lib's .on("error") → setTimeout(10s) → beginWatch) hits AsyncWrap on every
single reconnect attempt, indefinitely. The previous patch's unhandled-rejection
swallow keeps the process alive, but the watch never recovers in-process —
the node-fetch socket pool stays broken. Only a fresh Deno process gets a clean
state.

Count AsyncWrap rejections in a 5-minute sliding window and exit when the
threshold (30) is reached. That's roughly "errors firing every 10s for 5
minutes," which is the signature of the broken-loop state and unambiguously
not a transient burst. Docker's restart policy brings us back clean; the
since checkpoint from the previous commit ensures we resume mid-stream
instead of starting over from "now".
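
The counting logic can be sketched independently of the runtime. `SlidingWindowBreaker` is a hypothetical name; the defaults mirror the 5-minute window and threshold of 30 described above, and the caller is assumed to exit the process when `record` returns true.

```typescript
class SlidingWindowBreaker {
  private readonly timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs = 5 * 60 * 1000, threshold = 30) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Records one failure at time `now` (milliseconds).
  // Returns true once the failures inside the window reach the threshold.
  record(now: number): boolean {
    this.timestamps.push(now);
    const cutoff = now - this.windowMs;
    // Drop failures that have fallen out of the window.
    while (true) {
      const oldest = this.timestamps[0];
      if (oldest === undefined || oldest > cutoff) break;
      this.timestamps.shift();
    }
    return this.timestamps.length >= this.threshold;
  }
}
```

In the bridge this would be fed from the unhandled-rejection handler, for example `if (breaker.record(Date.now())) Deno.exit(1)`. Passing the clock value in explicitly keeps the class deterministic under test.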

PeerCouchDB: persist `since` and `remote-created` to a JSON file in dat/.

The since fix from the previous commit wrote to localStorage, which under the
user's compose lives at /deno-dir/location_data/<hash>/local_storage —
inside the container's ephemeral fs. Every Docker restart wipes it, so the
checkpoint never survived across the kind of process exit the new circuit
breaker triggers. The /app/dat volume IS mounted (the bridge-state named
volume), so we shadow the two checkpoints that matter for resume correctness
into dat/state-<peer-name>.json there. localStorage is still written as a
legacy shadow but is no longer authoritative.

`remote-created` had to come along because without it, the start() path
("Remote database looks like rebuilt. fetch from the first again.") fires on
every restart with a wiped localStorage and resets since=0 — undoing the
file-based checkpoint. Persisting both fixes that.

Also: localStorage reads/writes in start() are now wrapped (tryGetSetting /
trySetSetting) so a broken/partially-wiped backing store can't crash the
peer before beginWatch is even reached. State writes use a trailing-edge
500ms debounce so a burst of changes turns into one volume write per window,
not one per change.
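
A trailing-edge debounce of this kind can be sketched as below. The names `Timers` and `debounceTrailing` are hypothetical, and the sketch assumes the standard variant in which every new value restarts the delay; the commit may implement the window differently.

```typescript
// Minimal timer abstraction so the debounce can be driven by fake timers.
interface Timers {
  set(fn: () => void, ms: number): number;
  clear(id: number): void;
}

// Returns a function that forwards only the latest value to `save`,
// `delayMs` after the most recent call.
function debounceTrailing<T>(
  save: (latest: T) => void,
  delayMs: number,
  timers: Timers,
): (value: T) => void {
  let pending: number | undefined;
  return (value: T) => {
    // A newer value supersedes any write that has not fired yet.
    if (pending !== undefined) timers.clear(pending);
    pending = timers.set(() => {
      pending = undefined;
      save(value);
    }, delayMs);
  };
}
```

In production the `timers` argument would wrap `setTimeout` and `clearTimeout`, and `save` would serialize the checkpoint to the state file, for example with `Deno.writeTextFile`. A burst of changes then costs one write with the newest `since` value.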

Verified locally: state-server.json is written on each change, survives a
process restart with localStorage fully wiped, and the bridge resumes with
"Watch starting from <persisted seq>" instead of "looks like rebuilt".

The bridge runs as a CouchDB<->filesystem relay in Docker and never uses
the trystero P2P transport. It was previously stripped at image-build
time via `sed -i '/"trystero":/d' deno.jsonc` in the Dockerfile; move
that into the repo so a plain `deno install` / `deno task run` works
without the GitHub-hosted trystero import (which also avoids a needless
network dependency during install).

Point the commonlib submodule at the V3 Rabin-Karp backport
(jagajaga/livesync-commonlib@b354ef5, splitPiecesRabinKarp from
vrtmrz/livesync-commonlib@6abcea69 on top of the pinned 0.25.25).

This makes the chunks the bridge writes back to CouchDB use the same
content-defined boundaries as LiveSync V3 (Fine Deduplication) clients,
so write-back deduplicates against chunks produced by v0.25.65+.
Verified against live data: the new splitter reproduces stored chunk
IDs exactly (in order) for real post-V3 notes, and the bridge already
reads/decrypts V3 chunks correctly with the pinned library.

Note for upstream: the submodule URL is temporarily repointed to a
personal fork because the splitter change is a single-function backport
onto the pinned commonlib. Maintainers may prefer to land the
equivalent splitPiecesRabinKarp change in vrtmrz/livesync-commonlib and
repoint here.