feat: V3 chunk-splitter alignment + burst-load resilience #49
Open
jagajaga wants to merge 5 commits into
The bridge crashed with "Uncaught (in promise) TypeError: expected AsyncWrap" when handling many filesystem events in rapid succession (e.g. bulk imports, folder renames, or vault wipes via scanOfflineChanges).

Root cause: pouchdb-adapter-http uses node-fetch internally. Deno's Node-compatibility shim for node-fetch occasionally produces socket-handle state that fails AsyncWrap validation when concurrent HTTP requests hit CouchDB. This is an upstream bug in the Deno/Node interop layer.

Without a global unhandled-rejection handler the error terminates the Deno process; Docker's restart policy then brings the bridge back up, but it begins watching "from now" and loses all in-flight filesystem events. The result is CouchDB and disk drifting out of sync: docs present on one side but missing from the other.

Two-part fix:
1. main.ts: install a globalThis.unhandledrejection handler that swallows the known AsyncWrap rejection (and logs any others) without exiting. A surviving process is strictly better than a restarted one for sync convergence: chokidar/scanOfflineChanges will re-deliver missed events.
2. Hub.ts: wrap peer.put/peer.delete in dispatch with bounded exponential backoff (100/200/400/800 ms, up to 4 attempts) for the transient HTTP errors that surface this bug (expected AsyncWrap, socket hang up, ECONNRESET, ETIMEDOUT). Most retries succeed on attempt 2 because the underlying socket is fresh. Non-transient errors still throw immediately.

Reproduces against vanilla bridge by importing ~1200 small markdown files in <1 second via filesystem writes. Patched bridge processes the same burst without restarting.
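The retry policy from part 2 can be sketched roughly as follows. This is an illustration, not the code in Hub.ts: `withRetry`, `isTransient`, and the injectable `sleep` are hypothetical names. The sketch also assumes the four listed delays mean one initial call plus up to four retries, which is only one reading of "up to 4 attempts".

```typescript
// Substrings that mark an error as transient (the list from the commit message).
const TRANSIENT_MARKERS = [
  "expected AsyncWrap",
  "socket hang up",
  "ECONNRESET",
  "ETIMEDOUT",
];

// Hypothetical helper: decides whether an error is worth retrying.
function isTransient(err: unknown): boolean {
  const msg = err instanceof Error ? err.message : String(err);
  return TRANSIENT_MARKERS.some((marker) => msg.includes(marker));
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: runs `op`, retrying transient failures with the given
// delays. Non-transient errors, and the failure after the last delay has been
// used, are rethrown to the caller.
async function withRetry<T>(
  op: () => Promise<T>,
  delaysMs: number[] = [100, 200, 400, 800],
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (!isTransient(err) || attempt >= delaysMs.length) throw err;
      await sleep(delaysMs[attempt]);
    }
  }
}
```

A call site would then look something like `await withRetry(() => peer.put(path, data))`, where the argument shape of `peer.put` is assumed here rather than taken from the bridge.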
Four related fixes prompted by investigating an "A's body got overwritten with
B's content" report. None of these are by themselves a smoking gun for the
reported symptom, but together they harden the bridge against the most plausible
remaining mechanism (HTTP-level response cross-wire from the node-fetch
AsyncWrap bug under burst) and fix three latent issues that surfaced during the
audit.
PeerCouchDB: verify chunk integrity before dispatching.
After getByMeta materializes an entry, recompute each chunk's hash and
confirm it matches the ID stored in meta.children. Catches the case where a
chunk read got cross-wired to another chunk's response payload — without
this, the bridge would write the wrong bytes to disk under doc A's path and
then push A's metadata pointing to chunks whose content no longer hashes to
their ID. On mismatch we log NOTICE and drop the change; the next change for
the same doc will retry after the chunk cache is refreshed.
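The check described above amounts to re-hashing every chunk body and comparing the result with the ID it was fetched under. A minimal sketch, assuming a hypothetical `ChunkMap` shape and a caller-supplied `hashToId` function; the toy FNV-1a hash and the `h:` prefix below are placeholders for illustration and are not the hash LiveSync uses:

```typescript
// Hypothetical shape: chunk ID -> chunk body, as loaded for one entry.
type ChunkMap = Map<string, string>;

// Returns the child IDs whose body is missing or does not hash back to its ID.
function findCorruptChunks(
  children: string[],
  chunks: ChunkMap,
  hashToId: (body: string) => string,
): string[] {
  return children.filter((id) => {
    const body = chunks.get(id);
    return body === undefined || hashToId(body) !== id;
  });
}

// Toy stand-in for the real chunk hash (32-bit FNV-1a), for demonstration only.
function toyHashToId(body: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < body.length; i++) {
    h ^= body.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return "h:" + h.toString(36);
}
```

If the returned list is non-empty, the caller would log the mismatch and drop the change rather than write the entry to disk, as the commit describes.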
PeerCouchDB: persist a real `since` checkpoint.
beginWatch's checkIsInterested side-effected setSetting("since", this.man.since),
but nothing ever advanced this.man.since past the value set at constructor
time. On a watch error the .on("error") handler reconnects after 10 s using
the same stale since, causing a full replay from process start each time. Now
we update this.man.since (and persist it) from change.seq after each entry is
processed, so reconnects resume where we left off.
PeerStorage: fix dedup-cache key mismatch.
put()/delete() keyed isRepeating() by `lp` (baseDir-prefixed) while
dispatch()/dispatchDeleted() used the relative path. With baseDir != "" the
keys never collided, so the LRU never short-circuited the echo of our own
writes. The echo was still caught by isChanged()/the CouchDB content check,
but the cache was effectively dead. Use the global pathSrc on both sides.
Peer: fix compareDate int32 truncation.
`~~(a?.mtime ?? 0 / 1000)` parses as `~~(a?.mtime ?? (0 / 1000))` because `/`
binds tighter than `??`, so the division never touches mtime and `~~` truncates
the raw millisecond value to int32. 2026-era ms timestamps (about 1.77e12) far
exceed 2^31, so the value wraps and the comparison becomes effectively random,
breaking the "same-content within an hour, skip the write" optimization in
PeerCouchDB.put. Return the delta in whole seconds via
Math.floor((mtime ?? 0) / 1000), matching the caller's 3600-second threshold.
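The truncation is easy to reproduce in isolation. The sketch below is not the bridge's compareDate; `Stat`, `buggySeconds`, `wholeSeconds`, and `deltaSeconds` are hypothetical names used only to contrast the two expressions:

```typescript
type Stat = { mtime?: number }; // mtime in milliseconds since the epoch

// Buggy form: `/` binds tighter than `??`, so this is ~~(mtime ?? (0 / 1000)).
// ~~ then wraps the millisecond value into a signed 32-bit integer.
const buggySeconds = (s?: Stat): number => ~~(s?.mtime ?? 0 / 1000);

// Fixed form: default first, then divide, then floor to whole seconds.
const wholeSeconds = (s?: Stat): number => Math.floor((s?.mtime ?? 0) / 1000);

// Hypothetical comparator returning the difference in whole seconds.
function deltaSeconds(a?: Stat, b?: Stat): number {
  return wholeSeconds(a) - wholeSeconds(b);
}

const earlier: Stat = { mtime: Date.UTC(2026, 0, 1) };
const later: Stat = { mtime: Date.UTC(2026, 0, 1) + 200000000 }; // about 2.3 days on

// The later timestamp wraps to a negative int32, so the order is inverted.
console.log(buggySeconds(later) < buggySeconds(earlier)); // true
console.log(deltaSeconds(later, earlier)); // 200000
```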
Verified against a local couchdb: chunk poisoning is detected and the bridge
refuses the write rather than corrupting /vault. Normal sync (storage→couchdb
and couchdb→storage) is unaffected.
Two production-only fixes that the previous commit's `since` advance depends on
to actually work in the user's Docker setup.
main.ts: circuit breaker for the persistent AsyncWrap watch loop.
Observed in production: pouchdb-adapter-http's changes-feed retry chain (the
lib's .on("error") → setTimeout(10s) → beginWatch) hits AsyncWrap on every
single reconnect attempt, indefinitely. The previous patch's unhandled-rejection
swallow keeps the process alive, but the watch never recovers in-process —
the node-fetch socket pool stays broken. Only a fresh Deno process gets a clean
state.
Count AsyncWrap rejections in a 5-minute sliding window and exit when the
threshold (30) is reached. That's roughly "errors firing every 10s for 5
minutes," which is the signature of the broken-loop state and unambiguously
not a transient burst. Docker's restart policy brings us back clean; the
since checkpoint from the previous commit ensures we resume mid-stream
instead of restarting the watch from "now".
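A sliding-window counter of this kind can be sketched as below. The class and function names are hypothetical and the clock is passed in so the logic can be exercised without waiting; the defaults mirror the 30-in-5-minutes threshold from the commit message.

```typescript
// Counts events inside a sliding time window and reports when the threshold
// is reached. Timestamps are in milliseconds.
class SlidingWindowBreaker {
  private hits: number[] = [];
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold = 30, windowMs = 5 * 60 * 1000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one event at `now`; returns true once the window holds at least
  // `threshold` events.
  record(now: number = Date.now()): boolean {
    this.hits.push(now);
    const cutoff = now - this.windowMs;
    while (this.hits.length > 0 && this.hits[0] <= cutoff) this.hits.shift();
    return this.hits.length >= this.threshold;
  }
}

// Hypothetical classifier for the rejection this breaker is meant to count.
function isAsyncWrapRejection(reason: unknown): boolean {
  const msg = reason instanceof Error ? reason.message : String(reason);
  return msg.includes("expected AsyncWrap");
}

// Assumed wiring in a Deno entry point (exit code 1 is an arbitrary choice):
//
//   const breaker = new SlidingWindowBreaker();
//   globalThis.addEventListener("unhandledrejection", (e) => {
//     e.preventDefault(); // keep the process alive for isolated rejections
//     if (isAsyncWrapRejection(e.reason) && breaker.record()) Deno.exit(1);
//   });
```

With one rejection every 10 s, the 30th event lands at 290 s and all 30 are still inside the window, so the breaker trips; sparse errors age out before the count can build up.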
PeerCouchDB: persist `since` and `remote-created` to a JSON file in dat/.
The since fix from the previous commit wrote to localStorage, which under the
user's compose lives at /deno-dir/location_data/<hash>/local_storage —
inside the container's ephemeral fs. Every Docker restart wipes it, so the
checkpoint never survived across the kind of process exit the new circuit
breaker triggers. The /app/dat volume IS mounted (the bridge-state named
volume), so we shadow the two checkpoints that matter for resume correctness
into dat/state-<peer-name>.json there. localStorage is still written as a
legacy shadow but is no longer authoritative.
`remote-created` had to come along because without it, the start() path
("Remote database looks like rebuilt. fetch from the first again.") fires on
every restart with a wiped localStorage and resets since=0 — undoing the
file-based checkpoint. Persisting both fixes that.
Also: localStorage reads/writes in start() are now wrapped (tryGetSetting /
trySetSetting) so a broken/partially-wiped backing store can't crash the
peer before beginWatch is even reached. State writes use a trailing-edge
500ms debounce so a burst of changes turns into one volume write per window,
not one per change.
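The write coalescing can be sketched as follows. `DebouncedWriter` is a hypothetical name, and the sketch makes one assumption worth flagging: it starts a single timer per window and does not restart it on later calls, which is one reading of "one volume write per window". A classic debounce would instead restart the timer on every call.

```typescript
// Coalesces a burst of state updates into one write at the end of the window.
class DebouncedWriter<T> {
  private timer: ReturnType<typeof setTimeout> | undefined;
  private pending: { value: T } | undefined;
  private readonly write: (value: T) => void;
  private readonly delayMs: number;

  constructor(write: (value: T) => void, delayMs = 500) {
    this.write = write;
    this.delayMs = delayMs;
  }

  // Remembers the latest value; only the first call in a window arms the timer.
  schedule(value: T): void {
    this.pending = { value };
    if (this.timer === undefined) {
      this.timer = setTimeout(() => this.flush(), this.delayMs);
    }
  }

  // Writes the latest pending value now, e.g. right before the process exits.
  flush(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
    if (this.pending === undefined) return;
    const { value } = this.pending;
    this.pending = undefined;
    this.write(value);
  }
}

// Assumed usage; the state shape and file handling are illustrative only:
//
//   const writer = new DebouncedWriter<{ since: string }>((state) =>
//     Deno.writeTextFileSync(statePath, JSON.stringify(state))
//   );
//   writer.schedule({ since: String(change.seq) });
```

Calling `flush()` before a deliberate exit, such as the one the circuit breaker triggers, would avoid losing the last pending checkpoint; whether the bridge does this is not stated in the commit.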
Verified locally: state-server.json is written on each change, survives a
process restart with localStorage fully wiped, and the bridge resumes with
"Watch starting from <persisted seq>" instead of "looks like rebuilt".
The bridge runs as a CouchDB<->filesystem relay in Docker and never uses the trystero P2P transport. It was previously stripped at image-build time via `sed -i '/"trystero":/d' deno.jsonc` in the Dockerfile; move that into the repo so a plain `deno install` / `deno task run` works without the GitHub-hosted trystero import (which also avoids a needless network dependency during install).
Point the commonlib submodule at the V3 Rabin-Karp backport (jagajaga/livesync-commonlib@b354ef5, splitPiecesRabinKarp from vrtmrz/livesync-commonlib@6abcea69 on top of the pinned 0.25.25). This makes the chunks the bridge writes back to CouchDB use the same content-defined boundaries as LiveSync V3 (Fine Deduplication) clients, so write-back deduplicates against chunks produced by v0.25.65+. Verified against live data: the new splitter reproduces stored chunk IDs exactly (in order) for real post-V3 notes, and the bridge already reads/decrypts V3 chunks correctly with the pinned library. Note for upstream: the submodule URL is temporarily repointed to a personal fork because the splitter change is a single-function backport onto the pinned commonlib. Maintainers may prefer to land the equivalent splitPiecesRabinKarp change in vrtmrz/livesync-commonlib and repoint here.
Summary
Makes `livesync-bridge` work cleanly as a CouchDB↔filesystem relay against LiveSync V3 (Fine Deduplication) vaults, and hardens it for the burst / high-volume workloads that exposed several failure modes.
This supersedes #48 (adds the V3 write-path alignment + trystero removal on top
of those hardening commits).
What I found about "V3 decryption failures"
Many users (myself included) saw a storm of `OperationError: Decryption failed` from `SubtleCrypto.decrypt` after enabling V3 on a client. I investigated against a live ~78k-doc vault and found:

- `%=` HKDF (E2EE V2) envelope as before; the pinned lib decrypts them fine. Verified three ways: 30/30 sampled chunks decrypt, note reassembly succeeds, and the bridge itself wrote 1326 files (markdown + binary, incl. post-V3 docs) with zero decrypt/corruption errors.
- issue during the V3 migration window, not a missing-format bug. Once the `_local/...sync_parameters` salt is settled, decryption is clean.

So this PR does not claim a decryption fix: reading already works.
What actually needed fixing for V3: the write path
When the bridge writes chunks back to CouchDB it used the old Rabin-Karp
splitter, producing boundaries that don't deduplicate against chunks written by
V3 clients. This PR bumps the commonlib submodule to a single-function backport
of `splitPiecesRabinKarp` from vrtmrz/livesync-commonlib@6abcea69 ("fixed: fine
deduplication now correctly fine") onto the pinned 0.25.25.
Validated against live data: the backported splitter reproduces stored chunk
IDs exactly (in order) for 8/8 real post-V3 notes, and correctly does not
match older V2 notes.
Changes
- `expected AsyncWrap` `node-fetch` rejections under burst load; a sliding-window circuit breaker exits for a clean Docker restart if the changes-watch becomes permanently wedged.
- persisted `since` checkpoint + `compareDate` conflict handling.
- trystero import removed (previously stripped via a Dockerfile `sed`).

Submodule note for maintainers
The splitter lives in `livesync-commonlib`. Upstream commonlib has moved ~191
commits ahead (service-locator refactor) since the bridge's pinned 0.25.25, so a
straight submodule bump breaks the bridge's API. I backported just the one
function and temporarily repointed the submodule URL to a personal fork
(`jagajaga/livesync-commonlib@v3-rabin-karp-backport`). You will likely prefer to
land the equivalent `splitPiecesRabinKarp` change in `vrtmrz/livesync-commonlib`
and repoint here. The upstream bridge has been dormant since 2025-09-17 while
commonlib moved on, which is why this is a targeted backport rather than a full
bump.
Test plan
- `deno install` + `deno run -A main.ts` boots with no import errors
- `Decryption failed` / 0 `Corrupted document`
- `deno check lib/src/string_and_binary/chunks.ts` adds no new type errors
- `git clone --recursive` resolves the submodule + V3 splitter