fix: port three upstream hardening follow-ups to Rust core#3
Merged
Conversation
Mirrors three small, isolated upstream Python commits to the Rust port.
Each fix is gated by regression tests that fail on the pre-fix code.
1. miner Windows hardening (upstream 5488e7b)
- SKIP_FILES gains pnpm-lock.yaml + yarn.lock so JS/TS lockfiles
don't get embedded.
- mine_file() refuses to embed any file that produces more than
MAX_CHUNKS_PER_FILE chunks (500) and prints a concrete skip
message naming the file. Catches the broader class of generated
artifacts (CSV/JSON dumps, build outputs, lockfiles not yet in
SKIP_FILES) that the named-file list cannot fully cover.
Upstream cited Windows ONNX 'bad_alloc' as the symptom this
prevents.
2. diary state backfill (upstream 2ff6283, follow-up to PR #2 fix #3)
- When a pre-Fix-#3 state file has size set but no content_hash,
the size-skip path correctly avoids re-ingesting the diary, but
also writes the missing hash back so subsequent runs use the
strict hash check. Without the backfill, post-upgrade same-size
edits would still slip through the legacy code path forever.
3. exporter per-file symlink check (upstream 7545238, follow-up to
PR #2 fix #2)
- safe_open_for_write() opens room files and index.md with
O_NOFOLLOW on POSIX so the open itself fails ELOOP if a symlink
was swapped in between create-dir and file-open. Closes the
TOCTOU window the directory-level reject_symlink leaves open.
Windows path falls back to a symlink_metadata pre-check.
Tests: 386/386 passing locally (+6 new regression tests).
Out of scope (analyzed, do not apply to Rust):
- 5134a63 SQLite preflight: Rust palace uses HNSW, not chromadb.
- d5ce97c palace lock byte-0 sentinel: Rust uses file-existence locks.
- ef8d83c lock holder + non-zero exit: already implemented.
- 71804c0 / 3a76360 hooks Popen detach + PID guard: Rust hooks emit
JSON only, do not spawn background processes.
ingest_diaries canonicalizes diary_dir internally (resolve_path).
On macOS /tmp -> /private/tmp and on Windows the verbatim UNC form,
both differ from the raw temp.path().join('diaries') used in the
regression test — so state_file_for produced a different SHA256 key
and read_to_string returned ENOENT.
Look the state file up via the same canonical form ingest_diaries
actually used. Linux /tmp didn't symlink, which is why local CI
passed but macOS/Windows didn't.
Test result: 386/386 passing locally on Linux; macOS/Windows
expected to pass on next CI run.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three small, isolated upstream Python fixes ported to the Rust core. All three are defensive/hardening — they sit on top of (or alongside) the four fixes already shipped in PR #2.
5488e7b(upstream)crates/core/src/miner.rsSKIP_FILES, build output) emits thousands of chunks in one batch. Triggers ONNXbad_allocon Windows; otherwise just silently bloats the palace.2ff6283(upstream)crates/core/src/diary_ingest.rssizeonly. The legacy size-skip path correctly avoided a re-ingest, but never wrote the missingcontent_hashback, so post-upgrade same-size edits would slip through the legacy path indefinitely.7545238(upstream)crates/core/src/exporter.rsreject_symlinkcheck (PR #2 Fix #2) and the per-file open. A symlink swapped in during that window would still redirect writes.What changed
1. Miner Windows hardening (
5488e7b)SKIP_FILESgainspnpm-lock.yamlandyarn.lock(the JS/TS lockfiles upstream now skips).MAX_CHUNKS_PER_FILE = 500constant.mine_file()checks the chunk count and refuses to embed any single file that exceeds the cap, printing a concrete skip message naming the file. This catches the broader class of generated artifacts that a named-file list cannot fully cover.2. Diary state backfill (
2ff6283)content_hashin prior state, size unchanged),ingest_diariesnow writes back a state entry with the freshly-computed hash and persists the state file even if no drawer was rewritten. Subsequent runs use the strict hash check.3. Exporter per-file symlink check (
7545238)safe_open_for_write(path, append)helper used for both room files andindex.md. POSIX path usesO_NOFOLLOW(OpenOptions::custom_flags(libc::O_NOFOLLOW)) so the open itself fails withELOOPon a symlinked target. Windows path falls back tosymlink_metadatapre-check (narrower than the no-check baseline).What was NOT ported, and why
These four upstream commits were inspected and intentionally skipped because the underlying mechanism does not exist in the Rust port:
5134a63SQLite integrity preflightembedvec), not chromadb/SQLite — no equivalent failure mode.d5ce97cpalace lock byte-0 sentinelmsvcrt.lockingbyte-range locks. The Windows bugs the sentinel addresses (current-position locking, byte-0 read-block) do not exist here.ef8d83clock holder + non-zero exitMineAlreadyRunninginmine_palace_lock.rscarries the holder PID andcli.rscallsstd::process::exit(1)on contention.71804c0/3a76360hooksPopendetach + per-target PID guardhooks_cli.rsonly emits JSON for the harness; it does not spawn background mining processes. The parent-blocked-on-child symptom and the per-target guard aroundPopenhave no surface here.Review & Testing Checklist for Human
miner.rs).{"size": N}but nocontent_hash, content unchanged on disk, second run should write the hash back to disk while still printing nothing.<output>/<wing>/<room>.mdpointing at e.g./etc/passwd, runmpr export, verify it errors out before writing instead of redirecting.Notes
libcwas already a workspace dep used elsewhere in the crate.O_NOFOLLOWpath usesOpenOptions::custom_flagsrather than the unstableio::ErrorKind::FilesystemLoop(only checksraw_os_error() == Some(libc::ELOOP)).