Add recursive directory synchronization (AI one-shot)#9

Draft
messa wants to merge 5 commits into main from claude/plan-recursive-sync-uO2UZ

Conversation

@messa
Owner

@messa messa commented Jan 26, 2026

This is a completely AI one-shotted design plan + implementation.

There was only one intervention - to steer it away from completely removing the block-checksum-based incremental copy :)


Summary

This PR adds comprehensive design documentation for implementing recursive directory synchronization (--recursive flag) in blockcopy, along with a minor flake8 configuration adjustment.

Changes

Documentation

  • add_recursive_plan.md: New 558-line design document detailing the complete implementation plan for recursive directory synchronization
    • Overview of current blockcopy functionality and proposed usage
    • Detailed architecture with three phases: file list exchange, file transfer, and metadata application
    • Binary protocol extensions with new commands (list, file, dirs, link, ends, getc, mkdr, syml, dele)
    • Step-by-step synchronization flow with Python pseudocode
    • New CLI arguments for --recursive, --delete, --exclude, --include, and --dry-run
    • Comprehensive test plan covering unit tests, integration tests, and edge cases
    • Implementation roadmap across 5 stages
    • Discussion of backward compatibility, version detection, and risk mitigation

Configuration

  • .flake8: Increased max-complexity threshold from 70 to 80 to accommodate the more complex recursive implementation

Implementation Details

The design document establishes:

  • Protocol changes: New binary commands for directory listing, file metadata, and deletion operations
  • Three-stage synchronization: List exchange → change detection → atomic application
  • Fast path optimization: Quick comparison using mtime/size before full checksum validation
  • Special case handling: Empty directories, symlinks, hardlinks, sparse files, and permission preservation
  • Comprehensive testing strategy: 30+ test cases covering normal operations, edge cases, and error conditions
  • Phased rollout: 5 implementation stages from basic infrastructure to advanced filtering

This document serves as the specification for the recursive feature implementation and will guide development efforts.
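The fast-path comparison described in the summary above can be sketched as follows. This is only an illustration: `FileEntry` and `needs_transfer_check` are hypothetical names, and the PR's actual data classes may carry different fields.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FileEntry:
    # Hypothetical entry shape; the PR's actual data classes may differ.
    path: str
    size: int
    mtime_ns: int

def needs_transfer_check(src: FileEntry, dst: Optional[FileEntry]) -> bool:
    # Fast path from the design: compare size and mtime first, and only
    # fall through to block checksums when they differ or dst is missing.
    if dst is None:
        return True
    return src.size != dst.size or src.mtime_ns != dst.mtime_ns
```

Files that pass this quick check can skip checksum validation entirely, which is what makes the fast path cheap for large unchanged trees.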

Design document covering:
- Protocol extensions for directory listing (list, file, dirs, link, ends, getc)
- Protocol extensions for file transfer (mkdr, syml, dele)
- Detailed workflow for checksum/retrieve/save stages
- New CLI options (--recursive, --delete, --exclude, --include, --dry-run)
- Edge case handling (symlinks, empty dirs, special files, permissions)
- Comprehensive test plan (unit tests, integration tests, edge cases)
- Implementation roadmap in 5 stages

Add support for synchronizing entire directory trees with --recursive flag:

Features:
- Scan and compare directory structures on source and destination
- Transfer only changed files (based on size/mtime comparison)
- Create missing directories on destination
- Preserve symlinks (copy link target, not content)
- Support --delete flag to remove extra files on destination
- Full metadata preservation support (permissions, timestamps, owner/group)
- LZMA compression support for transferred data

New protocol commands:
- list/ends: Directory listing exchange
- file/dirs/link: Entry types in listing
- FILE/fend: File data transfer markers
- mkdr/syml: Create directories and symlinks
- dele: Delete files/directories (with --delete)

Implementation uses unidirectional protocol compatible with pipeline:
  blockcopy checksum -r /dst | ssh src blockcopy retrieve -r /src | blockcopy save -r /dst

Tests cover: empty dirs, nested structures, symlinks, large files,
incremental updates, deletion, special characters in filenames.
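Based on the framing visible in the reviewed code (4-byte ASCII command tags like `mkdr` and `syml`, 2-byte big-endian length prefixes, UTF-8 strings), reading one command tag plus a length-prefixed field could be sketched like this; the helper names are hypothetical, not the PR's actual API:

```python
import io

def read_exact(stream, n):
    # Read exactly n bytes or fail; pipes and sockets may return short reads.
    data = stream.read(n)
    if data is None or len(data) != n:
        raise EOFError(f'Expected {n} bytes, got {len(data or b"")}')
    return data

def read_command(stream):
    # Commands are fixed 4-byte ASCII tags: list, file, dirs, link, ends, ...
    return read_exact(stream, 4)

def read_string(stream):
    # Strings (paths, symlink targets, owner/group names) are framed as a
    # 2-byte big-endian length followed by that many UTF-8 bytes.
    length = int.from_bytes(read_exact(stream, 2), 'big')
    return read_exact(stream, length).decode('utf-8')
```

For example, parsing `b'mkdr' + len(path).to_bytes(2, 'big') + path` yields the command tag and the relative path, mirroring the writer side quoted in the review threads below.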

Changed the recursive mode protocol to use block-level checksums
instead of transferring entire files. Now checksum sends SHA3-512
hashes for each 128KB block of every file, and retrieve compares
them to send only changed blocks.

This brings the recursive mode in line with the original single-file
mode's efficiency - only modified portions of files are transferred.

Added test_recursive_block_level_diff to verify the behavior.
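The per-block hashing this commit describes can be sketched as follows; the block size and hash algorithm come from the commit message, while the function name is hypothetical:

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # 128 KB blocks, per the commit message

def iter_block_hashes(path):
    # Yield a SHA3-512 digest for each fixed-size block of the file.
    # The retrieve side compares these against its own hashes to decide
    # which blocks actually need to be transferred.
    with open(path, 'rb') as f:
        while block := f.read(BLOCK_SIZE):
            yield hashlib.sha3_512(block).digest()
```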
Copilot AI review requested due to automatic review settings January 26, 2026 14:15
@messa messa marked this pull request as draft January 26, 2026 14:16
@messa messa changed the title from "Add recursive directory synchronization design document" to "Add recursive directory synchronization (AI one-shot)" Jan 26, 2026

Copilot AI left a comment


Pull request overview

This PR adds comprehensive design documentation and implementation for recursive directory synchronization to blockcopy, enabling it to sync entire directory trees similar to rsync -r.

Changes:

  • Added 558-line design document (in Czech) detailing the recursive sync architecture, protocol extensions, and implementation roadmap
  • Implemented recursive directory synchronization with three new functions: do_checksum_recursive, do_retrieve_recursive, and do_save_recursive
  • Added comprehensive test suite with 15 integration tests covering basic operations, incremental sync, symlinks, deletion, and block-level diff
  • Bumped version from 0.0.2 to 0.0.3 and increased flake8 max-complexity threshold

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 15 comments.

Summary per file:

  • add_recursive_plan.md: Design document describing the recursive sync architecture, binary protocol extensions, and implementation plan
  • blockcopy.py: Core implementation adding data classes for entries, directory scanning, serialization functions, and three recursive mode functions
  • tests/test_recursive.py: Comprehensive test suite with 15 tests covering recursive synchronization scenarios
  • uv.lock: Version bump from 0.0.2 to 0.0.3
  • .flake8: Increased max-complexity from 70 to 80 to accommodate the recursive implementation


Comment thread blockcopy.py
Comment on lines +1227 to +1231
        except OSError as e:
            logger.warning('Failed to read file %s for hashing: %s', file_path, e)

        # End of file hashes
        output_stream.write(b'fend')

Copilot AI Jan 26, 2026


If a file read fails during hashing (line 1227), the code logs a warning but continues to send 'fend' for that file. This could cause the retrieve side to receive a file entry with no hashes, which may not be handled correctly. Consider either skipping the file entirely (not sending the file entry) or sending an error marker so the retrieve side can handle the missing file appropriately.
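The "skip the file entirely" option the reviewer suggests could look roughly like this: buffer all block hashes before emitting anything, so the peer never sees a file entry with missing hashes. This is a self-contained sketch, not the PR's actual framing (the real entry record carries path and metadata fields omitted here):

```python
import hashlib
import logging

logger = logging.getLogger('blockcopy')
BLOCK_SIZE = 128 * 1024  # assumed block size, per the commit message

def emit_file_hashes(file_path, output_stream):
    # Read all block hashes up front; announce the file only if reading
    # succeeded. On any OSError the file is skipped and nothing is sent.
    try:
        with open(file_path, 'rb') as f:
            hashes = []
            while block := f.read(BLOCK_SIZE):
                hashes.append(hashlib.sha3_512(block).digest())
    except OSError as e:
        logger.warning('Skipping unreadable file %s: %s', file_path, e)
        return False
    for h in hashes:
        output_stream.write(h)
    output_stream.write(b'fend')
    return True
```

The trade-off is memory: hashes for very large files are held in RAM (64 bytes per 128 KB block), which is usually acceptable.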

Comment thread blockcopy.py
Comment on lines +1598 to +1600
        # Open file for writing (create if doesn't exist)
        if full_path.exists():
            f = open(full_path, 'r+b')

Copilot AI Jan 26, 2026


When receiving a file (FILE command), the code doesn't handle the case where a directory or symlink already exists at that path. The code tries to open() the path (line 1600), which will fail if it's a directory. The symlink handler (line 1569-1570) correctly removes existing files/symlinks before creating a new symlink, but the FILE handler should do the same to handle type conflicts (e.g., when a directory on the destination needs to be replaced with a file from the source).

Suggested change
        # Open file for writing (create if doesn't exist)
        if full_path.exists():
            f = open(full_path, 'r+b')
        # Open file for writing (create if doesn't exist).
        # If a non-regular file (e.g. directory or symlink) exists at this path,
        # remove it first so that type conflicts are handled correctly.
        if full_path.exists():
            if full_path.is_file():
                f = open(full_path, 'r+b')
            else:
                # Remove existing non-file (directory, symlink, etc.)
                if full_path.is_dir() and not full_path.is_symlink():
                    # Recursively delete directory contents
                    for p in sorted(full_path.rglob('*'), reverse=True):
                        if p.is_file() or p.is_symlink():
                            p.unlink()
                        else:
                            p.rmdir()
                    full_path.rmdir()
                else:
                    full_path.unlink()
                f = open(full_path, 'wb')

Comment thread blockcopy.py
Comment on lines +1329 to +1367
    # Send directories to create (sorted by path to create parents first)
    for path in sorted(src_dirs_dict.keys()):
        if path not in dst_dirs:
            entry = src_dirs_dict[path]
            output_stream.write(b'mkdr')
            path_bytes = path.encode('utf-8')
            owner_bytes = entry.owner_name.encode('utf-8')
            group_bytes = entry.group_name.encode('utf-8')
            output_stream.write(len(path_bytes).to_bytes(2, 'big'))
            output_stream.write(path_bytes)
            output_stream.write(entry.mtime_ns.to_bytes(8, 'big', signed=True))
            output_stream.write(entry.mode.to_bytes(4, 'big'))
            output_stream.write(entry.uid.to_bytes(4, 'big'))
            output_stream.write(entry.gid.to_bytes(4, 'big'))
            output_stream.write(len(owner_bytes).to_bytes(2, 'big'))
            output_stream.write(owner_bytes)
            output_stream.write(len(group_bytes).to_bytes(2, 'big'))
            output_stream.write(group_bytes)
            logger.debug('Will create directory: %s', path)

    # Send symlinks
    for path, entry in sorted(src_symlinks_dict.items()):
        dst_entry = dst_symlinks.get(path)
        if dst_entry is None or dst_entry.target != entry.target:
            output_stream.write(b'syml')
            path_bytes = path.encode('utf-8')
            target_bytes = entry.target.encode('utf-8')
            output_stream.write(len(path_bytes).to_bytes(2, 'big'))
            output_stream.write(path_bytes)
            output_stream.write(len(target_bytes).to_bytes(2, 'big'))
            output_stream.write(target_bytes)
            output_stream.write(entry.mtime_ns.to_bytes(8, 'big', signed=True))
            output_stream.write(entry.uid.to_bytes(4, 'big'))
            output_stream.write(entry.gid.to_bytes(4, 'big'))
            logger.debug('Will create/update symlink: %s -> %s', path, entry.target)

    output_stream.flush()

    # Process files with block-level diff

Copilot AI Jan 26, 2026


The retrieve logic doesn't handle type conflicts. For example, if a path is a directory on the destination but a file on the source, the code will send a FILE command but won't send a deletion command for the directory first. Similarly, if a path is a file on destination but a directory on source, it will send mkdr but won't delete the file first. The code should detect these conflicts and send appropriate dele commands before sending the new entry type.
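One way to detect such conflicts before emitting commands, assuming each side's scan produces a mapping from relative path to an entry kind (names and shapes hypothetical, not the PR's actual data model):

```python
def find_type_conflicts(src_kinds, dst_kinds):
    # src_kinds/dst_kinds: mapping of relative path -> 'file' | 'dir' | 'link'.
    # Paths present on both sides with different kinds need a 'dele' command
    # for the destination entry before the new entry type is sent.
    return sorted(
        path
        for path, kind in src_kinds.items()
        if path in dst_kinds and dst_kinds[path] != kind
    )
```

The retrieve side could run this check once after the list exchange and emit `dele` commands for every conflicting path before any `mkdr`, `syml`, or FILE commands.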

Comment thread blockcopy.py
Comment on lines +1575 to +1577
        symlink(target, full_path)
        logger.debug('Created symlink: %s -> %s', full_path, target)


Copilot AI Jan 26, 2026


The symlink handling creates symlinks with arbitrary target paths received from the network without validation. A malicious source could send symlinks pointing to sensitive files (e.g., '../../../etc/passwd') which could lead to security issues if the symlinks are later followed. Consider adding validation to detect and warn about or reject symlinks that point outside the destination directory tree.

Suggested change
        symlink(target, full_path)
        logger.debug('Created symlink: %s -> %s', full_path, target)
        # Validate symlink target to avoid pointing outside destination tree
        base_dir = dir_path.resolve()
        parent_dir = full_path.parent.resolve()
        from pathlib import Path  # Path is likely already imported elsewhere; repeated import is harmless
        if Path(target).is_absolute():
            effective_target = Path(target).resolve(strict=False)
        else:
            effective_target = (parent_dir / target).resolve(strict=False)
        if not effective_target.is_relative_to(base_dir):
            logger.warning(
                'Refusing to create symlink %s -> %s: target %s is outside destination tree %s',
                full_path,
                target,
                effective_target,
                base_dir,
            )
        else:
            symlink(target, full_path)
            logger.debug('Created symlink: %s -> %s', full_path, target)

Comment thread blockcopy.py
Comment on lines +1438 to +1448
    stat_result = file_path.stat()
    atime_ns = stat_result.st_atime_ns
    mtime_ns = stat_result.st_mtime_ns
    mode = stat_result.st_mode
    uid = stat_result.st_uid
    gid = stat_result.st_gid
    owner_name = get_user_name_by_uid(uid)
    group_name = get_group_name_by_gid(gid)

    with open(file_path, 'rb') as f:
        total_size = f.seek(0, SEEK_END)

Copilot AI Jan 26, 2026


The metadata is read after writing the file blocks, which could result in a race condition. If the file is modified between when the blocks are read (lines 1391-1430) and when the metadata is read (line 1438), the metadata (particularly size and mtime) may not match the actual data that was sent. Consider reading the metadata before processing the file blocks, or using the file descriptor from the already-open file handle.
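The reviewer's second option, taking metadata from the already-open handle, could be sketched as below. Using `os.fstat` on the open descriptor ties the metadata to the same inode whose blocks were read (it does not eliminate all modification races, but avoids the path-based `stat()` following a rename or replacement). The function name and returned dict shape are illustrative assumptions:

```python
import os

def metadata_from_handle(f):
    # fstat on the open descriptor describes the same inode whose blocks
    # were just read, unlike a fresh path-based stat() which may hit a
    # file that was replaced in the meantime.
    st = os.fstat(f.fileno())
    return {
        'size': st.st_size,
        'atime_ns': st.st_atime_ns,
        'mtime_ns': st.st_mtime_ns,
        'mode': st.st_mode,
        'uid': st.st_uid,
        'gid': st.st_gid,
    }
```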

Comment thread blockcopy.py
            full_path.unlink()
            logger.debug('Deleted symlink: %s', full_path)
        elif entry_type == ENTRY_DIR:
            if full_path.is_dir():

Copilot AI Jan 26, 2026


The deletion logic for directories should check for symlinks first before checking is_dir(). A symlink to a directory will return True for both is_symlink() and is_dir(), which could lead to unexpected behavior. The code should handle the case where a directory entry in the delete list is actually a symlink to a directory on the destination, to avoid following and deleting the linked directory's contents.

Suggested change
        if full_path.is_dir():
        if full_path.is_symlink():
            # Avoid following symlinks to directories; delete the link itself.
            full_path.unlink()
            logger.debug('Deleted symlink to directory: %s', full_path)
        elif full_path.is_dir():

Comment thread blockcopy.py
rel_path = input_stream.read(path_len).decode('utf-8')
target_len = int.from_bytes(input_stream.read(2), 'big')
target = input_stream.read(target_len).decode('utf-8')
mtime_ns = int.from_bytes(input_stream.read(8), 'big', signed=True)

Copilot AI Jan 26, 2026


This assignment to 'mtime_ns' is unnecessary as it is redefined before this value is used.

Suggested change
mtime_ns = int.from_bytes(input_stream.read(8), 'big', signed=True)
input_stream.read(8) # mtime_ns, ignored for symlinks

Comment thread blockcopy.py

        # Open file for writing (create if doesn't exist)
        if full_path.exists():
            f = open(full_path, 'r+b')

Copilot AI Jan 26, 2026


File may not be closed if this operation raises an exception.

Comment thread blockcopy.py
        if full_path.exists():
            f = open(full_path, 'r+b')
        else:
            f = open(full_path, 'wb')

Copilot AI Jan 26, 2026


File may not be closed if this operation raises an exception.

Comment thread add_recursive_plan.md
Comment on lines +246 to +271
def do_save_recursive(dir_path, input_stream):
    while True:
        cmd = read_command()

        if cmd == 'file':
            path = read_path()
            file_path = dir_path / path
            file_path.parent.mkdir(parents=True, exist_ok=True)
            receive_and_write_file(file_path)

        elif cmd == 'mkdr':
            path, meta = read_mkdir_data()
            (dir_path / path).mkdir(parents=True, exist_ok=True)
            apply_metadata(dir_path / path, meta)

        elif cmd == 'syml':
            path, target, meta = read_symlink_data()
            (dir_path / path).symlink_to(target)
            apply_metadata(dir_path / path, meta)

        elif cmd == 'dele':
            path, type = read_delete_data()
            if type == 'file':
                (dir_path / path).unlink()
            elif type == 'dir':
                shutil.rmtree(dir_path / path)

Copilot AI Jan 26, 2026


The do_save_recursive design currently trusts path values received over the wire and directly uses dir_path / path for file creation, directory creation, symlink creation, and deletion without constraining them to stay under the destination root. A malicious or compromised peer running retrieve (or any process that can speak the protocol) can send paths containing .. segments or absolute paths and cause save to create, overwrite, or delete files and directories outside the intended destination tree (for example, modifying ~/.ssh/authorized_keys or /etc if permissions allow). To prevent this, treat incoming path values as untrusted: normalize and reject any that are absolute or escape the destination directory, and ensure operations do not follow attacker-controlled symlinks when resolving paths.

Suggested change
def do_save_recursive(dir_path, input_stream):
    while True:
        cmd = read_command()
        if cmd == 'file':
            path = read_path()
            file_path = dir_path / path
            file_path.parent.mkdir(parents=True, exist_ok=True)
            receive_and_write_file(file_path)
        elif cmd == 'mkdr':
            path, meta = read_mkdir_data()
            (dir_path / path).mkdir(parents=True, exist_ok=True)
            apply_metadata(dir_path / path, meta)
        elif cmd == 'syml':
            path, target, meta = read_symlink_data()
            (dir_path / path).symlink_to(target)
            apply_metadata(dir_path / path, meta)
        elif cmd == 'dele':
            path, type = read_delete_data()
            if type == 'file':
                (dir_path / path).unlink()
            elif type == 'dir':
                shutil.rmtree(dir_path / path)
from pathlib import Path

def _resolve_under_root(root: Path, rel_path) -> Path:
    """
    Resolve rel_path against root and ensure the resulting path
    stays within root. Reject absolute paths and any path that
    escapes the root directory.
    """
    # Normalize input to Path
    rel = Path(rel_path)
    # Absolute paths are not allowed
    if rel.is_absolute():
        raise ValueError(f"Refusing absolute path in recursive save: {rel}")
    base = root.resolve()
    candidate = (base / rel).resolve(strict=False)
    # Ensure candidate is under base (or equal to it)
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"Refusing to operate outside destination root: {candidate}")
    return candidate

def do_save_recursive(dir_path, input_stream):
    while True:
        cmd = read_command()
        if cmd == 'file':
            path = read_path()
            file_path = _resolve_under_root(dir_path, path)
            file_path.parent.mkdir(parents=True, exist_ok=True)
            receive_and_write_file(file_path)
        elif cmd == 'mkdr':
            path, meta = read_mkdir_data()
            dir_to_create = _resolve_under_root(dir_path, path)
            dir_to_create.mkdir(parents=True, exist_ok=True)
            apply_metadata(dir_to_create, meta)
        elif cmd == 'syml':
            path, target, meta = read_symlink_data()
            link_path = _resolve_under_root(dir_path, path)
            link_path.parent.mkdir(parents=True, exist_ok=True)
            link_path.symlink_to(target)
            apply_metadata(link_path, meta)
        elif cmd == 'dele':
            path, type = read_delete_data()
            target_path = _resolve_under_root(dir_path, path)
            if type == 'file':
                target_path.unlink()
            elif type == 'dir':
                shutil.rmtree(target_path)

@messa messa self-assigned this Jan 26, 2026