Add recursive directory synchronization (AI one-shot) by messa · Pull Request #9 · messa/blockcopy

messa · 2026-01-26T14:15:27Z

This is a completely AI one-shotted design plan + implementation.

There was only one intervention - to steer it away from completely removing the block-checksum-based incremental copy :)

Summary

This PR adds comprehensive design documentation for implementing recursive directory synchronization (--recursive flag) in blockcopy, along with a minor flake8 configuration adjustment.

Changes

Documentation

add_recursive_plan.md: New 558-line design document detailing the complete implementation plan for recursive directory synchronization
- Overview of current blockcopy functionality and proposed usage
- Detailed architecture with three phases: file list exchange, file transfer, and metadata application
- Binary protocol extensions with new commands (list, file, dirs, link, ends, getc, mkdr, syml, dele)
- Step-by-step synchronization flow with Python pseudocode
- New CLI arguments for --recursive, --delete, --exclude, --include, and --dry-run
- Comprehensive test plan covering unit tests, integration tests, and edge cases
- Implementation roadmap across 5 stages
- Discussion of backward compatibility, version detection, and risk mitigation

Configuration

.flake8: Increased max-complexity threshold from 70 to 80 to accommodate the more complex recursive implementation

Implementation Details

The design document establishes:

Protocol changes: New binary commands for directory listing, file metadata, and deletion operations
Three-stage synchronization: List exchange → change detection → atomic application
Fast path optimization: Quick comparison using mtime/size before full checksum validation
Special case handling: Empty directories, symlinks, hardlinks, sparse files, and permission preservation
Comprehensive testing strategy: 30+ test cases covering normal operations, edge cases, and error conditions
Phased rollout: 5 implementation stages from basic infrastructure to advanced filtering

This document serves as the specification for the recursive feature implementation and will guide development efforts.

Design document covering: - Protocol extensions for directory listing (list, file, dirs, link, ends, getc) - Protocol extensions for file transfer (mkdr, syml, dele) - Detailed workflow for checksum/retrieve/save stages - New CLI options (--recursive, --delete, --exclude, --include, --dry-run) - Edge case handling (symlinks, empty dirs, special files, permissions) - Comprehensive test plan (unit tests, integration tests, edge cases) - Implementation roadmap in 5 stages

Add support for synchronizing entire directory trees with --recursive flag: Features: - Scan and compare directory structures on source and destination - Transfer only changed files (based on size/mtime comparison) - Create missing directories on destination - Preserve symlinks (copy link target, not content) - Support --delete flag to remove extra files on destination - Full metadata preservation support (permissions, timestamps, owner/group) - LZMA compression support for transferred data New protocol commands: - list/ends: Directory listing exchange - file/dirs/link: Entry types in listing - FILE/fend: File data transfer markers - mkdr/syml: Create directories and symlinks - dele: Delete files/directories (with --delete) Implementation uses unidirectional protocol compatible with pipeline: blockcopy checksum -r /dst | ssh src blockcopy retrieve -r /src | blockcopy save -r /dst Tests cover: empty dirs, nested structures, symlinks, large files, incremental updates, deletion, special characters in filenames.

…-sync-uO2UZ

Changed the recursive mode protocol to use block-level checksums instead of transferring entire files. Now checksum sends SHA3-512 hashes for each 128KB block of every file, and retrieve compares them to send only changed blocks. This brings the recursive mode in line with the original single-file mode's efficiency - only modified portions of files are transferred. Added test_recursive_block_level_diff to verify the behavior.

Copilot

Pull request overview

This PR adds comprehensive design documentation and implementation for recursive directory synchronization to blockcopy, enabling it to sync entire directory trees similar to rsync -r.

Changes:

Added 558-line design document (in Czech) detailing the recursive sync architecture, protocol extensions, and implementation roadmap
Implemented recursive directory synchronization with three new functions: do_checksum_recursive, do_retrieve_recursive, and do_save_recursive
Added comprehensive test suite with 15 integration tests covering basic operations, incremental sync, symlinks, deletion, and block-level diff
Bumped version from 0.0.2 to 0.0.3 and increased flake8 max-complexity threshold

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 15 comments.

Show a summary per file

File	Description
add_recursive_plan.md	Design document describing the recursive sync architecture, binary protocol extensions, and implementation plan
blockcopy.py	Core implementation adding data classes for entries, directory scanning, serialization functions, and three recursive mode functions
tests/test_recursive.py	Comprehensive test suite with 15 tests covering recursive synchronization scenarios
uv.lock	Version bump from 0.0.2 to 0.0.3
.flake8	Increased max-complexity from 70 to 80 to accommodate recursive implementation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-26T14:33:58Z

+        except OSError as e:
+            logger.warning('Failed to read file %s for hashing: %s', file_path, e)
+
+        # End of file hashes
+        output_stream.write(b'fend')


If a file read fails during hashing (line 1227), the code logs a warning but continues to send 'fend' for that file. This could cause the retrieve side to receive a file entry with no hashes, which may not be handled correctly. Consider either skipping the file entirely (not sending the file entry) or sending an error marker so the retrieve side can handle the missing file appropriately.

Copilot · 2026-01-26T14:33:58Z

+                # Open file for writing (create if doesn't exist)
+                if full_path.exists():
+                    f = open(full_path, 'r+b')


When receiving a file (FILE command), the code doesn't handle the case where a directory or symlink already exists at that path. The code tries to open() the path (line 1600), which will fail if it's a directory. The symlink handler (line 1569-1570) correctly removes existing files/symlinks before creating a new symlink, but the FILE handler should do the same to handle type conflicts (e.g., when a directory on the destination needs to be replaced with a file from the source).

Suggested change

# Open file for writing (create if doesn't exist)

if full_path.exists():

f = open(full_path, 'r+b')

# Open file for writing (create if doesn't exist).

# If a non-regular file (e.g. directory or symlink) exists at this path,

# remove it first so that type conflicts are handled correctly.

if full_path.exists():

if full_path.is_file():

f = open(full_path, 'r+b')

else:

# Remove existing non-file (directory, symlink, etc.)

if full_path.is_dir() and not full_path.is_symlink():

# Recursively delete directory contents

for p in sorted(full_path.rglob('*'), reverse=True):

if p.is_file() or p.is_symlink():

p.unlink()

else:

p.rmdir()

full_path.rmdir()

else:

full_path.unlink()

f = open(full_path, 'wb')

Copilot · 2026-01-26T14:33:58Z

+    # Send directories to create (sorted by path to create parents first)
+    for path in sorted(src_dirs_dict.keys()):
+        if path not in dst_dirs:
+            entry = src_dirs_dict[path]
+            output_stream.write(b'mkdr')
+            path_bytes = path.encode('utf-8')
+            owner_bytes = entry.owner_name.encode('utf-8')
+            group_bytes = entry.group_name.encode('utf-8')
+            output_stream.write(len(path_bytes).to_bytes(2, 'big'))
+            output_stream.write(path_bytes)
+            output_stream.write(entry.mtime_ns.to_bytes(8, 'big', signed=True))
+            output_stream.write(entry.mode.to_bytes(4, 'big'))
+            output_stream.write(entry.uid.to_bytes(4, 'big'))
+            output_stream.write(entry.gid.to_bytes(4, 'big'))
+            output_stream.write(len(owner_bytes).to_bytes(2, 'big'))
+            output_stream.write(owner_bytes)
+            output_stream.write(len(group_bytes).to_bytes(2, 'big'))
+            output_stream.write(group_bytes)
+            logger.debug('Will create directory: %s', path)
+
+    # Send symlinks
+    for path, entry in sorted(src_symlinks_dict.items()):
+        dst_entry = dst_symlinks.get(path)
+        if dst_entry is None or dst_entry.target != entry.target:
+            output_stream.write(b'syml')
+            path_bytes = path.encode('utf-8')
+            target_bytes = entry.target.encode('utf-8')
+            output_stream.write(len(path_bytes).to_bytes(2, 'big'))
+            output_stream.write(path_bytes)
+            output_stream.write(len(target_bytes).to_bytes(2, 'big'))
+            output_stream.write(target_bytes)
+            output_stream.write(entry.mtime_ns.to_bytes(8, 'big', signed=True))
+            output_stream.write(entry.uid.to_bytes(4, 'big'))
+            output_stream.write(entry.gid.to_bytes(4, 'big'))
+            logger.debug('Will create/update symlink: %s -> %s', path, entry.target)
+
+    output_stream.flush()
+
+    # Process files with block-level diff


The retrieve logic doesn't handle type conflicts. For example, if a path is a directory on the destination but a file on the source, the code will send a FILE command but won't send a deletion command for the directory first. Similarly, if a path is a file on destination but a directory on source, it will send mkdr but won't delete the file first. The code should detect these conflicts and send appropriate dele commands before sending the new entry type.

Copilot · 2026-01-26T14:33:59Z

+                symlink(target, full_path)
+                logger.debug('Created symlink: %s -> %s', full_path, target)
+


The symlink handling creates symlinks with arbitrary target paths received from the network without validation. A malicious source could send symlinks pointing to sensitive files (e.g., '../../../etc/passwd') which could lead to security issues if the symlinks are later followed. Consider adding validation to detect and warn about or reject symlinks that point outside the destination directory tree.

Suggested change

symlink(target, full_path)

logger.debug('Created symlink: %s -> %s', full_path, target)

# Validate symlink target to avoid pointing outside destination tree

base_dir = dir_path.resolve()

parent_dir = full_path.parent.resolve()

from pathlib import Path # Path is likely already imported elsewhere; repeated import is harmless

if Path(target).is_absolute():

effective_target = Path(target).resolve(strict=False)

else:

effective_target = (parent_dir / target).resolve(strict=False)

if not effective_target.is_relative_to(base_dir):

logger.warning(

'Refusing to create symlink %s -> %s: target %s is outside destination tree %s',

full_path,

target,

effective_target,

base_dir,

)

else:

symlink(target, full_path)

logger.debug('Created symlink: %s -> %s', full_path, target)

Copilot · 2026-01-26T14:33:59Z

+        stat_result = file_path.stat()
+        atime_ns = stat_result.st_atime_ns
+        mtime_ns = stat_result.st_mtime_ns
+        mode = stat_result.st_mode
+        uid = stat_result.st_uid
+        gid = stat_result.st_gid
+        owner_name = get_user_name_by_uid(uid)
+        group_name = get_group_name_by_gid(gid)
+
+        with open(file_path, 'rb') as f:
+            total_size = f.seek(0, SEEK_END)


The metadata is read after writing the file blocks, which could result in a race condition. If the file is modified between when the blocks are read (lines 1391-1430) and when the metadata is read (line 1438), the metadata (particularly size and mtime) may not match the actual data that was sent. Consider reading the metadata before processing the file blocks, or using the file descriptor from the already-open file handle.

Copilot · 2026-01-26T14:34:00Z

+                        full_path.unlink()
+                        logger.debug('Deleted symlink: %s', full_path)
+                elif entry_type == ENTRY_DIR:
+                    if full_path.is_dir():


The deletion logic for directories should check for symlinks first before checking is_dir(). A symlink to a directory will return True for both is_symlink() and is_dir(), which could lead to unexpected behavior. The code should handle the case where a directory entry in the delete list is actually a symlink to a directory on the destination, to avoid following and deleting the linked directory's contents.

Suggested change

if full_path.is_dir():

if full_path.is_symlink():

# Avoid following symlinks to directories; delete the link itself.

full_path.unlink()

logger.debug('Deleted symlink to directory: %s', full_path)

elif full_path.is_dir():

Copilot · 2026-01-26T14:34:01Z

+                rel_path = input_stream.read(path_len).decode('utf-8')
+                target_len = int.from_bytes(input_stream.read(2), 'big')
+                target = input_stream.read(target_len).decode('utf-8')
+                mtime_ns = int.from_bytes(input_stream.read(8), 'big', signed=True)


This assignment to 'mtime_ns' is unnecessary as it is redefined before this value is used.
This assignment to 'mtime_ns' is unnecessary as it is redefined before this value is used.

Suggested change

mtime_ns = int.from_bytes(input_stream.read(8), 'big', signed=True)

input_stream.read(8) # mtime_ns, ignored for symlinks

Copilot · 2026-01-26T14:34:01Z

+
+                # Open file for writing (create if doesn't exist)
+                if full_path.exists():
+                    f = open(full_path, 'r+b')


File may not be closed if this operation raises an exception.

Copilot · 2026-01-26T14:34:01Z

+                if full_path.exists():
+                    f = open(full_path, 'r+b')
+                else:
+                    f = open(full_path, 'wb')


File may not be closed if this operation raises an exception.

Copilot · 2026-01-26T14:34:02Z

+def do_save_recursive(dir_path, input_stream):
+    while True:
+        cmd = read_command()
+
+        if cmd == 'file':
+            path = read_path()
+            file_path = dir_path / path
+            file_path.parent.mkdir(parents=True, exist_ok=True)
+            receive_and_write_file(file_path)
+
+        elif cmd == 'mkdr':
+            path, meta = read_mkdir_data()
+            (dir_path / path).mkdir(parents=True, exist_ok=True)
+            apply_metadata(dir_path / path, meta)
+
+        elif cmd == 'syml':
+            path, target, meta = read_symlink_data()
+            (dir_path / path).symlink_to(target)
+            apply_metadata(dir_path / path, meta)
+
+        elif cmd == 'dele':
+            path, type = read_delete_data()
+            if type == 'file':
+                (dir_path / path).unlink()
+            elif type == 'dir':
+                shutil.rmtree(dir_path / path)


The do_save_recursive design currently trusts path values received over the wire and directly uses dir_path / path for file creation, directory creation, symlink creation, and deletion without constraining them to stay under the destination root. A malicious or compromised peer running retrieve (or any process that can speak the protocol) can send paths containing .. segments or absolute paths and cause save to create, overwrite, or delete files and directories outside the intended destination tree (for example, modifying ~/.ssh/authorized_keys or /etc if permissions allow). To prevent this, treat incoming path values as untrusted: normalize and reject any that are absolute or escape the destination directory, and ensure operations do not follow attacker-controlled symlinks when resolving paths.

Suggested change

def do_save_recursive(dir_path, input_stream):

while True:

cmd = read_command()

if cmd == 'file':

path = read_path()

file_path = dir_path / path

file_path.parent.mkdir(parents=True, exist_ok=True)

receive_and_write_file(file_path)

elif cmd == 'mkdr':

path, meta = read_mkdir_data()

(dir_path / path).mkdir(parents=True, exist_ok=True)

apply_metadata(dir_path / path, meta)

elif cmd == 'syml':

path, target, meta = read_symlink_data()

(dir_path / path).symlink_to(target)

apply_metadata(dir_path / path, meta)

elif cmd == 'dele':

path, type = read_delete_data()

if type == 'file':

(dir_path / path).unlink()

elif type == 'dir':

shutil.rmtree(dir_path / path)

from pathlib import Path

def _resolve_under_root(root: Path, rel_path) -> Path:

"""

Resolve rel_path against root and ensure the resulting path

stays within root. Reject absolute paths and any path that

escapes the root directory.

"""

# Normalize input to Path

rel = Path(rel_path)

# Absolute paths are not allowed

if rel.is_absolute():

raise ValueError(f"Refusing absolute path in recursive save: {rel}")

base = root.resolve()

candidate = (base / rel).resolve(strict=False)

# Ensure candidate is under base (or equal to it)

if candidate != base and base not in candidate.parents:

raise ValueError(f"Refusing to operate outside destination root: {candidate}")

return candidate

def do_save_recursive(dir_path, input_stream):

while True:

cmd = read_command()

if cmd == 'file':

path = read_path()

file_path = _resolve_under_root(dir_path, path)

file_path.parent.mkdir(parents=True, exist_ok=True)

receive_and_write_file(file_path)

elif cmd == 'mkdr':

path, meta = read_mkdir_data()

dir_to_create = _resolve_under_root(dir_path, path)

dir_to_create.mkdir(parents=True, exist_ok=True)

apply_metadata(dir_to_create, meta)

elif cmd == 'syml':

path, target, meta = read_symlink_data()

link_path = _resolve_under_root(dir_path, path)

link_path.parent.mkdir(parents=True, exist_ok=True)

link_path.symlink_to(target)

apply_metadata(link_path, meta)

elif cmd == 'dele':

path, type = read_delete_data()

target_path = _resolve_under_root(dir_path, path)

if type == 'file':

target_path.unlink()

elif type == 'dir':

shutil.rmtree(target_path)

claude added 5 commits January 26, 2026 13:13

Update uv.lock

c09cbe1

Merge remote-tracking branch 'origin/main' into claude/plan-recursive…

21303f1

…-sync-uO2UZ

Copilot AI review requested due to automatic review settings January 26, 2026 14:15

Copilot started reviewing on behalf of messa January 26, 2026 14:15 View session

messa marked this pull request as draft January 26, 2026 14:16

messa changed the title ~~Add recursive directory synchronization design document~~ Add recursive directory synchronization (AI one-shot) Jan 26, 2026

Copilot AI reviewed Jan 26, 2026

View reviewed changes

messa self-assigned this Jan 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add recursive directory synchronization (AI one-shot)#9

Add recursive directory synchronization (AI one-shot)#9
messa wants to merge 5 commits intomainfrom
claude/plan-recursive-sync-uO2UZ

messa commented Jan 26, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Copilot AI Jan 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-                # Open file for writing (create if doesn't exist)
-                if full_path.exists():
-                    f = open(full_path, 'r+b')
+                # Open file for writing (create if doesn't exist).
+                # If a non-regular file (e.g. directory or symlink) exists at this path,
+                # remove it first so that type conflicts are handled correctly.
+                if full_path.exists():
+                    if full_path.is_file():
+                        f = open(full_path, 'r+b')
+                    else:
+                        # Remove existing non-file (directory, symlink, etc.)
+                        if full_path.is_dir() and not full_path.is_symlink():
+                            # Recursively delete directory contents
+                            for p in sorted(full_path.rglob('*'), reverse=True):
+                                if p.is_file() or p.is_symlink():
+                                    p.unlink()
+                                else:
+                                    p.rmdir()
+                            full_path.rmdir()
+                        else:
+                            full_path.unlink()
+                        f = open(full_path, 'wb')

		symlink(target, full_path)
		logger.debug('Created symlink: %s -> %s', full_path, target)

-                symlink(target, full_path)
-                logger.debug('Created symlink: %s -> %s', full_path, target)
+                # Validate symlink target to avoid pointing outside destination tree
+                base_dir = dir_path.resolve()
+                parent_dir = full_path.parent.resolve()
+                from pathlib import Path  # Path is likely already imported elsewhere; repeated import is harmless
+                if Path(target).is_absolute():
+                    effective_target = Path(target).resolve(strict=False)
+                else:
+                    effective_target = (parent_dir / target).resolve(strict=False)
+                if not effective_target.is_relative_to(base_dir):
+                    logger.warning(
+                        'Refusing to create symlink %s -> %s: target %s is outside destination tree %s',
+                        full_path,
+                        target,
+                        effective_target,
+                        base_dir,
+                    )
+                else:
+                    symlink(target, full_path)
+                    logger.debug('Created symlink: %s -> %s', full_path, target)

-                    if full_path.is_dir():
+                    if full_path.is_symlink():
+                        # Avoid following symlinks to directories; delete the link itself.
+                        full_path.unlink()
+                        logger.debug('Deleted symlink to directory: %s', full_path)
+                    elif full_path.is_dir():

	mtime_ns = int.from_bytes(input_stream.read(8), 'big', signed=True)
	input_stream.read(8) # mtime_ns, ignored for symlinks

Conversation

messa commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Documentation

Configuration

Implementation Details

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

messa commented Jan 26, 2026 •

edited

Loading