@pablogsal pablogsal commented Dec 15, 2025

Defines the API and data structures for a high-performance binary
format for profiling data. The format uses string/frame deduplication,
varint encoding, and delta compression to achieve 10-50x size reduction
compared to text formats. Optional zstd compression provides additional
savings.

The header includes inline varint encode/decode functions since these
are called in tight loops during both writing and reading. Structures
for both writer (BinaryWriter) and reader (BinaryReader) are defined
here to allow the module.c bindings to allocate them.
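For illustration, here is what LEB128-style unsigned varint encoding looks like as a pure-Python sketch (the inline C helpers in the header may differ in detail):

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer as a LEB128-style varint:
    7 payload bits per byte, high bit set on every byte but the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a varint starting at `pos`; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos
```

Small values (frame and string table indices, short deltas) fit in one or two bytes, which is why varints pay off in a format dominated by small integers.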
Implements streaming binary output with delta compression. The writer
tracks per-thread state to encode stack changes efficiently: identical
stacks use RLE, similar stacks encode only the differing frames.

String and frame deduplication uses Python's hashtable implementation
for O(1) lookup during interning. The 512KB write buffer amortizes
syscall overhead. When zstd is available, data streams through
compression before hitting disk.

Finalization writes the string/frame tables and footer, then seeks
back to update the header with final counts and offsets.
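The writer's encoding decision can be sketched in Python roughly as follows. The record names (`repeat`/`suffix`/`pop_push`/`full`) and the outermost-frame-first stack ordering are illustrative choices for this sketch, not the actual on-disk opcodes:

```python
def encode_stack(prev: list[int], cur: list[int]) -> tuple[str, object]:
    """Choose a delta record for `cur` given the same thread's previous
    stack. Stacks are lists of interned frame indices, outermost first."""
    if cur == prev:
        return ("repeat", None)  # RLE: identical stack, bump a counter
    # Find the longest shared prefix; outermost frames rarely change.
    common = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        common += 1
    if common == len(prev):
        # prev is a prefix of cur: write only the newly pushed frames
        return ("suffix", cur[common:])
    if common > 0:
        # pop the differing tail, then push the replacement frames
        return ("pop_push", (len(prev) - common, cur[common:]))
    return ("full", cur)  # nothing shared: write the whole stack
```

For example, `encode_stack([1, 2], [1, 2, 3, 4])` yields a suffix record carrying only frames `[3, 4]`, so a deepening call stack costs a couple of varints instead of a full stack write.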
Implements binary file parsing with stack reconstruction. On Unix,
uses mmap with MADV_SEQUENTIAL for efficient sequential access. Falls
back to buffered I/O on Windows.

The reader reconstructs full stacks from delta-encoded records by
maintaining per-thread state. Each sample's stack is rebuilt by
applying the encoded operation (repeat/suffix/pop-push) to the
previous stack for that thread.

Replay feeds reconstructed samples to any collector, enabling
conversion between formats without re-profiling.
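Reconstruction is the inverse of encoding: a minimal Python sketch, using illustrative record kinds (`repeat`/`suffix`/`pop_push`/`full`) rather than the real opcodes:

```python
def apply_record(prev: list[int], kind: str, payload) -> list[int]:
    """Rebuild a full stack from the same thread's previous stack plus
    one delta record. Stacks are lists of interned frame indices."""
    if kind == "repeat":
        return list(prev)                        # identical to last stack
    if kind == "suffix":
        return prev + payload                    # frames pushed on top
    if kind == "pop_push":
        npop, pushed = payload
        return prev[:len(prev) - npop] + pushed  # drop tail, add new tail
    return list(payload)                         # "full": complete stack
```

Because each record only needs the previous stack for its thread, the reader keeps one saved stack per thread and streams through the file in a single pass.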
Adds binary_io_writer.c and binary_io_reader.c to the _remote_debugging
module compilation. Also hooks up optional zstd support: when libzstd
is found by pkg-config, the module compiles with HAVE_ZSTD defined and
links against libzstd. Without zstd, the module still builds but
compression is unavailable.

Adds binary_io_writer.c, binary_io_reader.c, and binary_io.h to the
Visual Studio project for _remote_debugging.

Exposes BinaryWriter and BinaryReader as Python types in
_remote_debugging module. BinaryWriter wraps the C writer with
write_sample() and finalize() methods. BinaryReader provides replay()
to feed samples through any collector.

Also adds zstd_available() function to let Python code check whether
compression support was compiled in.

Thin wrapper around the C BinaryWriter. Implements the Collector
interface so it can be used interchangeably with other collectors
like FlamegraphCollector or GeckoCollector.

Compression is configurable: 'auto' uses zstd when available, 'zstd'
requires it, 'none' disables compression. The collector passes
samples directly to C for encoding without building Python data
structures.
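The mode selection amounts to something like this (a hypothetical helper written to match the documented semantics, not the collector's actual code):

```python
def resolve_compression(mode: str, zstd_available: bool) -> bool:
    """Return True if samples should be zstd-compressed.
    'auto' uses zstd when available, 'zstd' requires it,
    'none' disables compression."""
    if mode == "none":
        return False
    if mode == "zstd":
        if not zstd_available:
            raise ValueError("zstd requested but support was not compiled in")
        return True
    if mode == "auto":
        return zstd_available
    raise ValueError(f"unknown compression mode: {mode!r}")
```

The 'auto' default means a build without libzstd still produces valid (just larger) binary profiles instead of failing.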
Wrapper around the C BinaryReader providing file info access and
replay functionality. The replay() method reconstructs samples from
the binary file and feeds them to any collector, enabling format
conversion without re-profiling.

Includes get_info() for metadata access (sample count, thread count,
compression type) and get_stats() for decoding statistics.

Adds --binary output format and --compression option to run/attach
commands. The replay command converts binary profiles to other formats:

    python -m profiling.sampling replay profile.bin
    python -m profiling.sampling replay --flamegraph -o out.html profile.bin

This enables a record-and-replay workflow: capture in binary format
during profiling (faster, smaller files), then convert to visualization
formats later without re-profiling.

Adds optional timestamp_us parameter to Collector.collect() method.
During live profiling this is None and collectors use their own timing.
During binary replay the stored timestamp is passed through, allowing
collectors to reconstruct the original timing.

Also fixes gecko_collector to use time.monotonic() instead of time.time()
for consistency with other collectors.
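The timestamp convention can be sketched with a hypothetical base class (not the actual Collector implementation):

```python
import time

class CollectorBase:
    """Sketch of the timestamp_us convention: live profiling passes
    None and the collector samples its own monotonic clock; binary
    replay passes the stored timestamp through unchanged."""

    def collect(self, frames, timestamp_us=None):
        if timestamp_us is None:
            # Live profiling: monotonic clock, consistent across collectors
            timestamp_us = int(time.monotonic() * 1_000_000)
        self.on_sample(frames, timestamp_us)

    def on_sample(self, frames, timestamp_us):
        raise NotImplementedError
```

Replayed samples therefore carry the original capture times, so time-based views (like Gecko timelines) look the same whether built live or from a binary file.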
Tests cover the full write/read cycle, delta encoding (RLE, suffix,
pop-push), compression modes, edge cases (empty files, deep stacks,
many threads), and replay through different collectors.

The mock-based tests verify encoding behavior without needing actual
profiling, while integration tests exercise the complete pipeline.

Documents the file layout, encoding schemes, and design rationale.
Covers header/footer structure, delta encoding types (repeat, suffix,
pop-push), string/frame deduplication, and compression integration.

Intended for developers working on the profiler implementation.

Adds user documentation for --binary output format and the replay
command. Covers compression options, the record-and-replay workflow,
and examples of converting between formats.

@pablogsal pablogsal changed the title Add binary I/O header for sampling profiler gh-142636: Allow tachyon to write and read binary output Dec 15, 2025
@pablogsal pablogsal changed the title gh-142636: Allow tachyon to write and read binary output gh-138122: Allow tachyon to write and read binary output Dec 15, 2025
@pablogsal

I ran some benchmarks to validate the binary format implementation. Here's what I found.

The test workload ran a bunch of tests from the test suite (test_list, test_dict, test_tokenize, test_exceptions, test_syntax, test_threading), taking approximately 28 seconds on Linux with my Intel hybrid CPU clocked at 4.9 GHz, using ZSTD level 5 streaming compression with a 2 MB window.

The binary writer hits 199,175 samples/second in this run, capturing 5.6 million samples. For reference, that's enough to profile 199 threads at once with 1 ms sampling. I ran perf record --call-graph dwarf to see where CPU time actually goes. Here's the breakdown by shared object:

63.1%  python (interpreter running the tests)
31.9%  _remote_debugging (the profiler extension)
 3.9%  libc
 0.19% libzstd

ZSTD compression is 0.19% of total CPU time. The binary format overhead is essentially free.

Within the profiler extension, the hot functions are:

13.2%  _Py_RemoteDebug_PagedReadRemoteMemory (reading target process memory)
 7.4%  _remote_debugging_RemoteUnwinder_get_stack_trace_impl (unwinding stacks)
 3.5%  process_thread_sample
 1.2%  frame_key_compare_func
 1.2%  parse_linetable
 0.7%  string_compare_func
 0.6%  string_hash_func

The binary writing and compression functions don't even show up in the profile: they're below the 0.5% threshold. All the profiler overhead is in reading remote memory and unwinding stacks, not in the output format.

Compression gets a 159.6x ratio on profiling data, turning 74.51 bytes per sample into 0.47 bytes. A 1-hour profile at 1000 samples/sec that would normally take 268 MB on disk shrinks to just 1.7 MB. The interning system stores each unique string once and references it an average of 1,658 times; each unique frame gets referenced 653 times. Without interning, string data alone would be 79 MB. With interning, it's 42.6 KB.
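Those headline figures are easy to sanity-check from the per-sample numbers:

```python
samples = 3600 * 1000                 # 1 hour at 1000 samples/sec
raw_mb = samples * 74.51 / 1e6        # 74.51 bytes/sample uncompressed
compressed_mb = samples * 0.47 / 1e6  # 0.47 bytes/sample after encoding+zstd

print(round(raw_mb, 1))         # ~268.2 MB
print(round(compressed_mb, 2))  # ~1.69 MB
```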

Raw numbers from the run:

Captured 5,659,261 samples in 28.41 seconds
Sample rate: 199,175.43 samples/sec
Error rate: 0.31%

Binary Encoding:
  Records:          111,957
    RLE repeat:     30,314 (27.1%) [5,559,903 samples]
    Full stack:     4,426 (4.0%)
    Suffix match:   22,065 (19.7%)
    Pop-push:       55,152 (49.3%)

Frame Efficiency:
  Frames written:   140,547
  Frames saved:     192,318,495 (99.9%)

The encoding stats show RLE (run-length encoding) is working well: 27% of records are RLE repeats, covering 5.5M samples. The frame efficiency numbers show the encoding schemes save 99.9% of frame writes.

pablogsal and others added 2 commits December 15, 2025 18:36
Merged changes from upstream/main including:
- Subprocess enumeration functionality (get_child_pids, is_python_process)
- Various fixes and improvements

Combined with file-output branch features:
- Binary I/O writer and reader for profiling data
- Binary format export/replay support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>