@pablogsal pablogsal commented Dec 15, 2025

Defines the API and data structures for a high-performance binary
format for profiling data. The format uses string/frame deduplication,
varint encoding, and delta compression to achieve 10-50x size reduction
compared to text formats. Optional zstd compression provides additional
savings.

The header includes inline varint encode/decode functions since these
are called in tight loops during both writing and reading. Structures
for both writer (BinaryWriter) and reader (BinaryReader) are defined
here to allow the module.c bindings to allocate them.
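For illustration, here is what LEB128-style unsigned varint encoding looks like as a pure-Python sketch (the inline C helpers in the header may differ in detail):

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer as a LEB128-style varint:
    7 payload bits per byte, high bit set on every byte but the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a varint starting at `pos`; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos
```

Small values (frame and string table indices, short deltas) fit in one or two bytes, which is why varints pay off in a format dominated by small integers.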
Implements streaming binary output with delta compression. The writer
tracks per-thread state to encode stack changes efficiently: identical
stacks use RLE, similar stacks encode only the differing frames.

String and frame deduplication uses Python's hashtable implementation
for O(1) lookup during interning. The 512KB write buffer amortizes
syscall overhead. When zstd is available, data streams through
compression before hitting disk.

Finalization writes the string/frame tables and footer, then seeks
back to update the header with final counts and offsets.
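The writer's encoding decision can be sketched in Python roughly as follows. The record names (`repeat`/`suffix`/`pop_push`/`full`) and the outermost-frame-first stack ordering are illustrative choices for this sketch, not the actual on-disk opcodes:

```python
def encode_stack(prev: list[int], cur: list[int]) -> tuple[str, object]:
    """Choose a delta record for `cur` given the same thread's previous
    stack. Stacks are lists of interned frame indices, outermost first."""
    if cur == prev:
        return ("repeat", None)  # RLE: identical stack, bump a counter
    # Find the longest shared prefix; outermost frames rarely change.
    common = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        common += 1
    if common == len(prev):
        # prev is a prefix of cur: write only the newly pushed frames
        return ("suffix", cur[common:])
    if common > 0:
        # pop the differing tail, then push the replacement frames
        return ("pop_push", (len(prev) - common, cur[common:]))
    return ("full", cur)  # nothing shared: write the whole stack
```

For example, `encode_stack([1, 2], [1, 2, 3, 4])` yields a suffix record carrying only frames `[3, 4]`, so a deepening call stack costs a couple of varints instead of a full stack write.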
Implements binary file parsing with stack reconstruction. On Unix,
uses mmap with MADV_SEQUENTIAL for efficient sequential access. Falls
back to buffered I/O on Windows.

The reader reconstructs full stacks from delta-encoded records by
maintaining per-thread state. Each sample's stack is rebuilt by
applying the encoded operation (repeat/suffix/pop-push) to the
previous stack for that thread.

Replay feeds reconstructed samples to any collector, enabling
conversion between formats without re-profiling.
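Reconstruction is the inverse of encoding: a minimal Python sketch, using illustrative record kinds (`repeat`/`suffix`/`pop_push`/`full`) rather than the real opcodes:

```python
def apply_record(prev: list[int], kind: str, payload) -> list[int]:
    """Rebuild a full stack from the same thread's previous stack plus
    one delta record. Stacks are lists of interned frame indices."""
    if kind == "repeat":
        return list(prev)                        # identical to last stack
    if kind == "suffix":
        return prev + payload                    # frames pushed on top
    if kind == "pop_push":
        npop, pushed = payload
        return prev[:len(prev) - npop] + pushed  # drop tail, add new tail
    return list(payload)                         # "full": complete stack
```

Because each record only needs the previous stack for its thread, the reader keeps one saved stack per thread and streams through the file in a single pass.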
Adds binary_io_writer.c and binary_io_reader.c to the _remote_debugging
module compilation. Also hooks up optional zstd support: when libzstd
is found by pkg-config, the module compiles with HAVE_ZSTD defined and
links against libzstd. Without zstd, the module still builds but
compression is unavailable.

Adds binary_io_writer.c, binary_io_reader.c, and binary_io.h to the
Visual Studio project for _remote_debugging.

Exposes BinaryWriter and BinaryReader as Python types in
_remote_debugging module. BinaryWriter wraps the C writer with
write_sample() and finalize() methods. BinaryReader provides replay()
to feed samples through any collector.

Also adds zstd_available() function to let Python code check whether
compression support was compiled in.

Thin wrapper around the C BinaryWriter. Implements the Collector
interface so it can be used interchangeably with other collectors
like FlamegraphCollector or GeckoCollector.

Compression is configurable: 'auto' uses zstd when available, 'zstd'
requires it, 'none' disables compression. The collector passes
samples directly to C for encoding without building Python data
structures.
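The mode selection amounts to something like this (a hypothetical helper written to match the documented semantics, not the collector's actual code):

```python
def resolve_compression(mode: str, zstd_available: bool) -> bool:
    """Return True if samples should be zstd-compressed.
    'auto' uses zstd when available, 'zstd' requires it,
    'none' disables compression."""
    if mode == "none":
        return False
    if mode == "zstd":
        if not zstd_available:
            raise ValueError("zstd requested but support was not compiled in")
        return True
    if mode == "auto":
        return zstd_available
    raise ValueError(f"unknown compression mode: {mode!r}")
```

The 'auto' default means a build without libzstd still produces valid (just larger) binary profiles instead of failing.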
Wrapper around the C BinaryReader providing file info access and
replay functionality. The replay() method reconstructs samples from
the binary file and feeds them to any collector, enabling format
conversion without re-profiling.

Includes get_info() for metadata access (sample count, thread count,
compression type) and get_stats() for decoding statistics.

Adds --binary output format and --compression option to run/attach
commands. The replay command converts binary profiles to other formats:

    python -m profiling.sampling replay profile.bin
    python -m profiling.sampling replay --flamegraph -o out.html profile.bin

This enables a record-and-replay workflow: capture in binary format
during profiling (faster, smaller files), then convert to visualization
formats later without re-profiling.

Adds optional timestamp_us parameter to Collector.collect() method.
During live profiling this is None and collectors use their own timing.
During binary replay the stored timestamp is passed through, allowing
collectors to reconstruct the original timing.

Also fixes gecko_collector to use time.monotonic() instead of time.time()
for consistency with other collectors.
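The timestamp convention can be sketched with a hypothetical base class (not the actual Collector implementation):

```python
import time

class CollectorBase:
    """Sketch of the timestamp_us convention: live profiling passes
    None and the collector samples its own monotonic clock; binary
    replay passes the stored timestamp through unchanged."""

    def collect(self, frames, timestamp_us=None):
        if timestamp_us is None:
            # Live profiling: monotonic clock, consistent across collectors
            timestamp_us = int(time.monotonic() * 1_000_000)
        self.on_sample(frames, timestamp_us)

    def on_sample(self, frames, timestamp_us):
        raise NotImplementedError
```

Replayed samples therefore carry the original capture times, so time-based views (like Gecko timelines) look the same whether built live or from a binary file.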
Tests cover the full write/read cycle, delta encoding (RLE, suffix,
pop-push), compression modes, edge cases (empty files, deep stacks,
many threads), and replay through different collectors.

The mock-based tests verify encoding behavior without needing actual
profiling, while integration tests exercise the complete pipeline.

Documents the file layout, encoding schemes, and design rationale.
Covers header/footer structure, delta encoding types (repeat, suffix,
pop-push), string/frame deduplication, and compression integration.

Intended for developers working on the profiler implementation.

Adds user documentation for --binary output format and the replay
command. Covers compression options, the record-and-replay workflow,
and examples of converting between formats.

@pablogsal pablogsal changed the title Add binary I/O header for sampling profiler gh-142636: Allow tachyon to write and read binary output Dec 15, 2025
@pablogsal pablogsal changed the title gh-142636: Allow tachyon to write and read binary output gh-138122: Allow tachyon to write and read binary output Dec 15, 2025
@pablogsal

I ran some benchmarks to validate the binary format implementation. Here's what I found.

The test workload ran a bunch of tests from the test suite (test_list, test_dict, test_tokenize, test_exceptions, test_syntax, test_threading), taking approximately 28 seconds on Linux with my Intel hybrid CPU clocked at 4.9 GHz, using ZSTD level 5 streaming compression with a 2 MB window.

The binary writer hits 199,175 samples/second in this run, capturing 5.6 million samples. For reference, that's enough to profile 199 threads at once with 1 ms sampling. I ran perf record --call-graph dwarf to see where CPU time actually goes. Here's the breakdown by shared object:

63.1%  python (interpreter running the tests)
31.9%  _remote_debugging (the profiler extension)
 3.9%  libc
 0.19% libzstd

ZSTD compression is 0.19% of total CPU time. The binary format overhead is essentially free.

Within the profiler extension, the hot functions are:

13.2%  _Py_RemoteDebug_PagedReadRemoteMemory (reading target process memory)
 7.4%  _remote_debugging_RemoteUnwinder_get_stack_trace_impl (unwinding stacks)
 3.5%  process_thread_sample
 1.2%  frame_key_compare_func
 1.2%  parse_linetable
 0.7%  string_compare_func
 0.6%  string_hash_func

The binary writing and compression functions don't even show up in the profile: they're below the 0.5% threshold. All the profiler overhead is in reading remote memory and unwinding stacks, not in the output format.

Compression gets a 159.6x ratio on profiling data, turning 74.51 bytes per sample into 0.47 bytes. A 1-hour profile at 1000 samples/sec that would normally take 268 MB on disk shrinks to just 1.7 MB. The interning system stores each unique string once and references it an average of 1,658 times; each unique frame gets referenced 653 times. Without interning, string data alone would be 79 MB. With interning, it's 42.6 KB.
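Those headline figures are easy to sanity-check from the per-sample numbers:

```python
samples = 3600 * 1000                 # 1 hour at 1000 samples/sec
raw_mb = samples * 74.51 / 1e6        # 74.51 bytes/sample uncompressed
compressed_mb = samples * 0.47 / 1e6  # 0.47 bytes/sample after encoding+zstd

print(round(raw_mb, 1))         # ~268.2 MB
print(round(compressed_mb, 2))  # ~1.69 MB
```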

Raw numbers from the run:

Captured 5,659,261 samples in 28.41 seconds
Sample rate: 199,175.43 samples/sec
Error rate: 0.31%

Binary Encoding:
  Records:          111,957
    RLE repeat:     30,314 (27.1%) [5,559,903 samples]
    Full stack:     4,426 (4.0%)
    Suffix match:   22,065 (19.7%)
    Pop-push:       55,152 (49.3%)

Frame Efficiency:
  Frames written:   140,547
  Frames saved:     192,318,495 (99.9%)

The encoding stats show RLE (run-length encoding) is working well: 27% of records are RLE repeats, covering 5.5M samples. The frame efficiency numbers show the encoding schemes save 99.9% of frame writes.

pablogsal and others added 2 commits December 15, 2025 18:36
Merged changes from upstream/main including:
- Subprocess enumeration functionality (get_child_pids, is_python_process)
- Various fixes and improvements

Combined with file-output branch features:
- Binary I/O writer and reader for profiling data
- Binary format export/replay support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>