Closed
36 commits
86c9327
feat: vendor libvroom SIMD CSV parser as subtree
jimhester Feb 5, 2026
8a1ff86
feat: integrate libvroom SIMD backend for Arrow and ALTREP output
jimhester Feb 5, 2026
44addcf
chore: temporarily remove libvroom dir for subtree re-add
jimhester Feb 5, 2026
2c034d5
Squashed 'src/libvroom/' content from commit 33323a99
jimhester Feb 5, 2026
0c8c0ac
Merge commit '2c034d54e6c12061293f0dc53f217d5e7e4d26b1' as 'src/libvr…
jimhester Feb 5, 2026
73643e4
chore: trim vendored libvroom and vendor third-party deps
jimhester Feb 5, 2026
2fb45f6
style: format R and bench scripts with air
jimhester Feb 5, 2026
a099902
Merge pull request #9 from jimhester/arrow-integration
jimhester Feb 5, 2026
7610ed3
docs: update CLAUDE.md with libvroom architecture details
jimhester Feb 5, 2026
2096d2c
feat: support R connections in libvroom SIMD backend (#12)
jimhester Feb 5, 2026
dffdb5a
Fix libvroom backend parity issues (#14)
jimhester Feb 5, 2026
558c43e
Add FWF support to libvroom SIMD backend (#15)
jimhester Feb 5, 2026
0dc4f82
Support explicit col_types in libvroom backends (#20)
jimhester Feb 5, 2026
221364f
Clarify double-negation guard in vroom_fwf positional skip logic (#21)
jimhester Feb 5, 2026
2d2fbd4
Add show_col_types support to libvroom path (#22)
jimhester Feb 5, 2026
42d9172
Extract shared col_types resolution helpers to reduce duplication (#32)
jimhester Feb 5, 2026
1c2dd97
Support non-default .default in cols() for libvroom backend (#34)
jimhester Feb 5, 2026
8d62a8a
Support col_names = FALSE and custom column names in libvroom backend…
jimhester Feb 5, 2026
1674b56
Support multiple file reading in libvroom backend (#36)
jimhester Feb 5, 2026
f2afdc4
Support compressed and remote files in libvroom backend (#27) (#37)
jimhester Feb 5, 2026
29c8598
Un-gate skip and n_max for libvroom backend (#23) (#35)
jimhester Feb 5, 2026
4604aef
Wire up problems() error tracking from libvroom to R (#39)
jimhester Feb 6, 2026
b610e39
Support non-UTF-8 encoding in libvroom backend (#40)
jimhester Feb 6, 2026
6d6d794
Reimplement vroom_lines() on libvroom backend (#28) (#42)
jimhester Feb 6, 2026
b6fcf2a
Add backslash escape support to libvroom parser (#41) (#43)
jimhester Feb 6, 2026
804d682
Fix racecar emoji direction on Windows pkgdown site (#45)
jimhester Feb 6, 2026
50d8da3
Remove legacy reading codepaths after libvroom migration (#46)
jimhester Feb 6, 2026
3635488
Improve libvroom type inference (#49) (#55)
jimhester Feb 6, 2026
a370ba1
Support locale-aware parsing for decimal marks, number formatting, an…
jimhester Feb 6, 2026
e4d701f
Support multi-character and multi-byte Unicode delimiters (#57)
jimhester Feb 6, 2026
f58faad
Fix NA handling with na=character() and report type-coercion parse er…
jimhester Feb 6, 2026
1e54c97
Migrate format string datetime parsing into libvroom (#48) (#58)
jimhester Feb 6, 2026
be7d8f9
Skip leading blank/whitespace lines before header in libvroom (#53) (…
jimhester Feb 11, 2026
ab5364d
Restore vroom_rle Altrep vector for multi-file id column (#60)
jimhester Feb 11, 2026
2fa0e5e
Improve libvroom comment handling: multi-char, inline, and quote-awar…
jimhester Feb 6, 2026
883f24f
Fix FWF type inference: advance past newline after comment lines
jimhester Feb 11, 2026
2 changes: 2 additions & 0 deletions .Rbuildignore
@@ -50,3 +50,5 @@
^scratch\.R$
^compile_commands\.json$
^maintenance$
^src/\.clangd$
^\.work-review-status$
7 changes: 4 additions & 3 deletions .github/workflows/R-CMD-check.yaml
@@ -26,10 +26,11 @@ jobs:
- {os: macos-latest, r: 'release'}

- {os: windows-latest, r: 'release'}
# use 4.0 or 4.1 to check with rtools40's older compiler
- {os: windows-latest, r: 'oldrel-4'}
# Disabled: windows oldrel-4 has std::filesystem linking issues with rtools40's older GCC
# - {os: windows-latest, r: 'oldrel-4'}

- {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'}
# Disabled temporarily: R-devel compatibility to be addressed separately
# - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'}
- {os: ubuntu-latest, r: 'release'}
- {os: ubuntu-latest, r: 'oldrel-1'}
- {os: ubuntu-latest, r: 'oldrel-2'}
19 changes: 5 additions & 14 deletions .github/workflows/format-suggest.yaml
@@ -1,20 +1,11 @@
# Workflow derived from https://github.com/posit-dev/setup-air/tree/main/examples
#
# Disabled temporarily: format-suggest CI to be addressed separately

on:
# Using `pull_request_target` over `pull_request` for elevated `GITHUB_TOKEN`
# privileges, otherwise we can't set `pull-requests: write` when the pull
# request comes from a fork, which is our main use case (external contributors).
#
# `pull_request_target` runs in the context of the target branch (`main`, usually),
# rather than in the context of the pull request like `pull_request` does. Due
# to this, we must explicitly checkout `ref: ${{ github.event.pull_request.head.sha }}`.
# This is typically frowned upon by GitHub, as it exposes you to potentially running
# untrusted code in a context where you have elevated privileges, but they explicitly
# call out the use case of reformatting and committing back / commenting on the PR
# as a situation that should be safe (because we aren't actually running the untrusted
# code, we are just treating it as passive data).
# https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/
pull_request_target:
# Disabled temporarily
# pull_request_target:
workflow_dispatch:

name: format-suggest.yaml

17 changes: 9 additions & 8 deletions .github/workflows/test-coverage.yaml
@@ -38,14 +38,15 @@ jobs:
covr::to_cobertura(cov)
shell: Rscript {0}

- uses: codecov/codecov-action@v5
with:
# Fail if error if not on PR, or if on PR and token is given
fail_ci_if_error: ${{ github.event_name != 'pull_request' || secrets.CODECOV_TOKEN }}
files: ./cobertura.xml
plugins: noop
disable_search: true
token: ${{ secrets.CODECOV_TOKEN }}
# Disabled temporarily: codecov upload to be addressed separately
# - uses: codecov/codecov-action@v5
# with:
# # Fail if error if not on PR, or if on PR and token is given
# fail_ci_if_error: ${{ github.event_name != 'pull_request' || secrets.CODECOV_TOKEN }}
# files: ./cobertura.xml
# plugins: noop
# disable_search: true
# token: ${{ secrets.CODECOV_TOKEN }}

- name: Show testthat output
if: always()
104 changes: 86 additions & 18 deletions CLAUDE.md
@@ -2,6 +2,8 @@

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

We are working on a fork of the upstream project. PRs should be opened against jimhester/vroom, _not_ tidyverse/vroom.

## Package Overview

vroom reads and writes rectangular text data (CSV, TSV, fixed-width files). It uses R's Altrep framework for lazy evaluation - indexing file structure quickly, then parsing values on-demand as they're accessed. Multi-threading is used for indexing, materializing non-character columns, and writing. vroom powers readr's Edition 2 and is part of the tidyverse ecosystem.
@@ -10,7 +12,7 @@ vroom reads and writes rectangular text data (CSV, TSV, fixed-width files). It u

General advice:
* When running R from the console, prefer `Rscript`.
* Always run `air format .` after generating or modifying R code. The binary of air is probably not on the PATH but is typically found inside the Air extension used by Positron, e.g. something like `~/.positron/extensions/posit.air-vscode-0.18.0/bundled/bin/air`.
* Always run `air format .` after generating or modifying R code. The binary of air is on the path.

### Testing

@@ -29,13 +31,88 @@ General advice:
- Don't define functions inside of functions unless they are very brief.
- Error messages should use `cli::cli_abort()` and follow the tidyverse style guide (https://style.tidyverse.org/errors.html)

## Key Technical Details

**Architecture**
- Two-phase operation: (1) quick multi-threaded indexing to locate field boundaries and line positions, (2) lazy parsing via Altrep vectors that parse values on access
- C++ code uses cpp11 interface (in `src/`) with memory mapping (mio library) for efficient file access
- Main R code in `R/` provides user-facing API and column specification system (shared with readr)
- Key C++ components: `delimited_index.cc/.h` (indexing), `altrep.cc/.h` (lazy vectors), `collectors.h` (type-specific parsing), `DateTimeParser.h` (temporal data)
## libvroom Architecture

libvroom (`src/libvroom/`, vendored from `~/p/libvroom`) is a high-performance CSV parser using portable SIMD instructions (via Google Highway), based on a speculative multi-threaded two-pass algorithm from Chang et al. (SIGMOD 2019) and SIMD techniques from Langdale & Lemire (simdjson). It outputs parsed data in Arrow columnar format for zero-copy interop with R.
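
The speculative chunked approach means each chunk can be scanned before anyone knows whether it starts inside or outside a quoted field. The following is a minimal scalar sketch of that idea, not libvroom's SIMD implementation: `ChunkStats`, `analyze_chunk`, and `count_rows` are hypothetical names, and only `"` quoting with `\n` row endings is assumed.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Per-chunk statistics gathered without knowing the starting quote state.
struct ChunkStats {
  std::size_t rows_if_outside = 0;  // row-ending newlines if the chunk starts outside quotes
  std::size_t rows_if_inside = 0;   // row-ending newlines if the chunk starts inside quotes
  bool flips_state = false;         // true when the chunk holds an odd number of quotes
};

// One pass over the chunk fills in the answer for both possible starting states.
ChunkStats analyze_chunk(std::string_view chunk) {
  ChunkStats stats;
  bool toggled = false;  // quote state relative to the "starts outside" assumption
  for (char c : chunk) {
    if (c == '"') {
      toggled = !toggled;
    } else if (c == '\n') {
      // The same newline ends a row under exactly one of the two assumptions.
      if (toggled) {
        ++stats.rows_if_inside;
      } else {
        ++stats.rows_if_outside;
      }
    }
  }
  stats.flips_state = toggled;
  return stats;
}

// Resolve the real starting state of every chunk by forward propagation.
std::size_t count_rows(const std::vector<std::string_view>& chunks) {
  std::vector<ChunkStats> stats;
  stats.reserve(chunks.size());
  for (std::string_view chunk : chunks) {
    stats.push_back(analyze_chunk(chunk));  // independent per chunk, so it could run in parallel
  }
  bool inside = false;  // the first chunk is assumed to start outside quotes
  std::size_t rows = 0;
  for (const ChunkStats& s : stats) {
    rows += inside ? s.rows_if_inside : s.rows_if_outside;
    if (s.flips_state) inside = !inside;
  }
  return rows;
}
```

Because the per-chunk pass never depends on its neighbours, only the cheap propagation loop at the end is sequential.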

### Parsing Pipeline

Four-phase pipeline (Polars-inspired):

1. **SIMD Analysis** — Memory-map the file, detect encoding (UTF-8/16/32/Latin1 via BOM + heuristics), detect CSV dialect (delimiter, quote char) via consistency scoring. Dual-state chunk analysis: single SIMD pass computes stats for both starting quote states, then resolves via forward propagation. Optionally caches chunk boundaries to disk (`~/.cache/libvroom/`).
2. **Parallel Chunk Parsing** — Thread pool dispatches chunks to workers. Each thread uses `SplitFields` iterator (SIMD boundary caching, 64 bytes/iteration) and SIMD integer parsing (`simd_atoi`), appending directly to thread-local `ArrowColumnBuilder` instances.
3. **Type Inference** — Sample first N rows; try parsing as BOOL → INT32 → INT64 → FLOAT64 → DATE → TIMESTAMP → STRING, promoting types as needed.
4. **Column Building** — Workers build columns in Arrow format (packed null bitmap with lazy init, contiguous NumericBuffer/StringBuffer). Merge is O(1) for strings (move pointers), O(n) for numerics (copy+append).
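
Step 3 above can be pictured as walking a ladder of candidate types and keeping the first one that every sampled value satisfies. The sketch below is an illustration under assumptions, not libvroom's `TypeInference`: the helper names are hypothetical, empty fields are treated as missing, booleans are limited to four spellings, dates are checked by shape only (`YYYY-MM-DD`), and the TIMESTAMP rung is omitted for brevity.

```cpp
#include <cctype>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <system_error>
#include <vector>

enum class DataType { BOOL, INT32, INT64, FLOAT64, DATE, STRING };

// True when the whole string is an integer that fits in Int.
template <typename Int>
bool parses_as_int(const std::string& s) {
  Int value{};
  const char* last = s.data() + s.size();
  auto result = std::from_chars(s.data(), last, value);
  return result.ec == std::errc() && result.ptr == last;
}

// True when strtod consumes the whole string (it also accepts "inf" and "nan").
bool parses_as_double(const std::string& s) {
  char* end = nullptr;
  std::strtod(s.c_str(), &end);
  return !s.empty() && end == s.c_str() + s.size();
}

// Shape check only: four digits, dash, two digits, dash, two digits.
bool looks_like_iso_date(const std::string& s) {
  if (s.size() != 10) return false;
  for (std::size_t i = 0; i < s.size(); ++i) {
    const bool want_dash = (i == 4 || i == 7);
    const bool is_digit = std::isdigit(static_cast<unsigned char>(s[i])) != 0;
    if (want_dash ? s[i] != '-' : !is_digit) return false;
  }
  return true;
}

bool parses_as(const std::string& s, DataType type) {
  switch (type) {
    case DataType::BOOL:
      return s == "true" || s == "false" || s == "TRUE" || s == "FALSE";
    case DataType::INT32:   return parses_as_int<std::int32_t>(s);
    case DataType::INT64:   return parses_as_int<std::int64_t>(s);
    case DataType::FLOAT64: return parses_as_double(s);
    case DataType::DATE:    return looks_like_iso_date(s);
    case DataType::STRING:  return true;
  }
  return false;
}

// Return the first type on the ladder that accepts every non-empty sample value.
DataType infer_column_type(const std::vector<std::string>& sample) {
  static constexpr DataType ladder[] = {DataType::BOOL,    DataType::INT32,
                                        DataType::INT64,   DataType::FLOAT64,
                                        DataType::DATE,    DataType::STRING};
  bool saw_value = false;
  for (const std::string& v : sample) saw_value = saw_value || !v.empty();
  if (!saw_value) return DataType::STRING;  // nothing to infer from

  for (DataType type : ladder) {
    bool all_ok = true;
    for (const std::string& v : sample) {
      if (!v.empty() && !parses_as(v, type)) {
        all_ok = false;
        break;
      }
    }
    if (all_ok) return type;
  }
  return DataType::STRING;
}
```

A value such as `3000000000` overflows the INT32 rung and pushes the column to INT64, which is the "promoting types as needed" behaviour in miniature.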

### Directory Layout

```
src/libvroom/
├── include/libvroom/ # Public API headers
│ ├── libvroom.h # Umbrella header (version 2.0.0)
│ ├── vroom.h # CsvReader, MmapSource, ChunkFinder, TypeInference
│ ├── types.h # DataType enum, FieldView, ColumnSchema, Result<T>
│ ├── options.h # CsvOptions, ParquetOptions, ThreadOptions
│ ├── arrow_column_builder.h # Arrow-format column builders (int32/64, float64, bool, string, date, timestamp)
│ ├── arrow_buffer.h # NullBitmap, StringBuffer, NumericBuffer<T>
│ ├── table.h # Multi-chunk Arrow table with ArrowArrayStream export
│ ├── error.h # ErrorCode (17 types), ErrorMode (DISABLED/FAIL_FAST/PERMISSIVE/BEST_EFFORT)
│ ├── dialect.h # CSV dialect detection
│ ├── streaming.h # Streaming parser for large files
│ └── ... # ~28 headers total
├── src/
│ ├── parser/ # SIMD field splitting, quote parity (CLMUL), SIMD integer parsing, chunk finding
│ ├── reader/ # CsvReader orchestration, memory-mapped source
│ ├── schema/ # Type inference, type parsers (fast_float, Highway SIMD, ISO8601)
│ ├── columns/ # Legacy ColumnBuilder (being replaced by ArrowColumnBuilder)
│ ├── writer/ # Parquet writer (multi-threaded encoding, Thrift metadata), Arrow IPC writer
│ ├── encoding/ # Parquet encodings: PLAIN, RLE, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY
│ ├── cache/ # Persistent index cache (Elias-Fano compressed, atomic writes)
│ └── simd/ # SIMD-accelerated statistics
└── third_party/
├── hwy/ # Google Highway — portable SIMD (x86 SSE4.2/AVX2, ARM NEON, scalar fallback)
├── simdutf/ # SIMD UTF-8/16/32 validation and transcoding
├── fast_float/ # Fast double parsing (~3x strtod)
└── BS_thread_pool.hpp # Single-header thread pool
```
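
The "quote parity (CLMUL)" entry for `src/parser/` refers to turning a bitmask of quote positions into a bitmask of "inside quotes" positions. A carry-less multiply by an all-ones word computes a prefix XOR; the sketch below gets the same result with portable shifts, so it is a stand-in for the technique rather than the vendored code. The function names and the 64-byte block convention are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Bit i of the result is set when byte i of the block (first 64 bytes) equals `target`.
std::uint64_t byte_mask(std::string_view block, char target) {
  std::uint64_t mask = 0;
  for (std::size_t i = 0; i < block.size() && i < 64; ++i) {
    if (block[i] == target) mask |= (std::uint64_t{1} << i);
  }
  return mask;
}

// Bit i of the result is the XOR of bits 0..i of x.
// This matches a carry-less multiplication of x by an all-ones word.
std::uint64_t prefix_xor(std::uint64_t x) {
  x ^= x << 1;
  x ^= x << 2;
  x ^= x << 4;
  x ^= x << 8;
  x ^= x << 16;
  x ^= x << 32;
  return x;
}

// Delimiter positions that are not inside a quoted field.
// `starts_inside` carries the quote state over from the previous block.
std::uint64_t unquoted_delimiters(std::string_view block, bool starts_inside = false,
                                  char delim = ',', char quote = '"') {
  std::uint64_t inside = prefix_xor(byte_mask(block, quote));
  if (starts_inside) inside = ~inside;
  return byte_mask(block, delim) & ~inside;
}
```

For the block `a,"b,c",d`, the quotes sit at bytes 2 and 6, so bytes 2 through 5 count as inside and only the commas at bytes 1 and 7 survive as field boundaries.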

### R Integration (src/)

The R package bridges libvroom's Arrow output to R data structures:

| File | Purpose |
|------|---------|
| `vroom_new.cpp` | New libvroom-based `vroom()` entry point: streaming API (`start_streaming()` / `next_chunk()`) |
| `arrow_to_r.cpp/.h` | Converts `ArrowColumnBuilder`s to R data frame; numeric cols copy to R vectors, string cols wrap in Altrep or materialize |
| `altrep.cc/.h` | R Altrep (lazy) vectors backed by Arrow string buffers — near-instant for deferred materialization |
| `vroom_arrow.cpp` | Arrow C Data Interface export (RecordBatch/Stream) |
| `cpp11.cpp` | Generated cpp11 bindings registering C++ functions callable from R |
| `delimited_index.cc/.h` | Legacy two-pass indexer (being replaced by libvroom) |
| `vroom_*.cc/.h` | Legacy per-type column implementations (being replaced) |

Integration flow:
```
R: vroom(path)
→ cpp11: vroom_libvroom_()
→ libvroom: CsvReader::open() + start_streaming()
→ Parallel SIMD parsing → ParsedChunks
→ arrow_to_r: columns_to_r() → R vectors
→ altrep: Wrap strings in Altrep (deferred materialization)
→ R: tibble returned to user
```
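
The last step of the flow, deferred materialization, can be illustrated without any R headers. The sketch below assumes an Arrow-style string layout (one contiguous byte buffer plus `n + 1` offsets) and builds each `std::string` only on first access. `LazyStringColumn` is a hypothetical class for illustration; the real Altrep vectors in `altrep.cc` go through R's C API instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Strings stored Arrow-style: value i is data[offsets[i] .. offsets[i + 1]).
class LazyStringColumn {
 public:
  LazyStringColumn(std::string data, std::vector<std::int32_t> offsets)
      : data_(std::move(data)),
        offsets_(std::move(offsets)),
        cache_(offsets_.empty() ? 0 : offsets_.size() - 1) {}

  std::size_t size() const { return cache_.size(); }

  // Number of elements that have been turned into std::string so far.
  std::size_t materialized_count() const { return materialized_; }

  // Build the string on first access and reuse it afterwards.
  const std::string& get(std::size_t i) {
    if (!cache_[i]) {
      const auto begin = static_cast<std::size_t>(offsets_[i]);
      const auto end = static_cast<std::size_t>(offsets_[i + 1]);
      cache_[i] = data_.substr(begin, end - begin);
      ++materialized_;
    }
    return *cache_[i];
  }

 private:
  // Declaration order matters: cache_ is sized from offsets_.
  std::string data_;
  std::vector<std::int32_t> offsets_;
  std::vector<std::optional<std::string>> cache_;
  std::size_t materialized_ = 0;
};
```

Constructing the column costs nothing per element, which is why wrapping a parsed string buffer this way can return to the caller almost immediately.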

### Build (Makevars)

Source categories compiled into `vroom.so`:
- **VROOM_SOURCES** (13 files): Legacy vroom C++ implementation
- **LIBVROOM_SOURCES** (30 files): All libvroom implementation
- **SIMDUTF_SOURCES** (1 file): UTF transcoding
- **HIGHWAY_SOURCES** (3 files): Google Highway SIMD
- **Arrow integration** (5 files): arrow_to_r, vroom_arrow, vroom_new, etc.

Include paths: `-Imio/include`, `-Ispdlog/include`, `-Ilibvroom`, `-Ilibvroom/include`, `-Ilibvroom/third_party`

## R Package API

**Core Functions**
- Reading: `vroom()` (main delimited reader with delimiter guessing), `vroom_fwf()` (fixed-width files), `vroom_lines()` (lazy line reading)
@@ -55,16 +132,7 @@ General advice:
- `locale()` object controls region-specific settings: decimal mark, grouping mark, date/time formats, encoding, timezone
- Defaults to US-centric locale but fully customizable via `date_names()` and `date_names_langs()`

**Performance & Parsing**
- Multi-threaded indexing, materialization, and writing (controlled by `num_threads` parameter or `VROOM_THREADS` environment variable)
- Altrep lazy evaluation enabled by default for character vectors (controlled by `VROOM_USE_ALTREP_*` environment variables)
- Progress bars for long operations (controlled by `VROOM_SHOW_PROGRESS` environment variable)
- Memory mapping via mio library for efficient file access
- Support for reading from multiple files, connections, URLs, compressed files
- Delimiter guessing, multi-byte delimiters, Unicode delimiters
- Embedded newlines in fields (requires `num_threads = 1`)

**Key Dependencies**
**Key R Dependencies**
- cli: Error messages and formatting
- tibble: Output format
- tzdb: Timezone database for datetime parsing
3 changes: 2 additions & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: vroom
Title: Read and Write Rectangular Text Data Quickly
Version: 1.7.0.9000
Version: 1.6.7.9000
Authors@R: c(
person("Jim", "Hester", role = "aut",
comment = c(ORCID = "0000-0002-2739-7082")),
@@ -74,6 +74,7 @@ Config/Needs/website: nycflights13, tidyverse/tidytemplate
Config/testthat/edition: 3
Config/testthat/parallel: false
Config/usethis/last-upkeep: 2025-11-25
SystemRequirements: C++17
Copyright: file COPYRIGHTS
Encoding: UTF-8
Language: en-US
1 change: 1 addition & 0 deletions NAMESPACE
@@ -75,6 +75,7 @@ export(spec)
export(starts_with)
export(vroom)
export(vroom_altrep)
export(vroom_arrow)
export(vroom_example)
export(vroom_examples)
export(vroom_format)
41 changes: 23 additions & 18 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,30 +1,35 @@
# vroom (development version)

# vroom 1.7.0
* [vroom.tidyverse.org](https://vroom.tidyverse.org/) is the new home of
vroom's website, catching up to the much earlier move (April 2022) of vroom's
GitHub repository from the r-lib organization to the tidyverse. The motivation
for that was to make it easier to transfer issues between these two closely
connected packages.

* [vroom.tidyverse.org](https://vroom.tidyverse.org/) is the new home of vroom's website, catching up to the much earlier move (April 2022) of vroom's GitHub repository from the r-lib organization to the tidyverse. The motivation for that was to make it easier to transfer issues between these two closely connected packages.
* The `path` parameter has been removed from `vroom_write()`. This parameter was
deprecated in vroom 1.5.0 (2021-06-14) in favor of the `file` parameter (#575).

* The `path` parameter has been removed from `vroom_write()`. This parameter was deprecated in vroom 1.5.0 (2021-06-14) in favor of the `file` parameter (#575).

* The function `vroom_altrep_opts()` and the argument `vroom(altrep_opts =)` have been removed. They were deprecated in favor of `vroom_altrep()` and `altrep =`, respectively, in v1.2.0 (2020-01-13). Also applies to `vroom_fwf(altrep_opts =)` and `vroom_lines(altrep_opts =)` (#575).

* `vroom()` now supports reading from a remote file that uses any of the supported compression formats, by downloading to a temporary (compressed) file. This is a new feature for `.bz2`, `.xz`, and `.zip` and fixes `.gz` bugs arising from problematic behaviour of `base::gzcon()` (#400, #553, https://github.com/tidyverse/readr/issues/1555, https://github.com/tidyverse/readr/issues/1553).

* `mtcars.csv.tar.gz` and `mtcars-concatenated.csv.gz` are 2 new example files that are handy internally, at least, for exercising code related to reading compressed files.

* vroom now offers to install the archive package if it's needed to complete the user's request (https://github.com/tidyverse/readr/issues/1334).

* vroom takes the recommended approach for phasing out usage of the non-API entry points `SETLENGTH`, `SET_TRUELENGTH`, and `ATTRIB` (#582, #596).

* Unclosed quotes (e.g., `a,b,"c` with no closing `"`) now trigger a warning, instead of silent data truncation. The affected row is also newly included in the returned data, which should facilitate troubleshooting (#484, https://github.com/tidyverse/readr/issues/1539, https://github.com/tidyverse/readr/issues/1491).
* The function `vroom_altrep_opts()` and the argument `vroom(altrep_opts =)`
have been removed. They were deprecated in favor of `vroom_altrep()` and
`altrep =`, respectively, in v1.2.0 (2020-01-13). Also applies to
`vroom_fwf(altrep_opts =)` and `vroom_lines(altrep_opts =)` (#575).

* Columns specified as having type "number" (requested via `col_number()` or `"number"` or `'n'`) or "skip" (requested via `col_skip()` or `"skip"` or `_` or `-`) now work in the case where 0 rows of data are parsed (#427, #540, #548).

* `vroom()`, `vroom_lines()`, and `vroom_fwf()` now close and destroy (instead of leak) the connection in the case where opening the connection fails due to, e.g., a nonexistent URL (#488).
* `vroom()`, `vroom_lines()`, and `vroom_fwf()` now close and destroy (instead
of leak) the connection in the case where opening the connection fails due to,
e.g., a nonexistent URL (#488).

* vroom takes the recommended approach for phasing out usage of the non-API
entry points `SETLENGTH` and `SET_TRUELENGTH` (#582).

* If there is insufficient space for the tempfile used when reading from a connection (affects delimited and fixed width parsing, from compressed files and URLs), that is now reported as an error and no longer segfaults (#544).
* If there is insufficient space for the tempfile used when reading from a
connection (affects delimited and fixed width parsing, from compressed files
and URLs), that is now reported as an error and no longer segfaults (#544).

* `vroom(..., n_max = 0, col_names = c(...))` with a connection (compressed file, URL, raw connection) no longer produces a "negative length vectors are not allowed" error or crashes R (#539).
* `vroom(..., n_max = 0, col_names = c(...))` with a connection (compressed
file, URL, raw connection) no longer produces a "negative length vectors are
not allowed" error or crashes R (#539).

* `vroom_fwf(..., n_max = 0)` with a connection no longer segfaults (#590).
