Skip to content

feat: vendor libvroom SIMD CSV parser as subtree#613

Closed
jimhester wants to merge 36 commits intotidyverse:mainfrom
jimhester:issue-52-libvroom-improve-comment-handling-multi-char
Closed

feat: vendor libvroom SIMD CSV parser as subtree#613
jimhester wants to merge 36 commits intotidyverse:mainfrom
jimhester:issue-52-libvroom-improve-comment-handling-multi-char

Conversation

@jimhester
Copy link
Collaborator

No description provided.

jimhester and others added 30 commits February 4, 2026 20:57
Add libvroom — a high-performance CSV parser using portable SIMD (Google
Highway) with a two-pass speculative algorithm. Includes:
- Multi-threaded chunk-based parsing with streaming API
- Arrow columnar output via C Data Interface
- Type inference, dialect detection, encoding detection
- Null bitmap support, string/numeric column builders

Vendored under src/libvroom/ with third-party deps (Highway, simdutf).

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Wire vroom to use libvroom's streaming CsvReader for single-file reads:

- vroom_arrow(): streams parsed chunks via Arrow C Data Interface to
  produce Arrow Tables, with pipeline parallelism (parsing overlaps
  with R-side import)
- vroom(): streams chunks to build tibbles with multi-chunk ALTREP
  for strings (zero-copy, deferred interning) and direct memcpy for
  numeric columns

New files: vroom_arrow.cpp, vroom_new.cpp, arrow_to_r.cpp/h,
vroom_arrow_chr.h, vroom_dict_chr.h, r_abort.cc, test-libvroom.R

Also includes build system (Makevars), R wrappers, documentation,
benchmark scripts, and CI updates.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
git-subtree-dir: src/libvroom
git-subtree-split: 33323a99743088e8361d740fabaee32bd6d203cb
- Trim unneeded files from libvroom subtree (tests, benchmarks,
  docs, python, CLI-only sources, CI config)
- Vendor third-party deps via update-libvroom.sh: Highway (SIMD),
  fast_float, BS::thread_pool, simdutf (amalgamation)
- Add -Ilibvroom to PKG_CPPFLAGS for HWY_TARGET_INCLUDE resolution
- Rewrite update-libvroom.sh to use cmake FetchContent for
  automated dependency vendoring

Co-Authored-By: Claude Opus 4.5 <[email protected]>
feat: integrate libvroom SIMD backend with Arrow output
Replace the old "Key Technical Details" section with detailed
documentation of the vendored libvroom library: SIMD parsing pipeline,
directory layout, R integration layer, and build system. Reorganize
R-facing API docs under a separate "R Package API" heading.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
* feat: support R connections in libvroom SIMD backend

Connections (file(), gzfile(), rawConnection(), etc.) are now read into
raw vectors in R, then passed to C++ where an AlignedBuffer is created
for libvroom's open_from_buffer() API. This avoids falling back to the
old vroom_() parser when connections are used with eligible settings.

Closes #10

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* review: add backend assertions, handle empty connections, clean up style

- Add expect_false("spec_tbl_df" %in% class()) assertions to verify
  libvroom backend is actually used in new connection tests
- Handle empty connections with early return before C++ call
- Restructure can_use_libvroom() to avoid empty if body

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* fix: treat col_types = list() as NULL in test_libvroom helper

The test_libvroom helper was defaulting to col_types = list() which
caused can_use_libvroom() to return FALSE, so these tests were never
actually exercising the libvroom backend. Change the helper default to
col_types = NULL and update test expectations to match libvroom's actual
type inference behavior (integer inference needs multiple rows, no
whitespace trimming).

Keep can_use_libvroom() requiring col_types = NULL (not list()) to avoid
breaking the broader test suite that relies on list() for old-parser
expectations. The col_types = list() equivalence can be addressed
separately once type inference parity is resolved.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>
* fix: correct libvroom type inference for single-row files and integer default

Remove redundant header-row skip in infer_from_sample() that caused
single-row files to have their only data row skipped (all types fell
through to STRING). Callers already pass data + header_end_offset.

Add guess_integer option (default false) to CsvOptions for R parity:
the old parser defaults to guess_integer = FALSE, meaning "1" infers
as double. Without this, libvroom always tried INT32 first.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* feat: add whitespace trimming to libvroom chunk parser

libvroom's SplitFields iterator returned raw field data without trimming
whitespace, causing "  1  " to fail numeric parsing and "  hello  " to
retain surrounding spaces. Add trim_ws option (default true) to CsvOptions
and trim leading/trailing spaces and tabs in both parse_chunk_with_state
(parallel path) and read_all_serial (serial path) before creating the
field_view. Pass trim_ws from R through the C++ binding.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* feat: accept col_types=list() in libvroom backend and fix use_libvroom semantics

- can_use_libvroom() now accepts col_types=list() as equivalent to NULL
  (both mean "guess all column types"), allowing the libvroom backend to
  handle this common calling convention.
- Change use_libvroom default from FALSE to NULL. NULL auto-detects,
  TRUE prefers libvroom, FALSE forces the legacy parser. Previously the
  parameter was always overwritten by can_use_libvroom(), so users could
  not opt out.
- Update test_vroom helper and affected tests to use use_libvroom=FALSE
  so legacy-parser-specific tests continue exercising the old backend.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* fix: address code review findings

- Add RFC 4180 comments explaining trim-before-unquote ordering
- Warn when use_libvroom=TRUE but can_use_libvroom() returns FALSE
- Suppress unused return value warning on unique_ptr::release()

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* docs: regenerate Rd for use_libvroom default change

Co-Authored-By: Claude Opus 4.5 <[email protected]>

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>
* feat: support R connections in libvroom SIMD backend

Connections (file(), gzfile(), rawConnection(), etc.) are now read into
raw vectors in R, then passed to C++ where an AlignedBuffer is created
for libvroom's open_from_buffer() API. This avoids falling back to the
old vroom_() parser when connections are used with eligible settings.

Closes #10

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* review: add backend assertions, handle empty connections, clean up style

- Add expect_false("spec_tbl_df" %in% class()) assertions to verify
  libvroom backend is actually used in new connection tests
- Handle empty connections with early return before C++ call
- Restructure can_use_libvroom() to avoid empty if body

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* fix: treat col_types = list() as NULL in test_libvroom helper

The test_libvroom helper was defaulting to col_types = list() which
caused can_use_libvroom() to return FALSE, so these tests were never
actually exercising the libvroom backend. Change the helper default to
col_types = NULL and update test expectations to match libvroom's actual
type inference behavior (integer inference needs multiple rows, no
whitespace trimming).

Keep can_use_libvroom() requiring col_types = NULL (not list()) to avoid
breaking the broader test suite that relies on list() for old-parser
expectations. The col_types = list() equivalence can be addressed
separately once type inference parity is resolved.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* feat: add FWF support to libvroom SIMD backend

Add fixed-width file reading to the libvroom SIMD-accelerated backend,
mirroring how vroom_libvroom_() handles CSV. FWF is simpler than CSV
(no quoting, no delimiter detection), so parallel parsing just needs
newline scanning for chunk boundaries.

New files:
- src/libvroom/src/reader/fwf_reader.cpp: core FWF reader with serial
  and parallel parsing via thread pool
- src/vroom_fwf_libvroom.cpp: R-to-C++ bridge

Modified files:
- options.h: FwfOptions struct
- parse_utils.h: NullChecker overloads for FwfOptions and string_view
- vroom.h: FwfReader class declaration
- vroom_fwf.R: can_use_libvroom_fwf() gate and dispatch
- Makevars: build new sources
- test-vroom_fwf.R: 12 new test cases

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* feat: support R connections in libvroom FWF backend

Widen the FWF dispatch gate to accept connections (file(), gzfile(),
rawConnection()) matching the CSV backend pattern. R reads connection
data into a raw vector via read_connection_raw(), then C++ creates an
AlignedBuffer and parses via FwfReader::open_from_buffer().

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* feat: support skip, n_max, and id in libvroom FWF backend

Add skip and max_rows fields to FwfOptions and implement them in
FwfReader. skip advances past N data lines after comment-skipping.
max_rows limits parse_fwf_chunk() and forces serial parsing to
coordinate the global row count. The id column is handled in R by
prepending the file path (or NA for connections) after libvroom returns.

Remove skip, n_max, and id checks from can_use_libvroom_fwf() so only
col_types remains as a fallback trigger to the old parser.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* refactor: use connection_or_filepath() in libvroom FWF dispatch

Call connection_or_filepath() in the dispatch block to convert URLs,
compressed files, and other special paths into connections before
passing to libvroom. This eliminates the URL/compressed/empty-file
gates from can_use_libvroom_fwf(), leaving col_types as the only
remaining fallback to the old parser.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* refactor: address code review feedback

- Extract shared open()/open_from_buffer() logic into initialize_data()
- Fix skip_empty_rows not being respected during type inference
- Remove unused use_altrep parameter from FWF bridge

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* fix: compute actual row count from parsed data, not newline estimate

count_newlines() overestimates when there are empty/comment lines,
which inflated total_rows passed to columns_to_r_chunked(). Now compute
actual_rows from column builder sizes after collecting all chunks.

Also consolidates the two empty-result code paths into one.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* fix: merge main, support col_types=list() in FWF backend

- Merge PR #14 (libvroom parity fixes) into FWF branch
- Add col_types=list() support to can_use_libvroom_fwf()
- Fix col_select to use names directly (no spec attribute)
- Update test expectations for trim_ws and type inference changes

Co-Authored-By: Claude Opus 4.5 <[email protected]>

* fix: handle non-ASCII file paths in libvroom FWF backend

When running in non-UTF-8 locales (e.g., ISO-8859-1), file paths with
non-ASCII characters get mangled when passed to libvroom's C++ code.

Fix by detecting non-ASCII paths and reading them via R connections,
which handle encoding correctly.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>
* feat: add set_schema() to CsvReader and FwfReader in libvroom

Allows callers to override inferred column types after open().
Only overrides types for columns where the provided schema has
a non-UNKNOWN type, preserving inferred types for unspecified columns.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: wire explicit col_types through vroom() to libvroom CSV backend

- Add col_types_to_libvroom() R helper mapping collector classes to
  libvroom DataType integers
- Remove col_types guard from can_use_libvroom() so explicit types
  no longer force fallback to legacy parser
- Update vroom_libvroom_() C++ function to accept col_types and
  col_type_names parameters, applying them via reader.set_schema()
- Support both positional and named type matching

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: col_skip, named cols(), cols_only(), and R-side post-processing

- Handle col_skip by dropping skipped columns from libvroom output
- Support cols() with named columns (partial specification)
- Support cols_only() by dropping unmatched columns
- Support .default in list-based col_types
- Add R-side post-processing for types not native to libvroom:
  col_number(), col_factor(), col_time(), col_big_integer(),
  col_date() with custom format, col_datetime() with custom format

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: wire col_types through vroom_fwf() to libvroom FWF backend

- Remove col_types guard from can_use_libvroom_fwf()
- Resolve col_types spec and pass to C++ as integer vector
- Support compact notation, named cols(), cols_only(), col_skip
- Add R-side post-processing for unmapped types
- Update vroom_libvroom_fwf_() C++ function with col_types parameters

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: add spec and problems attributes to libvroom output

- Add build_libvroom_spec() helper to construct col_spec from
  resolved spec or inferred R column types
- Attach spec attribute to both CSV and FWF libvroom output
- Attach empty problems tibble (libvroom doesn't track parse errors yet)
- Update problems() to handle tibble-format problems attribute
- Update test to handle libvroom backend not emitting parse warnings

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: add smart col_types guard to prevent libvroom regressions

Only use libvroom backend when all col_types map to native libvroom
types (integer, double, character, logical, date, datetime, skip, guess).
Fall back to legacy parser for types requiring R-side post-processing:
factor, number, time, big_integer, and custom date/datetime formats.

This prevents regressions in tests that depend on legacy parser features
like problems tracking, Altrep factor subsetting, and error handling.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: resolve 39 test regressions from libvroom col_types support

- R/vroom.R: Add comment character check to can_use_libvroom() — fall
  back to legacy parser when comment is non-empty since libvroom handles
  mid-field comments differently (truncates at # even inside fields)
- test-connection.R: Update spec_tbl_df class expectations from
  expect_false to expect_true now that libvroom output includes the spec
  attribute and spec_tbl_df class
- test-problems.R: Force legacy parser (use_libvroom = FALSE) for tests
  that depend on vroom_parse_issue warnings, since libvroom doesn't
  track parse problems yet

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: address code review feedback

- Fix .default expansion bug: non-guess/non-skip .default now falls back
  to legacy parser since we can't know column count before the C++ call.
  Previously, .default types were expanded after C++ had already inferred
  types, silently producing wrong types when inference disagreed.
- Fix .default gate for custom date/datetime formats: the gate now rejects
  ALL non-guess/non-skip defaults, not just factor/number/time/big_integer.
- Rename misleading test names: tests for col_number/factor/time/big_integer
  that actually fall back to legacy parser are now named accordingly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: build spec before col_select to reflect full file schema

The spec attribute was built after col_select filtering, so it only
contained selected columns instead of the full file schema. Move spec
construction before col_select to match legacy parser behavior.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: add missing file column to libvroom problems schema

The legacy backend's problems() returns 5 columns (row, col, expected,
actual, file) but the libvroom problems tibble only had 4. Add the file
column for schema parity with downstream code expectations.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Extract the `!(!is.null(resolved_spec) && inherits(..., "collector_skip"))`
condition into a named variable `already_handled_by_cols_only` for readability.

Closes #19

Co-authored-by: Claude Opus 4.6 <[email protected]>
* feat: add spec_from_df() helper to build col_spec from data frame types

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: show col_types in libvroom path of vroom()

The libvroom SIMD backend now builds a col_spec from inferred types,
attaches it as the spec attribute with spec_tbl_df class, and respects
the show_col_types parameter — matching the behavior of the legacy path.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* test: add show_col_types tests for libvroom FWF path

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* refactor: remove dead spec_from_df() (build_libvroom_spec covers this)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
* test: add characterization tests for shared helper refactoring

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* refactor: extract apply_libvroom_col_select() helper

Deduplicates the identical col_select logic from vroom() and vroom_fwf()
libvroom paths into a shared helper in col_types.R.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* refactor: extract col_types resolution, filtering, and post-processing helpers

Deduplicates:
- resolve_libvroom_col_types(): col_types → int vector + names
- filter_cols_only_and_skip(): cols_only and skip-mask column filtering
- finalize_libvroom_result(): problems attribute + spec_tbl_df class
- postprocess_result(): legacy path tibble + select + class + show

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* refactor: extract C++ open_input_source(), apply_schema_overrides(), empty_tibble_from_schema()

Deduplicates the input-open dispatch, schema override application,
and empty-tibble creation from vroom_new.cpp and vroom_fwf_libvroom.cpp
into shared libvroom_helpers.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: restore original guard logic in filter_cols_only_and_skip()

- cols_only branch: remove length(col_type_names) > 0 guard and use
  names(resolved_spec$cols) directly (matches original vroom_fwf.R)
- skip-mask branch: use length(col_type_names) == 0 instead of
  !is_cols_only (matches original vroom.R)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address review feedback on filter_cols_only_and_skip()

- Add guard on cols_only branch: check length(names(resolved_spec$cols)) > 0
  to preserve vroom.R's original defensive check against empty spec names
- Pass id explicitly in vroom.R's apply_libvroom_col_select call for
  consistency with vroom_fwf.R

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: simplify filter_cols_only_and_skip() guards

Use is_cols_only/!is_cols_only as the branch condition rather than
trying to replicate the inconsistent guards from the original code.
Named col_skip() in non-cols_only specs (e.g., cols(x = col_integer(),
y = col_skip())) now correctly filters skipped columns in both
vroom() and vroom_fwf().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Allow `cols(.default = col_character())` and similar to use the libvroom
SIMD backend instead of falling back to the legacy parser.

- Add `default_col_type` parameter to C++ `vroom_libvroom_()` so the
  default type is applied after header parsing determines column count
- Extract `collector_to_libvroom_int()` helper for reuse
- Replace blanket `.default` rejection in `can_libvroom_handle_col_types()`
  with per-type compatibility checks (same rules as explicit columns)
- Handle `.default` expansion in `build_libvroom_spec()` for the spec
  attribute and post-processing in `apply_col_postprocessing()`

Closes #25

Co-authored-by: Claude Opus 4.6 <[email protected]>
…#33)

* Support col_names = FALSE and custom column names in libvroom backend (#24)

Remove the col_names gate in can_use_libvroom() and add R-side column
renaming after the C++ call. When col_names is FALSE, columns are named
X1, X2, etc. When col_names is a character vector, the provided names
are used. Named col_type_names are translated to libvroom's internal
V-names so type overrides apply correctly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Tighten can_use_libvroom() locale gates to prevent encoding/locale regressions

When the col_names gate was removed, tests using col_names = FALSE with
non-default locale settings (encoding, decimal mark, date format) or
raw byte inputs started going through libvroom, which can't handle them.

Add gates for: non-UTF-8 encoding, non-default decimal mark, non-default
date format, and raw byte connections.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Remove unused col_names parameter from can_use_libvroom()

The col_names gate was removed in the previous commit, leaving the
parameter unused. Clean up the signature and call site.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
* test: add failing tests for libvroom multi-file and id column support

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: support multi-file reading and id column in libvroom backend

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address review issues for multi-file libvroom

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Route compressed files (.gz, .bz2, .xz, .zip, .zst) and remote URLs
through the libvroom SIMD backend via connection_or_filepath(), mirroring
the pattern already used by vroom_fwf(). Remove the URL and compressed-
file gates from can_use_libvroom().

Co-authored-by: Claude Opus 4.6 <[email protected]>
* Un-gate comment character for libvroom backend

Remove the conservative return(FALSE) gate in can_use_libvroom() for
comment characters. libvroom's CsvOptions already has a comment field,
csv_reader.cpp already implements skip_leading_comment_lines(), and
vroom_new.cpp already passes the comment through.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Un-gate delimiter auto-detection for libvroom backend

Remove the return(FALSE) gate in can_use_libvroom() that required an
explicit delimiter. libvroom's DialectDetector auto-detects the CSV
dialect when separator='\0'. When delim=NULL, R passes an empty string
which leaves the default sentinel value, triggering auto-detection.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Un-gate skip for libvroom backend

Add skip field to CsvOptions, implement skip_n_lines() in csv_reader.cpp
(applied in both open() and open_from_buffer() after encoding/dialect
detection but before comment line skipping), and wire up the skip
parameter from vroom_new.cpp. Remove the R-side gate in
can_use_libvroom().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Un-gate n_max for libvroom backend

Remove the return(FALSE) gate in can_use_libvroom() for n_max. Implement
R-side row truncation via seq_len() after vroom_libvroom_() returns,
before any post-processing.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Add edge-case tests for skip with CRLF and skip+comment combination

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Move skip before dialect auto-detection in csv_reader

Apply skip_n_lines() before auto_detect_dialect() so that dialect
detection analyzes the actual CSV data rather than skipped metadata
lines. This ensures correct delimiter detection when skip > 0 and
delim = NULL are combined.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Keep delim and comment gates; add skip-all-data fallback

The delimiter auto-detection and comment character gates remain because:
- Removing delim=NULL gate causes many tests using I() to route through
  libvroom, exposing missing features (problems(), spec(), mid-field
  comment handling)
- libvroom only handles full-line comments, not mid-field comments

Add tryCatch around vroom_libvroom_() to gracefully fall back to the
legacy parser when skip consumes all data (e.g., skip=1 on a file with
only a header line).

The skip and n_max gates are successfully removed in this PR. Tests
using use_libvroom=TRUE confirm the underlying delimiter auto-detection
and comment capabilities work correctly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
* test: add failing tests for libvroom problems() tracking

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: enable PERMISSIVE error mode and attach problems from libvroom

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: wire libvroom problems through R-side to problems() API

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
* Deprecate escape_backslash parameter with lifecycle warning

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Support non-UTF-8 encoding in libvroom via R-level transcoding

Remove the encoding gate from can_use_libvroom() and transcode
non-UTF-8 files to UTF-8 at the R level before passing them to
libvroom. This handles both ASCII-compatible encodings (Latin-1,
Windows-1252) and non-ASCII-compatible ones (UTF-16).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Remove escape_backslash deprecation; defer to follow-up issue

Backslash escaping is widely used in TSV files (MySQL/PostgreSQL
exports). Rather than deprecating, libvroom needs native support
for escape_backslash. Remove the deprecation warning and tests;
the escape_backslash gate in can_use_libvroom() remains until
libvroom gains native support.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
* feat: add vroom_lines_libvroom_() C++ entry point

New thin C++ function that uses libvroom CsvReader with SOH delimiter
to parse lines as a single string column. Supports Altrep (lazy) and
materialized paths. Returns character vector directly.

n_max is handled on the R side (same pattern as vroom()) since
CsvOptions does not have max_rows.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Wire up vroom_lines() to dispatch to libvroom backend

Replace the legacy vroom_() call in vroom_lines() with the new
vroom_lines_libvroom_() C++ function. The new implementation handles
compressed/remote files via connections, empty files, and applies
n_max truncation on the R side since CsvOptions doesn't support
max_rows.

Fix NullChecker and LineParser to not treat empty strings as null
when null_values is explicitly empty (no NA values specified). This
ensures vroom_lines(na = character()) correctly returns empty strings
rather than NA for blank lines.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* test: add edge case tests for vroom_lines libvroom backend

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: support multiple file inputs in vroom_lines()

Iterate over all elements of `file` (from standardise_path) instead
of only taking `file[[1]]`. This matches the legacy behavior where
vroom_() would process all inputs.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: move n_max handling into C++ for early chunk termination

Instead of parsing the entire file and truncating R-side, pass n_max
to vroom_lines_libvroom_() which stops collecting chunks once enough
rows are accumulated. Both ALTREP and materialized paths benefit from
early termination.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: decrement n_max budget across files in multi-file loop

When multiple files are provided and n_max is finite, decrement the
remaining row budget after each file and break early once exhausted.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: cap Materialize() iteration by nrows to prevent buffer overflow

When nrows < total chunk rows (e.g. due to n_max capping), Materialize()
would iterate over full chunk sizes, writing past the allocated vector.
Now caps per-chunk iteration using rows_remaining.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Implement SIMD-accelerated backslash escape handling using simdjson's
odd/even backslash run technique. This allows the libvroom backend to
parse CSV files that use backslash escaping (escape_backslash=TRUE)
instead of only doubled-quote escaping.

Co-authored-by: Claude Opus 4.6 <[email protected]>
Add pkgdown/extra.css that moves "Segoe UI Emoji" to the front of the
font stack. Bootstrap 5 lists "Segoe UI" before "Segoe UI Emoji", which
causes Windows to render the racecar emoji (🏎) facing the wrong direction.

Closes #44

Co-authored-by: Claude Opus 4.6 <[email protected]>
* Remove legacy C++ reading code, update build system, regenerate bindings

Delete ~40 legacy C++ source/header files including:
- Legacy entry points (vroom.cc, vroom_fwf.cc)
- Indexing layer (delimited_index, fixed_width_index, index_collection)
- Per-type ALTREP columns (vroom_chr, vroom_int, vroom_dbl, etc.)
- Infrastructure (columns.h, collectors.h, DateTime.h, etc.)

Clean up altrep.cc to keep only vroom_convert() and vroom_str_() which
work generically with all ALTREP vectors. Remove legacy-specific
force_materialization(), vroom_altrep(), and vroom_materialize().

Update Makevars to remove legacy files from VROOM_SOURCES and OBJECTS.
Regenerate cpp11 bindings (R/cpp11.R and src/cpp11.cpp).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Remove use_libvroom parameter and legacy fallback from vroom()

- Remove use_libvroom parameter from function signature and roxygen docs
- Make the libvroom SIMD backend the only reading path
- Remove can_use_libvroom() and can_libvroom_handle_col_types() functions
- Remove FALLBACK_TO_LEGACY sentinel logic
- Delete entire legacy vroom_() fallback code path

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Remove legacy fallback from vroom_fwf(), simplify problems() and col_types

- Remove can_use_libvroom_fwf() and legacy vroom_fwf_() fallback path
- Remove vroom_materialize() call and externalptr handling from problems()
- Remove dead postprocess_result() and vroom_select() from col_types.R

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Remove use_libvroom references from tests and benchmarks

- Rewrite test_vroom() helper to remove legacy parser path
- Remove all use_libvroom parameter usage from test-vroom.R,
  test-problems.R, and test-libvroom.R
- Replace vroom_materialize() calls with equivalent alternatives
- Simplify libvroom_benchmark.R to remove legacy comparison
- Update taxi_writing benchmarks to use lapply materialization

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Fix compilation and restore utility functions after legacy removal

- Restore DateTime.h and DateTimeParser.h (needed by guess_type.cc)
- Extract parsing helpers from deleted ALTREP files into parse_helpers.h/cc
- Create vroom_utils.cpp with has_trailing_newline, utctime_, and
  whitespace_columns_ (still needed by R code)
- Fix vroom_types.h (remove broken include of deleted vroom_errors.h)
- Update guess_type.cc includes for new file layout
- Regenerate cpp11 bindings and documentation
- Accept snapshot changes for updated error messages

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Fix R-side glue code and update tests for libvroom-only reading

With the legacy C++ reading codepaths removed, this commit fixes the
R-side post-processing layer and updates tests to work with libvroom
as the sole parser backend.

R code fixes:
- Add C++ parse_datetime_/parse_date_/parse_time_ using preserved
  DateTimeParser for full readr format compatibility
- Fix factor parsing (levels=NULL, include_na, first-appearance ordering)
- Add parse_number_value() matching readr's parse_number() behavior
- Fix col_names=FALSE V-name translation for named col_types
- Thread locale through to apply_collector for timezone support
- Handle empty data edge cases, col_select with id column
- Wrap C++ parsers in tryCatch so malformed formats produce NA
- Preserve original file paths for compressed multi-file reads

Test updates (47 skips for known libvroom limitations):
- Locale-aware parsing (decimal marks, month/day names, AM/PM)
- Readr format extensions (%., %s, %h, %Z, partial dates)
- Type inference gaps (T/F logical, Inf/-Inf double)
- Multi-character/byte delimiters, comment handling
- NA handling with na=character(), parse error reporting

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Fix DST nonexistent time test to work cross-platform

The test for nonexistent DST times (spring-forward gap) failed on
macOS/Windows because `as.POSIXct(str, tz=tz)` (no format) adjusts the
time, while `as.POSIXct(str, format=fmt, tz=tz)` returns NA. Since vroom
uses the format-based round-trip internally, compute the expected value
the same way so both sides match on all platforms.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
* feat: recognize T/F/t/f as logical values in type inference

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: respect trim_ws option during type inference

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: recognize Inf/+Inf/-Inf as doubles in type inference

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: forward guess_max parameter to libvroom type inference

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: format R code with air

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
…d date guessing (#47) (#56)

Enable 7 previously-skipped tests by forwarding locale settings through
the libvroom pipeline:

- col_double(): Add decimal_mark to CsvOptions and FastArrowContext, using
  fast_float::from_chars_advanced() for native comma-as-decimal support
- col_number(): Pass locale grouping_mark and decimal_mark to
  parse_number_value(), normalizing decimal marks before regex extraction
- Date/time locales: Remove skip guards for French month/day names and
  Korean AM/PM markers (already worked via STRING -> parse_date_/parse_time_)
- Locale date_format guessing: Post-process guessed STRING columns using
  locale$date_format when non-default

Co-authored-by: Claude Opus 4.6 <[email protected]>
Change CsvOptions::separator from char to std::string so that
multi-character (e.g. "||") and multi-byte Unicode (e.g. "❤")
delimiters work end-to-end. The SIMD fast path for single-byte
delimiters is completely unchanged — multi-byte matching only
activates via a scalar fallback when separator.size() > 1.

Co-authored-by: Claude Opus 4.6 <[email protected]>
jimhester and others added 4 commits February 6, 2026 13:28
…rors (#51) (#59)

* Fix NA handling with na=character() and report type-coercion parse errors (#51)

Two related fixes that resolve 7 test skips:

1. NA handling: When na=character() (nothing should be NA), libvroom
   still treated empty fields as NA. Fixed by always passing null_values
   to libvroom (not skipping when empty), changing NullChecker default
   empty_is_null_ to false, and encoding na values correctly on the R
   side to distinguish character() from c("").

2. Parse error reporting: FastArrowContext::append_* functions silently
   pushed null on type coercion failures. Added TYPE_COERCION_FAILURE
   error code, error reporting fields to FastArrowContext, and wired up
   error collectors in both serial and parallel parsing paths. R-side
   post-processed types (logical, number, big_integer, factor, time)
   now also detect and report parse failures.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Apply NA handling fix to vroom_arrow and vroom_fwf_libvroom entry points

Remove the `if (!na_values.empty())` guard in both files to match
the fix already applied in vroom_new.cpp. Without this, na=character()
would not work correctly via the FWF and Arrow code paths.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Move strptime-style format string parsing from R-side DateTimeParser.h
into libvroom's FormatParser, making it usable standalone without R
dependencies. Add TIME data type and format-aware column builders.

Key changes:
- Add FormatLocale and FormatParser to libvroom (no R deps)
- Add TIME DataType and ArrowTimeColumnBuilder
- Add format-aware date/timestamp/time column builders
- Route all date/time parsing through FormatParser when available
- Support compact ISO8601 formats (HHMM, HH:MM without seconds)
- Pass R locale settings (month names, AM/PM, etc.) to FormatParser
- Remove 5 test skips that were blocking locale and format parsing tests

Co-authored-by: Claude Opus 4.6 <[email protected]>
)

Replace skip_leading_comment_lines() with skip_leading_blank_and_comment_lines()
which also skips blank and whitespace-only lines before the header row, matching
the legacy vroom parser behavior.

Co-authored-by: Claude Opus 4.6 <[email protected]>
Use an RLE-encoded Altrep string vector for the id column instead of
rep(), providing O(n_files) storage instead of O(n_rows) for single-file
reads and during per-file construction in multi-file reads.

Co-authored-by: Claude Opus 4.6 <[email protected]>
@jennybc
Copy link
Member

jennybc commented Feb 11, 2026

@jimhester I think your claude is running amok again 😅 (unless you mean to make and update this PR)

jimhester and others added 2 commits February 11, 2026 16:27
#52)

Add support for multi-character comment markers (e.g., "##", "//"),
inline/end-of-line comment stripping (quote-aware), and proper handling
of unmatched quotes within comment lines. This enables 5 previously
skipped tests to pass.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
In infer_fwf_types, after scanning past a comment line to the newline
character, the code did `continue` without advancing past the \n/\r\n
line ending. This could cause the next loop iteration to see the
newline as a zero-length data line. Match the newline-skipping pattern
already used in parse_fwf_chunk.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@jimhester jimhester force-pushed the issue-52-libvroom-improve-comment-handling-multi-char branch from 2ba6c5b to 883f24f Compare February 11, 2026 21:28
@jimhester jimhester closed this Feb 11, 2026
@jimhester
Copy link
Collaborator Author

Sorry Jenny :( It is like a naughty child sometimes, doesn't listen to directions 🤬

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants