feat: Add opt-in PCRE2 regex engine and Java-regex compatibility test suite by baibaichen · Pull Request #17685 · facebookincubator/velox

baibaichen · 2026-06-01T16:51:32Z

Proposal

Add PCRE2 as a second regex engine in Velox, enabled at compile time via a new CMake option (default OFF). The existing RE2 path is unchanged; users that need fuller java.util.regex semantics — primarily Spark / Presto front-ends running through Gluten — can opt in at build time.

The PR also ships:

a Java-regex-syntax translator library that also benefits RE2 (lifts RE2's OpenJDK 17 corpus pass-rate from 528/860 → 757/860, +27 pp), and
a multi-backend test harness that uses an embedded JVM as Java-semantic oracle.

Why a second engine, not a replacement

Past discussions (#9897, #9823) repeatedly raised the RE2 vs java.util.regex gap and, just as often, legitimate concerns about every proposed alternative. PCRE2 is qualitatively different from Oniguruma / Hyperscan, and this PR addresses each historical objection directly:

Five concerns, one by one

ReDoS / non-linear runtime — @mbasmanova on #9823: "It is not realistic to expect users to 'check the provided patterns carefully'."

PCRE2 has built-in LIMIT_MATCH / LIMIT_DEPTH / LIMIT_HEAP that the engine itself enforces. Our Pcre2Regex wrapper sets conservative defaults and exposes them as Options fields. Hitting a limit surfaces a typed error rather than running forever. This is the same protection Java 9+ added to java.util.regex. Oniguruma and Hyperscan do not have this.
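To make the idea of an engine-enforced budget concrete, here is a minimal standalone sketch. It is not PCRE2 and not the Pcre2Regex wrapper: it is a toy backtracking matcher (literals, `.` and `*` only) that counts steps and throws a typed error once a caller-supplied limit is exceeded. The names `BudgetedMatcher`, `MatchLimitExceeded`, and `matchLimit` are hypothetical. In real PCRE2 the corresponding knobs are set on a match context via `pcre2_set_match_limit`, `pcre2_set_depth_limit`, and `pcre2_set_heap_limit`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical typed error; stands in for whatever the real wrapper reports.
struct MatchLimitExceeded : std::runtime_error {
  MatchLimitExceeded() : std::runtime_error("match limit exceeded") {}
};

// Toy backtracking matcher: literals, '.', and postfix '*' only.
class BudgetedMatcher {
 public:
  explicit BudgetedMatcher(std::size_t matchLimit) : limit_(matchLimit) {}

  // Anchored full match. Throws MatchLimitExceeded when the budget runs out.
  bool fullMatch(std::string_view pattern, std::string_view text) {
    steps_ = 0;
    return matchHere(pattern, text);
  }

 private:
  bool matchHere(std::string_view p, std::string_view t) {
    if (++steps_ > limit_) {
      throw MatchLimitExceeded();  // fail fast instead of running unbounded
    }
    if (p.empty()) {
      return t.empty();
    }
    if (p.size() >= 2 && p[1] == '*') {
      // Try every repetition count for p[0], backtracking on failure.
      for (std::size_t i = 0;; ++i) {
        if (matchHere(p.substr(2), t.substr(i))) {
          return true;
        }
        if (i >= t.size() || (p[0] != '.' && p[0] != t[i])) {
          return false;
        }
      }
    }
    if (!t.empty() && (p[0] == '.' || p[0] == t[0])) {
      return matchHere(p.substr(1), t.substr(1));
    }
    return false;
  }

  std::size_t limit_;
  std::size_t steps_ = 0;
};
```

With a limit of 1000, `a*b` against `aaab` matches in a handful of steps, while `a*a*a*a*a*b` against thirty `a` characters exhausts the budget and raises `MatchLimitExceeded` instead of exploring every split.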

Oniguruma trades reliability for coverage — @mbasmanova on #9897.

We do not ship Oniguruma.

Hyperscan effectively abandoned — @philo-he on #9823.

We do not ship Hyperscan. PCRE2 by contrast is actively maintained (PHP, Apache HTTPD, nginx, git all depend on it; releases roughly every 6 months).

Velox SQL is single-pattern per row; multi-pattern engines don't pay off — @mbasmanova on #9823.

PCRE2 is a single-pattern engine, so it matches that constraint.

User must opt in if accepting any risk — @FelixYBW on #9823: "In Gluten we use a config to enable re2 offload. Gluten user needs to take the risk if they enable."

The new CMake option VELOX_ENABLE_PCRE2_BACKEND defaults to OFF, following the same opt-in pattern. A default cmake ... invocation adds zero new dependencies.

What's in the PR

1. PCRE2 engine wrapper — `velox/external/regex_compat/Pcre2Regex.{h,cpp}`

Production-quality wrapper around PCRE2 with a re2::RE2-shaped surface so call sites that opt in can switch with minimal churn:

JIT-compiles on construction (falls back to interpreter where unavailable).
Honors Options.{caseSensitive, dotNl, oneLine, ...} matching Velox's existing Re2Regex options.
Sets PCRE2's built-in match / depth / heap limits to safe defaults; exposed as Options fields so callers can tune them. Hitting a limit surfaces a typed error, not a hang.
Integrates the translator below: every Java pattern goes through toPcre2Pattern before pcre2_compile, so Java-specific syntax (\p{InGreek}, \p{javaXxx}, (?U) flag inversion, char-class intersection, …) just works.

2. Java→PCRE2/RE2 translator — `velox/functions/lib/java_pcre2_translator/`

Reusable C++ library that rewrites a java.util.regex pattern into PCRE2- or RE2-compatible syntax:

1:1 port of pcre4j PR alexey-pelykh/pcre4j#606, "(feat) regex: java.util.regex compatibility translator + :compat-test harness (68% -> 93.7%)". I authored that pcre4j PR (wrote the org.pcre4j.regex.translate module from scratch and contributed it upstream); pcre4j does not require a CLA, so I retain copyright on that contribution and am dual-licensing the C++ port here under Apache-2.0. Per-file headers carry the attribution; LICENSE-NOTICE.md documents provenance.
Adds toRe2Pattern that reuses the same parser/evaluator/property map and additionally rejects features RE2 fundamentally cannot represent (lookaround, backref, possessive, atomic, (?U) semantic inversion) — throws EvaluationFailedException rather than silently changing semantics.
Builds unconditionally (it's a functions/lib/ member); production call sites can adopt it independently of the PCRE2 backend.
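As a rough illustration of the "reject rather than silently change semantics" behaviour described for toRe2Pattern, the sketch below scans a pattern and throws when it sees lookaround, a backreference, a possessive quantifier, or an atomic group. It is not the translator's code: `rejectIfNotRe2Representable` is a hypothetical name, the `EvaluationFailedException` defined here is only a stand-in for the class of the same name in the PR, and the scan ignores `\Q...\E` quoting and the `(?U)` flag.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Stand-in for the exception type named in the PR; the real class may differ.
struct EvaluationFailedException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical pre-flight scan. Throws when `p` uses a construct RE2 cannot
// represent. Simplified: ignores \Q...\E quoting and the (?U) flag.
inline void rejectIfNotRe2Representable(std::string_view p) {
  int classDepth = 0;            // nesting level inside [...]
  bool afterQuantifier = false;  // previous token was *, +, ? or {n,m}
  bool inBrace = false;          // inside a {n,m} quantifier
  for (std::size_t i = 0; i < p.size(); ++i) {
    const char c = p[i];
    if (c == '\\') {
      const char next = (i + 1 < p.size()) ? p[i + 1] : '\0';
      if (classDepth == 0 && ((next >= '1' && next <= '9') || next == 'k')) {
        throw EvaluationFailedException("backreference");
      }
      ++i;  // skip the escaped character
      afterQuantifier = false;
    } else if (classDepth > 0) {
      // Quantifier characters are literals inside a class; only track nesting.
      if (c == '[') ++classDepth;
      if (c == ']') --classDepth;
    } else if (c == '[') {
      classDepth = 1;
      afterQuantifier = false;
    } else if (c == '(') {
      const std::string_view head = p.substr(i, 4);
      if (head.substr(0, 3) == "(?=" || head.substr(0, 3) == "(?!" ||
          head == "(?<=" || head == "(?<!") {
        throw EvaluationFailedException("lookaround");
      }
      if (head.substr(0, 3) == "(?>") {
        throw EvaluationFailedException("atomic group");
      }
      if (head.substr(0, 2) == "(?") ++i;  // do not read '?' as a quantifier
      afterQuantifier = false;
    } else if (c == '*' || c == '+' || c == '?') {
      if (afterQuantifier && c == '+') {
        throw EvaluationFailedException("possessive quantifier");
      }
      afterQuantifier = !afterQuantifier;  // a lazy '?' closes the quantifier
    } else if (c == '{') {
      inBrace = true;
      afterQuantifier = false;
    } else if (c == '}' && inBrace) {
      inBrace = false;
      afterQuantifier = true;
    } else {
      afterQuantifier = false;
    }
  }
}
```

Under these simplifications, `(?<name>a+)b*` and `[a-z&&[^aeiou]]+` pass the scan, while `a(?=b)`, `(a)\1`, `a*+`, and `(?>a)` are rejected with a typed exception.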

Headline result for RE2: when the existing Re2Regex is wired through this translator, RE2's OpenJDK 17 corpus pass-rate jumps from 528/860 (61.40%) to 757/860 (88.02%) without changing the engine. That benefit alone — for anyone who wants to stay on RE2 — is the second reason to merge.

3. Test harness — `velox/external/regex_compat/tests/`

Three-backend GTest framework (Re2Regex, Pcre2Regex, JavaRegex via embedded JVM as oracle), parameterised via TYPED_TEST_SUITE. Runs:

OpenJDK 17 corpus (4 files, 860 + 5 cases) fetched at CMake configure time, SHA-pinned to jdk-17.0.13-ga.
96 ported tests from OpenJDK RegExTest.java.
178 translator unit tests (1:1 from pcre4j).

A TestMain guard asserts the Java backend remains 100% and loudly warns otherwise — Java is java.util.regex, so any failure there means our JNI bridge / adapter is wrong, not an engine difference. The harness is gated behind VELOX_ENABLE_REGEX_COMPAT_TESTS=OFF with a further sub-option VELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=OFF that auto-disables if no JDK is found.

Default-build impact

Zero. With the default cmake ... invocation:

No PCRE2 download or link.
No JNI / JDK probe.
No new code in any existing target.

# default — nothing changes
cmake -S . -B build

# enable PCRE2 engine for production use (adds PCRE2 10.47 via FetchContent)
cmake -S . -B build -DVELOX_ENABLE_PCRE2_BACKEND=ON

# also run the compatibility test suite
cmake -S . -B build -DVELOX_ENABLE_PCRE2_BACKEND=ON \
                    -DVELOX_ENABLE_REGEX_COMPAT_TESTS=ON

# include the Java JNI oracle in the test suite (needs JDK >= 17)
cmake -S . -B build -DVELOX_ENABLE_PCRE2_BACKEND=ON \
                    -DVELOX_ENABLE_REGEX_COMPAT_TESTS=ON \
                    -DVELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=ON

Motivation — gap is documented, recurring, and previously unmeasured

Support Oniguruma-based regex functions. #9897 (open): @spershin lists 8 concrete RE2-vs-Java semantic discrepancies seen in production Presto workloads. Status: prototype only, no shipped fix in either direction.
why velox supports regex function using re2 instead of hyperscan. #9823 (closed) — repeated public acknowledgement that "re2 patterns and results aren't 100% compatible with Spark" ( @FelixYBW ), with maintainers explicitly noting that no proposed alternative met Velox's reliability bar.
fix(regexp_replace): Move regex preprocessing to functions/lib for Spark reuse and fix backslash handling #10981 (merged) — moved prepareRegexpReplacePattern / prepareRegexpReplaceReplacement into functions/lib for Spark reuse. That helper covers one rewrite ((?<name>) → (?P<name>)); everything else (char-class intersection, \p{InGreek}, \p{javaXxx}, (?U) semantic inversion, \Q\E edges, …) silently falls on the floor.
fix(spark): Fix regexp_replace with backslash-letter replacements #17578 (merged) / Spark's regexp_replace drops LF bytes when input is a NativeScan string column #17577 (closed) — most recent silent-data-correctness regression in regexp_replace. Root cause: the one-shot preprocessing introduced in fix(regexp_replace): Move regex preprocessing to functions/lib for Spark reuse and fix backslash handling #10981 had no compatibility regression set behind it. Exactly the class of bug a Java-corpus regression suite catches up front.
ExpressionFuzzer fails on like: Max number of regex reached #7824 / kMaxCompiledRegexes leads to non-deterministic SparkSQL regexp_replace behavior #8438 / fix: Use EvictingCacheMap for compiled regular expressions #15953 — recurring kMaxCompiledRegexes / cache-eviction issues. No shared regression corpus existed to validate fixes; this PR provides one.
refactor: Replace boost regex with re2 #15134 / fix: Replace boost::regex with RE2 in parse_duration function #15124 / fix: Replace boost::regex with RE2 regex in url_extract_parameter and fb_url_extract_parameter Presto function in Velox #14681 / build: Fix TpcdsConnector error "'this' pointer is null" #14878 / fix: Compile RE2 regex once in isColumnNameRequiringEscaping #16973 — the long stream of "replace boost::regex with RE2" migrations; each PR needed ad-hoc per-call-site verification because no shared compatibility harness existed.

Validation

OpenJDK 17 corpus (860 cases)             Java   860 / 860  100.00%
                                          Pcre2  860 / 860  100.00%   ←  drop-in Java parity
                                          Re2    757 / 860   88.02%   (98.18% on translatable subset *)
                                          Re2 raw (no translator)  528 / 860  61.40%

OpenJDK RegExTest.java (96 ported)        Java    96 /  96  100.00%
                                          Pcre2   94 /  96   97.92%
                                          Re2     84 /  96   87.50%   (90.32% on translatable subset *)

Translator unit tests                                        178 / 178 passing

* translatable subset = the test set excluding patterns the RE2 translator deliberately rejects as engine-impossible (lookaround, backref, possessive quantifiers, atomic groups, (?U) semantic inversion). RE2's algorithm guarantees linear time and intentionally does not support these constructs.

Net takeaway for users that opt in to PCRE2: effective Java parity (100% on the OpenJDK corpus, 97.92% on RegExTest), with PCRE2's own resource-cap protection enforcing predictable runtime.

Net takeaway for users that stay on RE2: the translator alone, applied to RE2, picks up 229 corpus cases (+27 pp) without touching the engine or its linear-time guarantee.

Out of scope

We do not switch any existing production call site to PCRE2 in this PR. That belongs in separate per-call-site PRs with their own justification, perf data, and ReDoS-limits review.
We do not change RE2's defaults or its kMaxCompiledRegexes behaviour. Issues ExpressionFuzzer fails on like: Max number of regex reached #7824 / kMaxCompiledRegexes leads to non-deterministic SparkSQL regexp_replace behavior #8438 / fix: Use EvictingCacheMap for compiled regular expressions #15953 are out of scope; this PR's corpus is the regression baseline a future fix can use.
\b{g} (grapheme cluster boundary) and CANON_EQ full canonical-equivalence expansion are PCRE2 engine ceilings; documented in code comments. They are the only meaningful items not at 100% in the PCRE2 column above.

File-level overview

velox/external/regex_compat/                 ~3.0 kLOC   PCRE2 backend + test harness (default OFF)
  Re2Regex / Pcre2Regex / JavaRegex          three parallel concrete classes, re2::RE2-shaped API
  JavaMatcherAdapter                         re-builds java.util.regex.Matcher state machine
  BackendTestBase, TestMain                  TYPED_TEST plumbing + Java=100% guard
  tests/                                     OpenJDK corpus + ported pcre4j tests + RegExTest port

velox/functions/lib/java_pcre2_translator/   ~2.4 kLOC   reusable translator library
  PropertyMap / JdkPropertyExpander          \p{InGreek} / \p{javaXxx} / \p{IsXxx} rewriting
  ClassBodyParser / Evaluator / ClassRenderer  char-class AST + intersection resolution
  RangeSet                                   code-point range set algebra
  JavaRegexTranslator                        toPcre2Pattern / toRe2Pattern pipelines
  tests/                                     178 unit tests, 1:1 from pcre4j PR #606

Total: ~5.4 kLOC added; 0 lines changed in any existing source file outside CMake wiring.

Licensing

The translator port derives from pcre4j PR alexey-pelykh/pcre4j#606, which I authored (the entire org.pcre4j.regex.translate module was a from-scratch contribution by me to upstream pcre4j). pcre4j does not require a Contributor License Agreement, so copyright on my contribution remains mine, and I am dual-licensing it under Apache-2.0 for inclusion in this Apache-2.0 project.

The pcre4j upstream files carry a Copyright (C) 2024-2026 Oleksii PELYKH header — that is pcre4j's project-wide convention applied to all merged contributions, not a copyright assignment from contributors. The actual authorship of the translate module is mine and the Apache-2.0 grant here is therefore valid.

Per-file attribution headers in this PR retain the pcre4j provenance; LICENSE-NOTICE.md in the translator directory documents the full chain.


github-actions · 2026-06-01T16:56:37Z

Build Impact Analysis

Full build recommended. Files outside the dependency graph changed:

velox/external/regex_compat/CMakeLists.txt
velox/external/regex_compat/JavaRegex.cpp
velox/external/regex_compat/JavaRegex.h
velox/external/regex_compat/JvmFixture.cpp
velox/external/regex_compat/JvmFixture.h
velox/external/regex_compat/Pcre2Regex.cpp
velox/external/regex_compat/Pcre2Regex.h
velox/external/regex_compat/README.md
velox/external/regex_compat/Re2Regex.cpp
velox/external/regex_compat/Re2Regex.h
... and 18 more

These directories are not fully covered by the dependency graph. A full build is the safest option.

cmake --build _build/release

Slow path • Graph generated from PR branch

Adds:
* CMake/Findpcre2.cmake: system libpcre2-8 detection (CONFIG → pkg-config)
* CMake/resolve_dependency_modules/pcre2.cmake: FetchContent fallback (PCRE2 10.45, 8-bit only, JIT on)
* VELOX_ENABLE_REGEX_COMPAT_TESTS option (default OFF)
* conditional velox_set_source/resolve_dependency(pcre2) wiring
Verified:
* default OFF config succeeds, no PCRE2 in build graph
* -DVELOX_ENABLE_REGEX_COMPAT_TESTS=ON fetches sha256 0e138387df7835d7403b8351e2226c1377da804e0737db0e071b48f07c9d12ee and links libpcre2-8.a (965 KB)
This is the build-infrastructure layer of the design in docs/superpowers/specs/2026-05-29-pcre2-cpp-test-suite-design.md. The helper layer and ported tests follow in subsequent commits.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Verified: * curl + sha256sum on pcre2-10.47.tar.gz = c08ae2388ef333e8403e670ad70c0a11f1eed021fd88308d7e02f596fcd9dc16 * FetchContent + build pcre2-8-static target succeeds * libpcre2-8.a links cleanly (980 KB, JIT enabled) * configure reports PACKAGE_VERSION='10.47' Spec and plan docs updated in sync; fallback target on build failure noted as 10.45. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…X_COMPAT_JAVA_BACKEND)
When VELOX_ENABLE_REGEX_COMPAT_TESTS=ON, also try find_package(JNI):
* found → log STATUS, enable an embedded-JVM Java backend in the forthcoming regex-compat test suite (3rd backend / live oracle)
* not found → log WARNING and auto-disable, suite still builds with PCRE2 + RE2 only
Default ON, fully gated under the (default-OFF) parent option, so stock Velox builds are unaffected. This is the only place in upstream Velox that touches JNI.
Verified:
* parent OFF → JNI probe never runs
* parent ON, JDK present → 'Regex-compat: enabling embedded-JVM Java backend (JNI: /usr/lib/jvm/default-java/include;...)' appears
* parent ON, this option explicit OFF → probe skipped silently
Auto-degrade branch (JNI absent) is implemented per standard CMake idiom but not exercised in this commit's verification (find_package falls back to /usr/lib/jvm/* system paths even with bogus JAVA_HOME).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

First of three regex backends for the Java-regex compatibility test suite (Option E: three parallel concrete classes, no virtual, names shaped after re2::RE2 subset, semantics aligned with java.util.regex).
Files:
* velox/external/regex_compat/RegexTypes.h: shared Anchor + Options
* velox/external/regex_compat/Re2Regex.h: public class
* velox/external/regex_compat/Re2Regex.cpp: RE2 wrap, reuses Velox prepareRegexpReplacePattern / prepareRegexpReplaceReplacement (Re2Functions.h:402,422) for Java -> RE2 translation
* velox/external/regex_compat/CMakeLists.txt: opt-in static lib
* velox/external/regex_compat/tests/Re2RegexTest.cpp: 11 GTest cases
* velox/CMakeLists.txt: add_subdirectory under VELOX_ENABLE_REGEX_COMPAT_TESTS
Verified (logic-level): standalone compile + smoke-test outside the full Velox build (system libre2 + stubbed prepare*) shows 17/18 PASS on the subset that doesn't depend on Java-syntax translation; the 1 remaining case was a test-assertion bug (RE2 error text is 'invalid perl operator: (?=' not 'lookahead'), already fixed in this commit.
Full in-tree CMake build of velox_regex_compat_test is unverified in this session: the from-scratch Velox build (boost/folly/abseil...) takes hours, exceeding session budget. Next session should:
  cmake -S . -B build -GNinja -DVELOX_ENABLE_REGEX_COMPAT_TESTS=ON \
        -DVELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=OFF
  cmake --build build --target velox_regex_compat_test
  ctest --test-dir build -R velox_regex_compat_test --output-on-failure
Pcre2Regex + JavaRegex backends follow in subsequent commits.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Second of three regex backends. Same Option E architecture as Re2Regex: standalone concrete class, method names mirror re2::RE2, input accepts Java java.util.regex syntax.
Files:
* Pcre2Regex.h / .cpp: backend using pcre2_compile_8/match_8 with JIT
* tests/Pcre2RegexTest.cpp: 12 GTest cases including lookahead + backref demos (features PCRE2 supports but RE2 doesn't)
Key design choices:
* PCRE2_UTF + PCRE2_UCP for Unicode-aware semantics
* pcre2_jit_compile_8 with PCRE2_JIT_COMPLETE; falls back to interpreter on platforms without JIT
* pcre2_substitute_8 with PCRE2_SUBSTITUTE_GLOBAL + PCRE2_SUBSTITUTE_EXTENDED: natively handles Java $N / ${name} replacement syntax, no translation layer needed
* Java pattern syntax (?<name>...) accepted natively by PCRE2; no pre-flight scanner or translation layer in this class (Java-specific features like \p{InGreek}, character-class intersection, or (?U) flag will surface as test failures, documenting the need for a future Java->PCRE2 translator; cf. pcre4j PR #606)
Test run (in-tree):
  $ ctest --test-dir cmake-build-debug -R velox_regex_compat_test
  [==========] 23 tests from 2 test suites ran. (2 ms total)
  [ PASSED ] 23 tests.
- 11 Re2RegexTest (incl. javaNamedGroup + globalReplace via Velox prepareRegexpReplacePattern / prepareRegexpReplaceReplacement)
- 12 Pcre2RegexTest (incl. lookaheadSupported + backrefSupported demonstrating PCRE2's expanded language coverage)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Third and final regex backend. Uses JNI_CreateJavaVM to boot an embedded JVM in the test process and drives java.util.regex.Pattern / Matcher through JNI calls. Java is the canonical source of truth for the Java-regex semantics that the other two backends approximate.
Files:
* JvmFixture.h / .cpp: process-singleton JVM owner; GTest GlobalEnvironment registered via JvmFixture::Register() in main
* JavaRegex.h / .cpp: JavaRegex class; constructs by calling Pattern.compile, runs find()/matches()/lookingAt()/replaceAll via cached jmethodIDs
* tests/TestMain.cpp: entry point that registers the JVM fixture when VELOX_REGEX_COMPAT_HAS_JAVA=1
* tests/JavaRegexTest.cpp: 13 GTest cases including javaSpecificPropertyInLC (\p{InGreek}) which works natively in Java but won't in PCRE2 without a future Java->PCRE2 translator
* CMakeLists.txt: conditional source list controlled by VELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND (uses velox_compile_definitions + velox_include_directories, because target_* fails on ALIAS targets)
Notable JNI details:
* Class refs (Pattern, Matcher, Map, Set, Iterator, Entry, Integer, String) cached as NewGlobalRef
* jmethodIDs cached at first construction via std::call_once
* Pattern.namedGroups() (JDK 20+) probed optionally; falls back to empty named map if unavailable
* Match() materialises input[0..endpos) to a Java String (Java's Matcher needs a CharSequence; we'd need a custom CharSequence impl to avoid the copy, not worth it for tests)
* GlobalReplace counts matches via find()-loop before calling replaceAll, because the public API doesn't expose the substitution count
Test run (in-tree, JNI ON):
  $ cmake-build-debug/.../velox_regex_compat_test
  [==========] 36 tests from 3 test suites ran. (193 ms total)
  [ PASSED ] 36 tests.
- 11 Re2RegexTest (RE2 via Velox prepare* translation)
- 12 Pcre2RegexTest (PCRE2 native Java-syntax acceptance)
- 13 JavaRegexTest (java.util.regex via embedded JVM)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ase 7)
GTest TYPED_TEST_SUITE templated on a type list of all enabled backends. One TYPED_TEST() declaration compiles into N test instances at compile time (N = 2 if JNI off, 3 if JNI on), so a single test body validates that all three engines deliver identical behaviour for the same Java input.
Files:
* tests/BackendTestBase.h: AllBackends typelist + BackendTest fixture
* tests/BackendTypedTest.cpp: 13 typed cases covering compile, match (unanchored/anchored), find, fullPartialMatch, globalReplace (both $N and ${name}), case-insensitive, dotAll, multiline anchors, empty-group sentinel
Also fixes a Re2Regex bug discovered by the new multiline typed test:
* Java MULTILINE doesn't map to any single RE2 Options bit; RE2's `one_line=false` (the default) still requires inline `(?m)` to make `^`/`$` match around \n. Re2Regex now prefixes `(?m)` when `opt.oneLine == false`, mirroring Java's MULTILINE semantics additively without affecting `.` or other metas.
Test result:
  [==========] 75 tests from 6 test suites ran. (86 ms total)
  [ PASSED ] 75 tests.
- 11 Re2RegexTest (Re2-specific)
- 12 Pcre2RegexTest (PCRE2-specific incl. lookahead/backref)
- 13 JavaRegexTest (Java-specific incl. \p{InGreek})
- 39 BackendTest/*: 13 typed-cases x 3 backends, cross-checked
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds JavaMatcherAdapter<R> (test-only header) reconstructing the Java Matcher state machine (find/group/start/end/replaceAll/replaceFirst) on top of the backend's stateless Match() API. Then ports representative cases from pcre4j's PatternTests.java, MatcherMatchingTests.java, and MatcherReplacementTests.java as typed tests that run against every enabled backend.
Files added:
* tests/JavaMatcherAdapter.h
* tests/PatternPortedTest.cpp (13 cases x N backends)
* tests/MatcherMatchingPortedTest.cpp (14 cases x N backends)
* tests/MatcherReplacementPortedTest.cpp (11 cases x N backends)
Fixes a real bug in JavaRegex along the way:
* splitUnicodeDelimiters surfaced that Java's Matcher.region() takes UTF-16 char offsets, not UTF-8 bytes, and Matcher.start()/end() returns char offsets too. Added javaCharOffsetToByteOffset and byteOffsetToJavaCharOffset helpers in JavaRegex.cpp and routed region() / start() / end() through them so the backend correctly handles non-ASCII input.
Test result (in-tree, JNI ON):
  [==========] 189 tests from 15 test suites ran. (97 ms total)
  [ PASSED ] 189 tests.
Breakdown:
- 11 Re2RegexTest (RE2 backend specifics)
- 12 Pcre2RegexTest (PCRE2 specifics incl. lookahead/backref)
- 13 JavaRegexTest (Java specifics incl. \p{InGreek})
- 39 BackendTest (13 typed x 3 backends, core API)
- 39 PatternPortedTest (13 typed x 3 backends, pcre4j Pattern)
- 42 MatchingPortedTest (14 typed x 3 backends, pcre4j Matcher.find)
- 33 ReplacementPortedTest (11 typed x 3 backends, pcre4j replace*)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ally
Ports the 15 remaining cases from pcre4j MatcherMatchingTests.java (region-bounded find/matches/lookingAt, multiple-find-in-region, empty/unmatched group, lookaround, zero-width edges). Also extends JavaMatcherAdapter with region(int,int) and honors region bounds in find/matches/lookingAt.
Test methodology change per user requirement: tests now assert Java-canonical behaviour with no `if constexpr` per-backend branching. Backends that diverge from Java just fail; failures are the *data*, not bugs. New TestMain reports a per-backend compatibility rate at end of run:
  ========== Per-backend compatibility rate ==========
  JavaRegex  (typed)  67/67 (100%)   ← ground truth
  Pcre2Regex (typed)  66/67 (98.5%)  ← braceQuantifierIncomplete
  Re2Regex   (typed)  63/67 (94.0%)  ← lookaround x2 + brace + zw-region
  ====================================================
The 5 FAIL cases are real engine differences:
* braceQuantifierIncomplete: Java rejects 'a{'; PCRE2/RE2 accept literally
* positiveLookaround / positiveUnmatchedLookaround: RE2 lacks lookaround
* findWithZeroWidthMatchExhaustsRegion: RE2/PCRE2 $ ignores region end
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add appendReplacement/appendTail/quoteReplacement to JavaMatcherAdapter.
- Port 24 cases from pcre4j MatcherReplacementTests.java covering quoteReplacement, replaceAll, replaceFirst, appendReplacement walks, named groups, escaped chars, error cases.
- Fix JavaRegex.toJString/fromJString to transcode through UTF-16 instead of using NewStringUTF/GetStringUTFChars (modified-UTF-8), which mis-encoded 4-byte UTF-8 sequences (supplementary chars). Round-trips emoji like U+1F310 / U+1F30D correctly now.
- Add Java=100% guard in TestMain that loudly warns on stderr if the Java backend ever drops below 100%.
Current matrix:
  JavaRegex   94/94 (100%)
  Pcre2Regex  93/94 (98.9%): brace strictness
  Re2Regex    90/94 (95.7%): lookaround + brace + zero-width region
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Java's Matcher.results() returns a Stream<MatchResult>; modelled here as a find()-loop that snapshots (start, end, group) per match. Skipped 1 case (resultsStreamOperations) that only exercises Java's stream API. Matrix: Java 106/106 (100%) Pcre2 105/106 (99.06%) Re2 101/106 (95.28%) — +1 fail on resultsZeroWidthMatches (lookahead) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…f 9) Most MatcherMatchResultTests.java cases exercise Java's MatchResult snapshot semantics (immutability, IllegalStateException/IOOBE/IAE contracts, namedGroups() map equality, hasMatch() flag) — pure Java API-contract tests, no regex-engine signal. Ported only the two cases that exercise engine behavior not already covered: * matchResultByGroupNumber — 3 sibling captures, full groupCount/group(i)/start(i)/end(i) sweep. * matchResultNamedGroupAccessors — 3 named groups in a date pattern. Adapter: add start(name)/end(name) overloads to mirror group(name). Matrix: Java 108/108 (100%) Pcre2 107/108 (99.07%) Re2 103/108 (95.37%) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Java Pattern.split implemented as a free helper that drives the backend's find() loop via JavaMatcherAdapter, so any engine differences in find/match propagate naturally into split output. Skipped: * splitWithDelimiters* (3 cases) — Java 21+ API, not in our embedded JDK 17. Matrix: Java 121/121 (100%) Pcre2 120/121 (99.17%) Re2 116/121 (95.87%) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
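The find()-loop approach to split can be sketched as follows. This is an illustration under stated assumptions, not the helper from the PR: `javaSplit` and `FindFn` are hypothetical names, the backend is abstracted as a callback returning the first match at or after a byte offset, and only the `limit == 0` form of `Pattern.split` is modelled (trailing empty strings dropped, no empty leading piece for a zero-width match at offset 0).

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical backend hook: first match at or after `from`, returned as
// [start, end) byte offsets, or nullopt when there is no further match.
using FindFn = std::function<std::optional<std::pair<std::size_t, std::size_t>>(
    std::string_view input, std::size_t from)>;

// Sketch of Pattern.split(input) with limit == 0, driven by a find() loop.
inline std::vector<std::string> javaSplit(
    std::string_view input, const FindFn& find) {
  std::vector<std::string> out;
  std::size_t index = 0;  // end of the previous delimiter
  std::size_t from = 0;   // where the next search starts
  bool matched = false;
  while (from <= input.size()) {
    const auto m = find(input, from);
    if (!m) {
      break;
    }
    const auto [start, end] = *m;
    // Bump by one byte after a zero-width match to guarantee progress. A real
    // UTF-8 implementation would advance by a whole code point instead.
    from = (end == start) ? end + 1 : end;
    // A zero-width match at offset 0 never yields an empty leading piece.
    if (index == 0 && start == 0 && end == 0) {
      continue;
    }
    out.emplace_back(input.substr(index, start - index));
    index = end;
    matched = true;
  }
  if (!matched) {
    return {std::string(input)};  // no delimiter found: the input is the only piece
  }
  out.emplace_back(input.substr(index));
  while (!out.empty() && out.back().empty()) {
    out.pop_back();  // limit == 0 drops trailing empty strings
  }
  return out;
}
```

Because the splitter only sees match offsets, any difference in how a backend finds matches shows up directly in the split output, which is the property the commit relies on.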

Multi-byte UTF-8 literals (1/2/3/4-byte chars) + combined + region. Offsets translated from Java UTF-16 char positions to UTF-8 byte positions (e.g., region(3,5) over the surrogate pair for 🌍 → byte region(7,11)). Matrix: Java 127/127 (100%) Pcre2 126/127 (99.21%) Re2 122/127 (96.06%) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
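The offset translation mentioned here can be sketched with a single forward scan. The function name `utf16OffsetToByteOffset` is hypothetical (it only plays the role the commits assign to javaCharOffsetToByteOffset), and the input is assumed to be well-formed UTF-8. Each 1- to 3-byte sequence counts as one UTF-16 unit and each 4-byte sequence as two (a surrogate pair).

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: maps a Java (UTF-16 code unit) offset to a UTF-8 byte
// offset. Assumes `utf8` is well-formed. An offset that falls between the two
// halves of a surrogate pair is rounded up to the end of that code point.
inline std::size_t utf16OffsetToByteOffset(
    std::string_view utf8, std::size_t charOffset) {
  std::size_t byte = 0;
  std::size_t units = 0;
  while (byte < utf8.size() && units < charOffset) {
    const unsigned char lead = static_cast<unsigned char>(utf8[byte]);
    // Sequence length is determined by the lead byte.
    const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    units += (len == 4) ? 2 : 1;  // supplementary code points take two units
    byte += len;
  }
  return byte;
}
```

For an illustrative input such as "éé€🌍" (2 + 2 + 3 + 4 bytes; the test string in the PR is not shown, so this one is made up), the Java region (3, 5) covering the surrogate pair maps to the byte region (7, 11), matching the example in the commit message.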

…aining)
Added 6 PatternTests.java cases:
* quote: Pattern.quote + \Q...\E round-trip
* commentsWhitespaceIgnored: (?x) ignores unescaped spaces
* commentsHashComments: (?x) # to end-of-line is a comment
* commentsEscapedWhitespace: \ <space> matched literally
* commentsWhitespaceInCharacterClass: [\ ] matches literal space
* commentsEmbeddedFlag: (?x) at start of pattern
PatternTests.java cases intentionally NOT ported (out of engine-compat scope):
* toStringReturnsPattern, asPredicate*/asMatchPredicate*/splitAsStream: Java functional/utility API only, no engine behavior.
* canonEq* (24 cases): CANON_EQ flag explicitly out of scope per spec section 1.2.
* unicodeCase*/flag-introspection (~7): Java's UNICODE_CASE/UNICODE_CHARACTER_CLASS/UNIX_LINES need backend-specific flag plumbing (RE2/PCRE2 don't have inline-flag equivalents with the same semantics); skip rather than build a partial framework that can't represent the divergence faithfully.
* matchesStatic*: duplicate of existing FullMatch coverage.
* translatorThrownPatternSyntaxException: pcre4j internal.
Matrix:
  Java   133/133 (100%)
  Pcre2  132/133 (99.25%): brace strictness
  Re2    123/133 (92.48%): lookaround, brace, zero-width region, 4 (?x) sub-cases
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fetches OpenJDK 17 (jdk-17.0.13-ga) regex test corpus at CMake configure time (SHA256 pinned), parses it mirror-faithfully to OpenJDK's own RegExTest.processFile / grabLine, and reports per-backend compat rate. Comparison approach: rebuild the result string in the exact shape OpenJDK's RegExTest produces ("true <g0> <gc> <g1> <g2>..." or "false <gc>") and string-equal it against the corpus expected line. This sidesteps ambiguous whitespace-in-match-text parsing. processEscapes mirrors OpenJDK grabLine: only \n and \uXXXX are processed (everything else passes through as a regex meta-escape). Surrogate pairs \uD8##\uDC## are combined into the supplementary code point and encoded as 4-byte UTF-8, so RE2/PCRE2 see valid UTF-8 and JavaRegex's JNI bridge re-splits to a surrogate pair. Patterns whose expected line begins with "error" map to compile-error in our backends → counted as PASS (Java agrees they're invalid). OpenJDK corpus matrix: Java 299/299 (100%) Pcre2 217/299 (72.58%) [36 compile-err on Java-specific syntax] Re2 170/299 (56.86%) [83 compile-err: lookaround, backrefs, ...] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
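A minimal sketch of the escape handling described above: `\n` and `\uXXXX` are decoded, a high/low surrogate pair is combined into one supplementary code point and written as 4-byte UTF-8, and everything else passes through untouched. The names `processEscapesSketch`, `parseHex4`, and `appendUtf8` are hypothetical, and edge-case behaviour (malformed hex digits, unpaired surrogates) is my assumption rather than something the PR specifies.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Appends one code point to `out` as UTF-8.
inline void appendUtf8(std::string& out, char32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Parses four hex digits starting at s[pos]; returns -1 if they are missing.
inline int parseHex4(std::string_view s, std::size_t pos) {
  if (pos + 4 > s.size()) return -1;
  int v = 0;
  for (std::size_t k = 0; k < 4; ++k) {
    const char c = s[pos + k];
    int d;
    if (c >= '0' && c <= '9') d = c - '0';
    else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
    else return -1;
    v = v * 16 + d;
  }
  return v;
}

// Hypothetical sketch: decodes only \n and \uXXXX. Like a plain text search,
// it does not treat "\\" specially. Assumption: an unpaired surrogate is
// emitted in 3-byte form, which a real implementation may want to reject.
inline std::string processEscapesSketch(std::string_view in) {
  std::string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const bool esc = in[i] == '\\' && i + 1 < in.size();
    if (esc && in[i + 1] == 'n') {
      out += '\n';
      i += 2;
      continue;
    }
    const int hi = (esc && in[i + 1] == 'u') ? parseHex4(in, i + 2) : -1;
    if (hi < 0) {
      out += in[i++];  // not a recognised escape: pass through unchanged
      continue;
    }
    char32_t cp = static_cast<char32_t>(hi);
    i += 6;
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < in.size() &&
        in[i] == '\\' && in[i + 1] == 'u') {
      const int lo = parseHex4(in, i + 2);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // Combine the surrogate pair into one supplementary code point.
        cp = 0x10000 + ((cp - 0xD800) << 10) + (static_cast<char32_t>(lo) - 0xDC00);
        i += 6;
      }
    }
    appendUtf8(out, cp);
  }
  return out;
}
```

For example, the corpus-style text `\uD83C\uDF0D` decodes to U+1F30D and is emitted as the four bytes F0 9F 8C 8D, while a regex escape such as `\d+` is left exactly as written.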

OpenJDK's RegExTest.processFile runs three corpus files; we were only loading TestCases.txt. Now loads & reports BMPTestCases.txt and SupplementaryTestCases.txt as well (per-file rates + per-backend aggregate).
Per-file breakdown:
  Java|TestCases.txt                299/299 (100.00%)
  Java|BMPTestCases.txt             222/222 (100.00%)
  Java|SupplementaryTestCases.txt   339/339 (100.00%)
  Pcre2|TestCases.txt               217/299 (72.58%)
  Pcre2|BMPTestCases.txt            154/222 (69.37%)
  Pcre2|SupplementaryTestCases.txt  176/339 (51.92%)
  Re2|TestCases.txt                 170/299 (56.86%)
  Re2|BMPTestCases.txt              126/222 (56.76%)
  Re2|SupplementaryTestCases.txt    232/339 (68.44%)
Aggregate (all 860 cases × 3 backends):
  Java   860/860 (100.00%)
  Pcre2  547/860 (63.60%)
  Re2    528/860 (61.40%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

New library velox/functions/lib/java_pcre2_translator/ (will host the C++ port of pcre4j PR #606's org.pcre4j.regex.translate module).
This commit adds:
* CMakeLists.txt registering velox_java_pcre2_translator
* JavaRegexTranslator.{h,cpp}: public surface declaration plus a Phase-1 identity-passthrough implementation of toPcre2Pattern
* EvaluationFailedException.h
* LICENSE-NOTICE.md documenting the pcre4j → Velox re-licensing
* tests/JavaRegexTranslatorTest.cpp: smoke test (identity + ctor)
* Wires the new subdir into velox/functions/lib/CMakeLists.txt
No behavior change: regex-compat corpus rates unchanged
  Java 860/860 (100.00%) | Pcre2 547/860 (63.60%) | Re2 528/860 (61.40%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

1:1 port of org.pcre4j.regex.translate.RangeSet (257 LOC Java → ~210 LOC C++). Inlined a private emitLiteralInClass helper to avoid a circular dep on the (yet to be ported) ClassRenderer module; the duplication will go away (or be intentional) when Phase 4 lands. Tests ported 1:1 from RangeSetTest.java (212 LOC → 23 GTest cases). All 23 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
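To show what code-point range set algebra means in practice, here is a small sketch of how a Java intersection such as `[a-z&&[^aeiou]]` can be resolved into plain ranges that any engine accepts. It is not the ported RangeSet: the class name `RangeSetSketch` and its methods are hypothetical, and only normalisation, complement, and intersection are shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Inclusive code-point range [first, second].
using Range = std::pair<char32_t, char32_t>;

// Hypothetical minimal range set: sorted, non-overlapping, non-adjacent ranges.
class RangeSetSketch {
 public:
  RangeSetSketch() = default;
  explicit RangeSetSketch(std::vector<Range> ranges) : ranges_(std::move(ranges)) {
    normalize();
  }

  // Two-pointer sweep over both sorted range lists.
  RangeSetSketch intersect(const RangeSetSketch& other) const {
    std::vector<Range> out;
    std::size_t i = 0;
    std::size_t j = 0;
    while (i < ranges_.size() && j < other.ranges_.size()) {
      const char32_t lo = std::max(ranges_[i].first, other.ranges_[j].first);
      const char32_t hi = std::min(ranges_[i].second, other.ranges_[j].second);
      if (lo <= hi) {
        out.emplace_back(lo, hi);
      }
      // Advance whichever range ends first.
      if (ranges_[i].second < other.ranges_[j].second) ++i; else ++j;
    }
    return RangeSetSketch(std::move(out));
  }

  // Everything in [0, U+10FFFF] that is not in this set.
  RangeSetSketch complement() const {
    std::vector<Range> out;
    char32_t next = 0;
    for (const auto& r : ranges_) {
      if (r.first > next) {
        out.emplace_back(next, r.first - 1);
      }
      if (r.second == kMax) {
        return RangeSetSketch(std::move(out));
      }
      next = r.second + 1;
    }
    out.emplace_back(next, kMax);
    return RangeSetSketch(std::move(out));
  }

  const std::vector<Range>& ranges() const { return ranges_; }

 private:
  static constexpr char32_t kMax = 0x10FFFF;

  // Sorts the ranges and merges any that overlap or touch.
  void normalize() {
    std::sort(ranges_.begin(), ranges_.end());
    std::vector<Range> merged;
    for (const auto& r : ranges_) {
      if (!merged.empty() && r.first <= merged.back().second + 1) {
        merged.back().second = std::max(merged.back().second, r.second);
      } else {
        merged.push_back(r);
      }
    }
    ranges_ = std::move(merged);
  }

  std::vector<Range> ranges_;
};
```

Intersecting `a-z` with the complement of the five vowels yields the ranges b-d, f-h, j-n, p-t, and v-z, which can then be rendered as an ordinary character class without the `&&` operator.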

1:1 port of org.pcre4j.regex.translate.PropertyMap (185 LOC Java → ~190 LOC C++). Static lookup table from Java property names to PCRE2 equivalents (\p{javaXxx}, \p{InGreek}, \p{IsL}, posix Lower/Upper/...). Tests ported 1:1 from PropertyMapTest.java (68 LOC → 9 GTest cases). All pass. NOTE: JdkPropertyExpander (the other Phase-3 file in the original plan, 277 LOC) requires a full-Unicode codepoint scan via JDK Character API. It's deferred to Phase 4 where it plugs into the Evaluator. Either an ICU-based implementation or a pre-generated static table will land together with the AST/parser/evaluator work. Suite totals: 34 tests (2 scaffolding + 23 RangeSet + 9 PropertyMap). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…+ JdkPropertyExpander

Port the char-class AST, recursive parser, evaluator, renderer, and JdkPropertyExpander from pcre4j Java to Velox C++ without wiring to toPcre2Pattern.

JdkPropertyExpander uses Velox's existing ICU dependency (u_charType/uscript_getScript/ublock_getCode/binary properties) rather than generated tables or a stub.

Validation: velox_java_pcre2_translator_test passes 121/121; regex compat Java corpus remains 860/860 (100.00%).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Port the final top-level Java regex translation pipeline: property rewrites, character-class parsing/rendering, Java inline-flag normalization, escape handling, and conservative EvaluationFailedException paths for cases PCRE2 cannot safely express. Add translator pipeline coverage and regressions for block-qualified properties, comments-mode braces, flag scoping, cased-property folding, counted closures, long backreferences, and surrogate-pair unicode escapes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Pcre2Regex constructor now translates the incoming Java regex pattern to PCRE2 syntax via java_pcre2_translator::toPcre2Pattern before calling pcre2_compile. EvaluationFailedException is caught and the message surfaced verbatim via error_ (ok()=false).

OpenJDK corpus impact:

* Pcre2 547/860 (63.60%) → 762/860 (88.60%), +25 pp
* Pcre2 compile-errs 165 → 76 (-89)

Java backend unchanged at 860/860 (100%); Re2 unchanged at 528/860 (61.40%). RE2 wiring is Phase 7.

The 76 remaining Pcre2 compile-errors are corpus patterns using features the translator still passes through verbatim (e.g., character classes with intersection operands that include script ranges PCRE2 doesn't recognise, edge cases in posix class expansions, etc.). Will be incrementally addressed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add toRe2Pattern using the Java regex translator pipeline for Unicode property and character-class normalization, Java named-group rewriting, Java COMMENTS-mode translation, and RE2 octal escape rendering. Detect unsupported RE2 features up front, including lookaround, backreferences, possessive quantifiers, atomic groups, and Java flags without RE2 equivalents.

OpenJDK corpus:

* Java 860/860 (100.00%)
* Pcre2 762/860 (88.60%) (unchanged from Phase 6)
* Re2 755/860 (87.79%) (was 528/860 / 61.40%)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Expose a translator side channel for raw surrogate byte mode and rewrite surrogate escape aliases to byte-sequence PCRE2 before compiling without PCRE2_UTF. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Port Java property and Unicode block materialization from the pcre4j fork, including surrogate block sentinel handling and class parser/renderer updates. Adjust the PCRE2 backend for Java default shorthand semantics and raw surrogate UTF-8 matching used by the OpenJDK corpus. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…gate failures

Brings PCRE2 OpenJDK corpus to 860/860 (100.00%), matching Java.

Key changes:

* translator: lone-surrogate property tokens (\p{InHIGH_SURROGATES}, \p{InLOW_SURROGATES}, surrogate codepoints in \x{...}) now translated to raw-byte CESU-8 sequences in PCRE2 raw-byte mode
* Pcre2Regex: when raw-byte mode is selected, char-class ranges that include surrogates are rewritten to byte-sequence regex matching CESU-8 encoding
* PropertyMap / ClassRenderer / RangeSet: minor tweaks to support the surrogate handling path

Validation:

* velox_java_pcre2_translator_test: 174/174
* velox_regex_compat_test: 445 passed, 4 expected skips

OpenJDK corpus:

* Java 860/860 (100.00%)
* Pcre2 860/860 (100.00%) ← was 854/860
* Re2 757/860 (88.02%)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
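To make the "raw-byte CESU-8 sequences" mentioned above concrete, here is a minimal sketch of how a single lone surrogate maps to three bytes. The function name `encodeLoneSurrogate` is hypothetical and the PR's translator may structure this differently; the bit layout is simply the 3-byte UTF-8 pattern applied to a surrogate code unit, which strict UTF-8 forbids and which is why a non-UTF raw-byte mode is needed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the PR): encode one UTF-16 surrogate code
// unit (U+D800..U+DFFF) using the generic 3-byte layout 1110xxxx 10xxxxxx
// 10xxxxxx. The result always starts with 0xED followed by 0xA0..0xBF.
inline std::string encodeLoneSurrogate(char16_t unit) {
  assert(unit >= 0xD800 && unit <= 0xDFFF);
  std::string out;
  out.push_back(static_cast<char>(0xE0 | (unit >> 12)));
  out.push_back(static_cast<char>(0x80 | ((unit >> 6) & 0x3F)));
  out.push_back(static_cast<char>(0x80 | (unit & 0x3F)));
  return out;
}
```

Under this layout, high surrogates (U+D800..U+DBFF) produce a second byte in A0..AF and low surrogates (U+DC00..U+DFFF) produce a second byte in B0..BF, so a surrogate range in a character class can be expressed as byte ranges.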

Port 80 active RegExTest bodies to the typed regex-compat suite and add explicit TODO skips for Java-specific or adapter-limited cases. The port records per-backend compatibility rates while preserving Java as the canonical pass requirement. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

After P1.8 added 124 RegExTest cases with 28 GTEST_SKIP entries (Java-API-only behaviour with no C++ adapter equivalent), the per-backend tally listener was counting skipped tests as not-passing, making Java backend appear to drop from 100% to 89% (231/259).

Treat Skipped() as neither pass nor fail. Skipped count is reported separately for visibility:

* JavaRegex (typed) 231 / 231 (100%) [skipped: 28]
* Pcre2Regex (typed) 231 / 231 (100%) [skipped: 28]
* Re2Regex (typed) 227 / 227 (100%) [skipped: 32]

OpenJDK corpus + RegExTest separate-tally blocks (which use their own reporters, not GTest pass/fail) are unaffected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
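The tally rule in this commit can be sketched as follows. This is an illustration only: `Outcome` and `BackendTally` are hypothetical names, and the PR's actual listener hooks into GTest rather than a hand-rolled enum.

```cpp
#include <cassert>

// Hypothetical per-test outcome; stands in for GTest's result state.
enum class Outcome { kPassed, kFailed, kSkipped };

// Hypothetical per-backend tally: skipped tests are tracked separately
// and excluded from the denominator of the pass rate.
struct BackendTally {
  int passed = 0;
  int failed = 0;
  int skipped = 0;

  void record(Outcome outcome) {
    switch (outcome) {
      case Outcome::kPassed:
        ++passed;
        break;
      case Outcome::kFailed:
        ++failed;
        break;
      case Outcome::kSkipped:
        ++skipped; // neither pass nor fail
        break;
    }
  }

  // Number of tests that count toward the rate.
  int counted() const {
    return passed + failed;
  }

  double passRatePercent() const {
    assert(counted() > 0);
    return 100.0 * passed / counted();
  }
};
```

With 231 passes and 28 skips this reports 231 / 231, whereas counting skips in the denominator would give 231 / 259, the apparent 89% described above.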

…ssible patterns

When a corpus / RegExTest pattern uses features the target engine cannot support natively (RE2 lookaround, backref, possessive, atomic; PCRE2 has none in this category, so its translator-rejected list is empty), the translator throws EvaluationFailedException and the backend marks the pattern as not-ok with an error prefix "Java→RE2 translator: ...". These are engine ceilings, not implementation bugs.

Both OpenJDK corpus and RegExTest reports now print an additional "translatable subset" line that excludes such patterns:

OpenJDK corpus:

* Re2 757 / 860 (88.02%) [compile-err: 99]
* Re2 (translatable subset) 757 / 771 (98.18%) [excludes 89 translator-rejected]

RegExTest ported:

* Re2 84 / 96 (87.50%)
* Re2 (translatable subset) 84 / 93 (90.32%) [excludes 3 translator-rejected]

Implementation:

- OpenJdkCorpusDiffTest tracks translatorRejected based on the Re2Regex/Pcre2Regex error_ string prefix.
- RegExTestPortedTest adds a thread_local flag tlsTranslatorRejected set by a notePatternStatus helper called from every find/match helper; the PORTED_REGEX_TEST macro resets it before each test body and recordCase consumes it afterwards.

Java / Pcre2 corpus remain 100.00%; Re2 raw rate unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
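The "translatable subset" arithmetic can be sketched like this. The struct and function names are hypothetical, and the prefix constant is copied from the commit message rather than from the code, so treat the exact string as an assumption.

```cpp
#include <cassert>
#include <string>

// Assumed error prefix, as quoted in the commit message above.
inline const std::string kRe2TranslatorPrefix = "Java→RE2 translator: ";

// Hypothetical check: a pattern counts as translator-rejected when the
// backend's error string starts with the translator prefix.
inline bool isTranslatorRejected(const std::string& error) {
  return error.rfind(kRe2TranslatorPrefix, 0) == 0;
}

// Hypothetical corpus counters for one backend.
struct CorpusStats {
  int total = 0;
  int passed = 0;
  int translatorRejected = 0;
};

// Raw rate: every corpus case is in the denominator.
inline double rawRatePercent(const CorpusStats& stats) {
  assert(stats.total > 0);
  return 100.0 * stats.passed / stats.total;
}

// Translatable-subset rate: translator-rejected cases are excluded.
inline double translatableRatePercent(const CorpusStats& stats) {
  const int denominator = stats.total - stats.translatorRejected;
  assert(denominator > 0);
  return 100.0 * stats.passed / denominator;
}
```

Plugging in the reported Re2 corpus numbers (860 total, 757 passed, 89 rejected) gives a denominator of 771, which matches the 757 / 771 line above.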

…pply pre-commit formatting

CI on PR facebookincubator#17685 was failing on:

* GCC 14 -Werror=dangling-pointer in ClassBodyParserTest:
    auto* inter = parse("...").getIf<Intersection>(); // dangling
  Bound to a named local first instead.
* pre-commit: license-header / clang-format / gersemi. Ran 'pre-commit run --files ...' over the touched files; only formatting changes (no behavioural diff).

Verified:

* velox_java_pcre2_translator_test: 178/178 passing
* OpenJDK corpus: Java/Pcre2 860/860, Re2 757/860 (98.18% translatable subset)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
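The dangling-pointer fix can be illustrated with a small stand-alone sketch. The `Node`, `Literal`, `Intersection`, and `parse` definitions below are hypothetical stand-ins and do not reproduce the PR's AST; only the shape of the bug and of the fix follows the commit message.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Hypothetical AST alternatives, much simpler than the PR's real ones.
struct Literal {
  std::string text;
};
struct Intersection {
  std::string lhs;
  std::string rhs;
};

// Hypothetical node wrapper exposing a getIf<T>() accessor.
class Node {
 public:
  explicit Node(std::variant<Literal, Intersection> value)
      : value_(std::move(value)) {}

  template <typename T>
  const T* getIf() const {
    return std::get_if<T>(&value_);
  }

 private:
  std::variant<Literal, Intersection> value_;
};

// Hypothetical parser: splits on the first "&&" if present.
inline Node parse(const std::string& body) {
  const auto pos = body.find("&&");
  if (pos == std::string::npos) {
    return Node(Literal{body});
  }
  return Node(Intersection{body.substr(0, pos), body.substr(pos + 2)});
}

inline std::string rightOperand(const std::string& body) {
  // Dangling form: the temporary Node dies at the end of the statement,
  // leaving `inter` pointing into destroyed storage.
  //   auto* inter = parse(body).getIf<Intersection>();
  // Fixed form: bind the result to a named local so it outlives `inter`.
  const Node node = parse(body);
  const auto* inter = node.getIf<Intersection>();
  return inter != nullptr ? inter->rhs : std::string();
}
```

Keeping the parse result in a named local costs nothing at runtime here and makes the lifetime explicit to both the compiler and the reader.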

mbasmanova

Thank you for the thorough proposal and for addressing each of the historical concerns from #9897 and #9823. The approach is reasonable — PCRE2 with built-in limits is a better answer than Oniguruma or Hyperscan, and the translator improving RE2's corpus pass-rate from 61% to 88% is valuable on its own.

However, 12K lines in a single PR is not reviewable. This contains three independent components that should be separate PRs:

1. Java→PCRE2/RE2 translator (velox/functions/lib/java_pcre2_translator/): builds unconditionally, benefits RE2 immediately, no new dependencies.
2. PCRE2 engine wrapper (velox/external/regex_compat/): the new engine, gated behind VELOX_ENABLE_PCRE2_BACKEND.
3. Test harness (velox/external/regex_compat/tests/): the JVM-oracle-based test framework.

Items 2 and 3 need broader discussion before we commit to them. Adding a second regex engine is a significant maintenance burden — two code paths to test, debug, and keep compatible. The opt-in CMake flag means CI either tests both paths (doubling regex test matrix) or only tests RE2 (leaving PCRE2 undertested). We'd like to discuss the long-term ownership and maintenance plan before merging the engine itself.

Please split accordingly. The translator PR alone would be a significant contribution and can be reviewed independently. The PCRE2 wrapper and test harness can follow once the translator lands.

For the translator PR: please add documentation (README or header-level docs) explaining what transformations it performs, what it doesn't handle, and the testing strategy. The 178 unit tests are there, but a reader needs to understand what coverage they provide — are these ported 1:1 from pcre4j? Do they cover the transformations that matter most for Spark/Presto patterns?

Also: before the translator PR, we'd like to understand the integration plan. How will existing Velox functions (regexp_extract, regexp_replace, regexp_like, like, Spark split) use the translator and/or PCRE2? Is this a per-function opt-in, a global config, or a per-query setting? The translator library is useful, but the review needs to understand how it connects to the rest of the system.

CI 'Build with GCC / Linux release with adapters' was failing because:

* clang-tidy scans every diff-touched header in isolation; JavaRegex.h / JvmFixture.h unconditionally include <jni.h>, which is absent on hosts without a JDK.
* RegExTestPortedTest.cpp's PORTED_REGEX_TEST macro and OpenJdkCorpusDiffTest.cpp's Grapheme test both reference the JavaRegex symbol unconditionally; when JAVA_BACKEND is OFF the symbol is not declared and compilation fails.

Fix:

* Wrap the bodies of JavaRegex.h and JvmFixture.h in #if VELOX_REGEX_COMPAT_HAS_JAVA so they compile to nothing when the Java backend is off.
* In RegExTestPortedTest.cpp, gate the 'Java backend regression' EXPECT in PORTED_REGEX_TEST behind the same macro.
* In OpenJdkCorpusDiffTest.cpp, gate toJString / directJavaGraphemeBreakOffsets helpers and every 'is_same_v<TypeParam, JavaRegex>' branch behind the macro.

Verified both build paths locally:

* -DVELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=ON -> regex_compat_test passes (Java/Pcre2 860/860, Re2 757/860)
* -DVELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=OFF -> builds clean

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…_COMPAT_HAS_JAVA

Follow-up to previous commit: the .h files were guarded but the .cpp files still unconditionally use JNI types (jint, jclass, ...) which clang-tidy on hosts without a JDK cannot resolve, causing 'Build with GCC / Linux release with adapters' to fail.

Wrap each .cpp body in #if VELOX_REGEX_COMPAT_HAS_JAVA / #endif so they compile to an empty translation unit when the Java backend is off.

Verified locally: JAVA_BACKEND=ON: regex_compat_test passes (Java/Pcre2 860/860, Re2 757/860 / 98.18% translatable subset)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

baibaichen requested a review from majetideepak as a code owner June 1, 2026 16:51

meta-cla Bot added the CLA Signed label Jun 1, 2026

baibaichen and others added 26 commits June 1, 2026 17:25

regex-compat: add README (Phase 11)

8d41026

Fix escaped non-ASCII class literals

16f3ae2

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

copilot and others added 10 commits June 1, 2026 17:25

Allow PCRE2 surrogate corpus patterns

30ec781

regex-compat: add OpenJDK GraphemeTestCases corpus (Java-only)

705310c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

regex-compat: extend RegExTest port (16 more tests)

b4a1fcb

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

regex-compat: translator pre-folds cased literals for UNICODE_CASE

6c87822

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

baibaichen force-pushed the feat/regex-compat-pcre2-cmake branch from 082c3b0 to d4fdfe6 Compare June 1, 2026 17:52

mbasmanova requested review from amitkdutta and kgpai June 2, 2026 00:20

mbasmanova requested changes Jun 2, 2026

copilot and others added 2 commits June 2, 2026 04:04

feat: Add opt-in PCRE2 regex engine and Java-regex compatibility test suite #17685

baibaichen wants to merge 38 commits into facebookincubator:main from baibaichen:feat/regex-compat-pcre2-cmake

Proposal

Why a second engine, not a replacement

Five concerns, one by one

What's in the PR

1. PCRE2 engine wrapper — velox/external/regex_compat/Pcre2Regex.{h,cpp}

2. Java→PCRE2/RE2 translator — velox/functions/lib/java_pcre2_translator/

3. Test harness — velox/external/regex_compat/tests/

Default-build impact

Motivation — gap is documented, recurring, and previously unmeasured

Validation

Out of scope

File-level overview

Licensing

netlify Bot commented Jun 1, 2026

✅ Deploy Preview for meta-velox canceled.

github-actions Bot commented Jun 1, 2026

Build Impact Analysis
