
feat: Add opt-in PCRE2 regex engine and Java-regex compatibility test suite #17685

Open
baibaichen wants to merge 38 commits into facebookincubator:main from baibaichen:feat/regex-compat-pcre2-cmake

Conversation

@baibaichen
Contributor

@baibaichen baibaichen commented Jun 1, 2026

Proposal

Add PCRE2 as a second regex engine in Velox, enabled at compile time via a new CMake option (default OFF). The existing RE2 path is unchanged; users that need fuller java.util.regex semantics — primarily Spark / Presto front-ends running through Gluten — can opt in at build time.

The PR also ships:

  • a Java-regex-syntax translator library that also benefits RE2 (lifts RE2's OpenJDK 17 corpus pass-rate from 528/860 → 757/860, +27 pp), and
  • a multi-backend test harness that uses an embedded JVM as Java-semantic oracle.

Why a second engine, not a replacement

Past discussions (#9897, #9823) repeatedly raised the gap between RE2 and java.util.regex, and raised equally legitimate concerns about every proposed alternative. PCRE2 is qualitatively different from Oniguruma / Hyperscan, and this PR addresses each historical objection directly:

Five concerns, one by one

ReDoS / non-linear runtime (@mbasmanova on #9823): "It is not realistic to expect users to 'check the provided patterns carefully'."

PCRE2 has built-in LIMIT_MATCH / LIMIT_DEPTH / LIMIT_HEAP that the engine itself enforces. Our Pcre2Regex wrapper sets conservative defaults and exposes them as Options fields. Hitting a limit surfaces a typed error rather than running forever. This is the same protection Java 9+ added to java.util.regex. Oniguruma and Hyperscan do not have this.
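To make the "typed error rather than running forever" point concrete, here is a toy sketch. It is not PCRE2 and not the Pcre2Regex wrapper: it is a tiny backtracking matcher (literals, `.`, and `*` only) with a step budget, and every name in it (`MatchStatus`, `limitedMatch`) is hypothetical. In PCRE2 itself the corresponding knobs are set on a match context via `pcre2_set_match_limit`, `pcre2_set_depth_limit`, and `pcre2_set_heap_limit`.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result type: hitting the limit is its own status,
// distinct from an ordinary "no match".
enum class MatchStatus { kMatch, kNoMatch, kLimitExceeded };

namespace detail {
// Full-match backtracking over pattern p and text t.
// Supports literals, '.', and postfix '*'. Every call costs one step.
inline MatchStatus matchHere(std::string_view p, std::string_view t,
                             std::size_t& stepsLeft) {
  if (stepsLeft == 0) return MatchStatus::kLimitExceeded;
  --stepsLeft;
  if (p.empty()) {
    return t.empty() ? MatchStatus::kMatch : MatchStatus::kNoMatch;
  }
  const bool firstMatches = !t.empty() && (p[0] == '.' || p[0] == t[0]);
  if (p.size() >= 2 && p[1] == '*') {
    // Greedy: try to consume one more character, then fall back to zero.
    if (firstMatches) {
      const MatchStatus s = matchHere(p, t.substr(1), stepsLeft);
      if (s != MatchStatus::kNoMatch) return s;  // match or limit: stop
    }
    return matchHere(p.substr(2), t, stepsLeft);
  }
  if (!firstMatches) return MatchStatus::kNoMatch;
  return matchHere(p.substr(1), t.substr(1), stepsLeft);
}
}  // namespace detail

// Runs the toy matcher with a caller-chosen step budget.
inline MatchStatus limitedMatch(std::string_view pattern,
                                std::string_view text,
                                std::size_t matchLimit) {
  std::size_t stepsLeft = matchLimit;
  return detail::matchHere(pattern, text, stepsLeft);
}
```

With a budget of 1000 steps, `a*b` against `aaab` matches normally, while the pathological `a*a*a*a*a*b` against 24 `a` characters stops with `kLimitExceeded` instead of exploring every split.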

Oniguruma trades reliability for coverage (@mbasmanova on #9897).

We do not ship Oniguruma.

Hyperscan effectively abandoned (@philo-he on #9823).

We do not ship Hyperscan. PCRE2 by contrast is actively maintained (PHP, Apache HTTPD, nginx, git all depend on it; releases roughly every 6 months).

Velox SQL is single-pattern per row; multi-pattern engines don't pay off (@mbasmanova on #9823).

PCRE2 is a single-pattern engine, so it matches that constraint.

User must opt in if accepting any risk (@FelixYBW on #9823): "In Gluten we use a config to enable re2 offload. Gluten user needs to take the risk if they enable."

The new CMake option VELOX_ENABLE_PCRE2_BACKEND defaults to OFF, which is the same opt-in pattern. A default cmake ... invocation adds zero new dependencies.

What's in the PR

1. PCRE2 engine wrapper — velox/external/regex_compat/Pcre2Regex.{h,cpp}

Production-quality wrapper around PCRE2 with a re2::RE2-shaped surface so call sites that opt in can switch with minimal churn:

  • JIT-compiles on construction (falls back to interpreter where unavailable).
  • Honors Options.{caseSensitive, dotNl, oneLine, ...} matching Velox's existing Re2Regex options.
  • Sets PCRE2's built-in match / depth / heap limits to safe defaults; exposed as Options fields so callers can tune them. Hitting a limit surfaces a typed error, not a hang.
  • Integrates the translator below: every Java pattern goes through toPcre2Pattern before pcre2_compile, so Java-specific syntax (\p{InGreek}, \p{javaXxx}, (?U) flag inversion, char-class intersection, …) just works.

2. Java→PCRE2/RE2 translator — velox/functions/lib/java_pcre2_translator/

Reusable C++ library that rewrites a java.util.regex pattern into PCRE2- or RE2-compatible syntax:

  • 1:1 port of pcre4j PR alexey-pelykh/pcre4j#606, "(feat) regex: java.util.regex compatibility translator + :compat-test harness (68% -> 93.7%)". I authored that pcre4j PR (wrote the org.pcre4j.regex.translate module from scratch and contributed it upstream); pcre4j does not require a CLA, so I retain copyright on that contribution and am dual-licensing the C++ port here under Apache-2.0. Per-file headers carry the attribution; LICENSE-NOTICE.md documents provenance.
  • Adds toRe2Pattern that reuses the same parser/evaluator/property map and additionally rejects features RE2 fundamentally cannot represent (lookaround, backref, possessive, atomic, (?U) semantic inversion) — throws EvaluationFailedException rather than silently changing semantics.
  • Builds unconditionally (it's a functions/lib/ member); production call sites can adopt it independently of the PCRE2 backend.
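The "reject rather than silently change semantics" idea behind toRe2Pattern can be sketched as an up-front scan. This is a simplified illustration, not the translator's actual code: `findRe2Unsupported` is a hypothetical name, and the scan ignores nested character classes, `\Q...\E` quoting, `{n}+` possessive forms, and inline flags such as (?U).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Returns a short reason if the pattern uses a construct RE2 cannot
// express, or std::nullopt if this simplified scan finds none.
inline std::optional<std::string> findRe2Unsupported(std::string_view p) {
  bool inClass = false;  // inside [...] (nesting is not tracked here)
  for (std::size_t i = 0; i < p.size(); ++i) {
    const char c = p[i];
    if (c == '\\') {
      if (i + 1 < p.size()) {
        const char next = p[i + 1];
        // \1..\9 and \k<name> outside a class are backreferences.
        if (!inClass && ((next >= '1' && next <= '9') || next == 'k')) {
          return std::string("backreference");
        }
        ++i;  // skip the escaped character
      }
      continue;
    }
    if (inClass) {
      if (c == ']') inClass = false;
      continue;
    }
    if (c == '[') {
      inClass = true;
      continue;
    }
    const std::string_view rest = p.substr(i);
    if (rest.substr(0, 3) == "(?=" || rest.substr(0, 3) == "(?!" ||
        rest.substr(0, 4) == "(?<=" || rest.substr(0, 4) == "(?<!") {
      return std::string("lookaround");
    }
    if (rest.substr(0, 3) == "(?>") {
      return std::string("atomic group");
    }
    // A quantifier directly followed by '+' is possessive in Java.
    if ((c == '*' || c == '+' || c == '?') && i + 1 < p.size() &&
        p[i + 1] == '+') {
      return std::string("possessive quantifier");
    }
  }
  return std::nullopt;
}
```

A caller would turn a non-empty result into an error (the PR uses EvaluationFailedException for this) instead of handing RE2 a pattern whose meaning would change.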

Headline result for RE2: when the existing Re2Regex is wired through this translator, RE2's OpenJDK 17 corpus pass-rate jumps from 528/860 (61.40%) to 757/860 (88.02%) without changing the engine. That benefit alone — for anyone who wants to stay on RE2 — is the second reason to merge.

3. Test harness — velox/external/regex_compat/tests/

Three-backend GTest framework (Re2Regex, Pcre2Regex, JavaRegex via embedded JVM as oracle), parameterised via TYPED_TEST_SUITE. Runs:

  • OpenJDK 17 corpus (4 files, 860 + 5 cases) fetched at CMake configure time, SHA-pinned to jdk-17.0.13-ga.
  • 96 ported tests from OpenJDK RegExTest.java.
  • 178 translator unit tests (1:1 from pcre4j).

A TestMain guard asserts the Java backend remains 100% and loudly warns otherwise — Java is java.util.regex, so any failure there means our JNI bridge / adapter is wrong, not an engine difference. The harness is gated behind VELOX_ENABLE_REGEX_COMPAT_TESTS=OFF with a further sub-option VELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=OFF that auto-disables if no JDK is found.

Default-build impact

Zero. With the default cmake ... invocation:

  • No PCRE2 download or link.
  • No JNI / JDK probe.
  • No new code in any existing target.
# default — nothing changes
cmake -S . -B build

# enable PCRE2 engine for production use (adds PCRE2 10.47 via FetchContent)
cmake -S . -B build -DVELOX_ENABLE_PCRE2_BACKEND=ON

# also run the compatibility test suite
cmake -S . -B build -DVELOX_ENABLE_PCRE2_BACKEND=ON \
                    -DVELOX_ENABLE_REGEX_COMPAT_TESTS=ON

# include the Java JNI oracle in the test suite (needs JDK >= 17)
cmake -S . -B build -DVELOX_ENABLE_PCRE2_BACKEND=ON \
                    -DVELOX_ENABLE_REGEX_COMPAT_TESTS=ON \
                    -DVELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=ON

Motivation — gap is documented, recurring, and previously unmeasured

Validation

OpenJDK 17 corpus (860 cases)             Java   860 / 860  100.00%
                                          Pcre2  860 / 860  100.00%   ←  drop-in Java parity
                                          Re2    757 / 860   88.02%   (98.18% on translatable subset *)
                                          Re2 raw (no translator)  528 / 860  61.40%

OpenJDK RegExTest.java (96 ported)        Java    96 /  96  100.00%
                                          Pcre2   94 /  96   97.92%
                                          Re2     84 /  96   87.50%   (90.32% on translatable subset *)

Translator unit tests                                        178 / 178 passing

* translatable subset = the test set excluding patterns the RE2 translator deliberately rejects as engine-impossible (lookaround, backref, possessive quantifiers, atomic groups, (?U) semantic inversion). RE2's algorithm guarantees linear time and intentionally does not support these constructs.

Net takeaway for users that opt in to PCRE2: effective Java parity (100% on the OpenJDK corpus, 97.92% on RegExTest), with PCRE2's own resource-cap protection enforcing predictable runtime.

Net takeaway for users that stay on RE2: the translator alone, applied to RE2, picks up 229 corpus cases (+27 pp) without touching the engine or its linear-time guarantee.
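For readers who want to re-derive the headline figures, the arithmetic is small enough to sketch. The struct and function names below are illustrative only; the counts are the ones reported in the Validation table.

```cpp
#include <cassert>

// Pass count over a corpus (hypothetical helper type).
struct Rate {
  int passed;
  int total;
};

// Pass rate as a percentage.
inline double percent(Rate r) {
  assert(r.total > 0);
  return 100.0 * r.passed / r.total;
}

// Difference between two pass rates, in percentage points.
inline double deltaPoints(Rate before, Rate after) {
  return percent(after) - percent(before);
}

// Number of additional cases that pass.
inline int gainedCases(Rate before, Rate after) {
  return after.passed - before.passed;
}
```

Plugging in RE2 raw (528 of 860) and RE2 with the translator (757 of 860) gives roughly 61.40% and 88.02%, a gain of 229 cases, or about 26.6 percentage points, which the description rounds to +27 pp.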

Out of scope

File-level overview

velox/external/regex_compat/                 ~3.0 kLOC   PCRE2 backend + test harness (default OFF)
  Re2Regex / Pcre2Regex / JavaRegex          three parallel concrete classes, re2::RE2-shaped API
  JavaMatcherAdapter                         re-builds java.util.regex.Matcher state machine
  BackendTestBase, TestMain                  TYPED_TEST plumbing + Java=100% guard
  tests/                                     OpenJDK corpus + ported pcre4j tests + RegExTest port

velox/functions/lib/java_pcre2_translator/   ~2.4 kLOC   reusable translator library
  PropertyMap / JdkPropertyExpander          \p{InGreek} / \p{javaXxx} / \p{IsXxx} rewriting
  ClassBodyParser / Evaluator / ClassRenderer  char-class AST + intersection resolution
  RangeSet                                   code-point range set algebra
  JavaRegexTranslator                        toPcre2Pattern / toRe2Pattern pipelines
  tests/                                     178 unit tests, 1:1 from pcre4j PR #606

Total: ~5.4 kLOC added; 0 lines changed in any existing source file outside CMake wiring.

Licensing

The translator port derives from pcre4j PR alexey-pelykh/pcre4j#606, which I authored (the entire org.pcre4j.regex.translate module was a from-scratch contribution by me to upstream pcre4j). pcre4j does not require a Contributor License Agreement, so copyright on my contribution remains mine, and I am dual-licensing it under Apache-2.0 for inclusion in this Apache-2.0 project.

The pcre4j upstream files carry a Copyright (C) 2024-2026 Oleksii PELYKH header — that is pcre4j's project-wide convention applied to all merged contributions, not a copyright assignment from contributors. The actual authorship of the translate module is mine and the Apache-2.0 grant here is therefore valid.

Per-file attribution headers in this PR retain the pcre4j provenance; LICENSE-NOTICE.md in the translator directory documents the full chain.

@baibaichen baibaichen requested a review from majetideepak as a code owner June 1, 2026 16:51
@netlify

netlify Bot commented Jun 1, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 4f13785
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/6a1f90d0f85b6d0008501f24

@meta-cla meta-cla Bot added the CLA Signed label Jun 1, 2026
@github-actions

github-actions Bot commented Jun 1, 2026

Build Impact Analysis

Full build recommended. Files outside the dependency graph changed:

  • velox/external/regex_compat/CMakeLists.txt
  • velox/external/regex_compat/JavaRegex.cpp
  • velox/external/regex_compat/JavaRegex.h
  • velox/external/regex_compat/JvmFixture.cpp
  • velox/external/regex_compat/JvmFixture.h
  • velox/external/regex_compat/Pcre2Regex.cpp
  • velox/external/regex_compat/Pcre2Regex.h
  • velox/external/regex_compat/README.md
  • velox/external/regex_compat/Re2Regex.cpp
  • velox/external/regex_compat/Re2Regex.h
  • ... and 18 more

These directories are not fully covered by the dependency graph. A full build is the safest option.

cmake --build _build/release

Slow path • Graph generated from PR branch

baibaichen and others added 26 commits June 1, 2026 17:25
Adds:
  * CMake/Findpcre2.cmake             — system libpcre2-8 detection (CONFIG → pkg-config)
  * CMake/resolve_dependency_modules/pcre2.cmake
                                       — FetchContent fallback (PCRE2 10.45, 8-bit only, JIT on)
  * VELOX_ENABLE_REGEX_COMPAT_TESTS option (default OFF)
  * conditional velox_set_source/resolve_dependency(pcre2) wiring

Verified:
  * default OFF config succeeds, no PCRE2 in build graph
  * -DVELOX_ENABLE_REGEX_COMPAT_TESTS=ON fetches sha256
    0e138387df7835d7403b8351e2226c1377da804e0737db0e071b48f07c9d12ee
    and links libpcre2-8.a (965 KB)

This is the build-infrastructure layer of the design in
docs/superpowers/specs/2026-05-29-pcre2-cpp-test-suite-design.md.
The helper layer and ported tests follow in subsequent commits.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verified:
  * curl + sha256sum on pcre2-10.47.tar.gz =
    c08ae2388ef333e8403e670ad70c0a11f1eed021fd88308d7e02f596fcd9dc16
  * FetchContent + build pcre2-8-static target succeeds
  * libpcre2-8.a links cleanly (980 KB, JIT enabled)
  * configure reports PACKAGE_VERSION='10.47'

Spec and plan docs updated in sync; fallback target on build
failure noted as 10.45.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…X_COMPAT_JAVA_BACKEND)

When VELOX_ENABLE_REGEX_COMPAT_TESTS=ON, also try find_package(JNI):
  * found → log STATUS, enable an embedded-JVM Java backend in the
    forthcoming regex-compat test suite (3rd backend / live oracle)
  * not found → log WARNING and auto-disable, suite still builds with
    PCRE2 + RE2 only

Default ON, fully gated under the (default-OFF) parent option, so stock
Velox builds are unaffected. This is the only place in upstream Velox
that touches JNI.

Verified:
  * parent OFF → JNI probe never runs
  * parent ON, JDK present → 'Regex-compat: enabling embedded-JVM Java
    backend (JNI: /usr/lib/jvm/default-java/include;...)' appears
  * parent ON, this option explicit OFF → probe skipped silently

Auto-degrade branch (JNI absent) is implemented per standard CMake
idiom but not exercised in this commit's verification (find_package
falls back to /usr/lib/jvm/* system paths even with bogus JAVA_HOME).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
First of three regex backends for the Java-regex compatibility test
suite (Option E: three parallel concrete classes, no virtual, names
shaped after re2::RE2 subset, semantics aligned with java.util.regex).

Files:
  * velox/external/regex_compat/RegexTypes.h   — shared Anchor + Options
  * velox/external/regex_compat/Re2Regex.h     — public class
  * velox/external/regex_compat/Re2Regex.cpp   — RE2 wrap, reuses Velox
    prepareRegexpReplacePattern / prepareRegexpReplaceReplacement
    (Re2Functions.h:402,422) for Java -> RE2 translation
  * velox/external/regex_compat/CMakeLists.txt — opt-in static lib
  * velox/external/regex_compat/tests/Re2RegexTest.cpp — 11 GTest cases
  * velox/CMakeLists.txt — add_subdirectory under VELOX_ENABLE_REGEX_COMPAT_TESTS

Verified (logic-level): standalone compile + smoke-test outside the
full Velox build (system libre2 + stubbed prepare*) shows 17/18 PASS
on the subset that doesn't depend on Java-syntax translation; the 1
remaining case was a test-assertion bug (RE2 error text is 'invalid
perl operator: (?=' not 'lookahead'), already fixed in this commit.

Full in-tree CMake build of velox_regex_compat_test is unverified in
this session — the from-scratch Velox build (boost/folly/abseil...)
takes hours, exceeding session budget. Next session should:
  cmake -S . -B build -GNinja -DVELOX_ENABLE_REGEX_COMPAT_TESTS=ON \
                      -DVELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=OFF
  cmake --build build --target velox_regex_compat_test
  ctest --test-dir build -R velox_regex_compat_test --output-on-failure

Pcre2Regex + JavaRegex backends follow in subsequent commits.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Second of three regex backends.  Same Option E architecture as
Re2Regex: standalone concrete class, method names mirror re2::RE2,
input accepts Java java.util.regex syntax.

Files:
  * Pcre2Regex.h / .cpp — backend using pcre2_compile_8/match_8 with JIT
  * tests/Pcre2RegexTest.cpp — 12 GTest cases including lookahead +
    backref demos (features PCRE2 supports but RE2 doesn't)

Key design choices:
  * PCRE2_UTF + PCRE2_UCP for Unicode-aware semantics
  * pcre2_jit_compile_8 with PCRE2_JIT_COMPLETE; falls back to interpreter
    on platforms without JIT
  * pcre2_substitute_8 with PCRE2_SUBSTITUTE_GLOBAL +
    PCRE2_SUBSTITUTE_EXTENDED — natively handles Java $N / ${name}
    replacement syntax, no translation layer needed
  * Java pattern syntax (?<name>...) accepted natively by PCRE2; no
    pre-flight scanner or translation layer in this class (Java-specific
    features like \p{InGreek}, character-class intersection, or (?U)
    flag will surface as test failures, documenting the need for a
    future Java->PCRE2 translator; cf. pcre4j PR alexey-pelykh/pcre4j#606)

Test run (in-tree):
  $ ctest --test-dir cmake-build-debug -R velox_regex_compat_test
  [==========] 23 tests from 2 test suites ran. (2 ms total)
  [  PASSED  ] 23 tests.
    - 11 Re2RegexTest (incl. javaNamedGroup + globalReplace via Velox
      prepareRegexpReplacePattern / prepareRegexpReplaceReplacement)
    - 12 Pcre2RegexTest (incl. lookaheadSupported + backrefSupported
      demonstrating PCRE2's expanded language coverage)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Third and final regex backend.  Uses JNI_CreateJavaVM to boot an
embedded JVM in the test process and drives java.util.regex.Pattern /
Matcher through JNI calls.  Java is the canonical source of truth for
the Java-regex semantics that the other two backends approximate.

Files:
  * JvmFixture.h / .cpp — process-singleton JVM owner; GTest
    GlobalEnvironment registered via JvmFixture::Register() in main
  * JavaRegex.h / .cpp — JavaRegex class; constructs by calling
    Pattern.compile, runs find()/matches()/lookingAt()/replaceAll
    via cached jmethodIDs
  * tests/TestMain.cpp — entry point that registers the JVM fixture
    when VELOX_REGEX_COMPAT_HAS_JAVA=1
  * tests/JavaRegexTest.cpp — 13 GTest cases including
    javaSpecificPropertyInLC (\p{InGreek}) which works natively in
    Java but won't in PCRE2 without a future Java->PCRE2 translator
  * CMakeLists.txt: conditional source list controlled by
    VELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND (uses velox_compile_definitions
    + velox_include_directories — target_* fails on ALIAS targets)

Notable JNI details:
  * Class refs (Pattern, Matcher, Map, Set, Iterator, Entry, Integer,
    String) cached as NewGlobalRef
  * jmethodIDs cached at first construction via std::call_once
  * Pattern.namedGroups() (JDK 20+) probed optionally — falls back to
    empty named map if unavailable
  * Match() materialises input[0..endpos) to a Java String (Java's
    Matcher needs a CharSequence; we'd need a custom CharSequence impl
    to avoid the copy, not worth it for tests)
  * GlobalReplace counts matches via find()-loop before calling
    replaceAll — the public API doesn't expose the substitution count

Test run (in-tree, JNI ON):
  $ cmake-build-debug/.../velox_regex_compat_test
  [==========] 36 tests from 3 test suites ran. (193 ms total)
  [  PASSED  ] 36 tests.
    - 11 Re2RegexTest    (RE2 via Velox prepare* translation)
    - 12 Pcre2RegexTest  (PCRE2 native Java-syntax acceptance)
    - 13 JavaRegexTest   (java.util.regex via embedded JVM)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ase 7)

GTest TYPED_TEST_SUITE templated on a type list of all enabled
backends.  One TYPED_TEST() declaration compiles into N test
instances at compile time (N = 2 if JNI off, 3 if JNI on), so a
single test body validates that all three engines deliver identical
behaviour for the same Java input.
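The "one test body, N backends" shape can be sketched without GTest using a variadic type list. This is an illustration of the idea only: the two backends below are hypothetical stand-ins, not Re2Regex / Pcre2Regex / JavaRegex, and the real suite uses TYPED_TEST_SUITE rather than this hand-rolled runner.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in backends: same static shape, no common base class.
struct ExactBackend {
  static constexpr const char* kName = "Exact";
  static bool fullMatch(std::string_view pattern, std::string_view text) {
    return pattern == text;
  }
};

struct DotBackend {
  static constexpr const char* kName = "Dot";
  // Like ExactBackend, but '.' in the pattern matches any single char.
  static bool fullMatch(std::string_view pattern, std::string_view text) {
    if (pattern.size() != text.size()) return false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
      if (pattern[i] != '.' && pattern[i] != text[i]) return false;
    }
    return true;
  }
};

struct Outcome {
  std::string backend;
  bool passed;
};

// The test body is written once, against the shared shape.
template <typename Backend>
Outcome dotMatchesAnyChar() {
  return {Backend::kName, Backend::fullMatch("a.c", "abc")};
}

// Instantiates the body once per backend in the type list.
template <typename... Backends>
std::vector<Outcome> runOnAll() {
  return {dotMatchesAnyChar<Backends>()...};
}
```

Calling `runOnAll<ExactBackend, DotBackend>()` yields one outcome per backend; a backend that diverges from the expected behaviour simply reports a failure, with no per-backend branching in the test body.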

Files:
  * tests/BackendTestBase.h — AllBackends typelist + BackendTest fixture
  * tests/BackendTypedTest.cpp — 13 typed cases covering compile, match
    (unanchored/anchored), find, fullPartialMatch, globalReplace (both
    $N and ${name}), case-insensitive, dotAll, multiline anchors,
    empty-group sentinel

Also fixes a Re2Regex bug discovered by the new multiline typed test:
  * Java MULTILINE doesn't map to any single RE2 Options bit; RE2's
    `one_line=false` (the default) still requires inline `(?m)` to
    make `^`/`$` match around \n.  Re2Regex now prefixes `(?m)`
    when `opt.oneLine == false`, mirroring Java's MULTILINE semantics
    additively without affecting `.` or other metas.

Test result:
  [==========] 75 tests from 6 test suites ran. (86 ms total)
  [  PASSED  ] 75 tests.
    - 11 Re2RegexTest (Re2-specific)
    - 12 Pcre2RegexTest (PCRE2-specific incl. lookahead/backref)
    - 13 JavaRegexTest (Java-specific incl. \p{InGreek})
    - 39 BackendTest/* — 13 typed-cases x 3 backends, cross-checked

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds JavaMatcherAdapter<R> (test-only header) reconstructing the Java
Matcher state machine (find/group/start/end/replaceAll/replaceFirst)
on top of the backend's stateless Match() API.  Then ports
representative cases from pcre4j's PatternTests.java,
MatcherMatchingTests.java, and MatcherReplacementTests.java as typed
tests that run against every enabled backend.

Files added:
  * tests/JavaMatcherAdapter.h
  * tests/PatternPortedTest.cpp (13 cases x N backends)
  * tests/MatcherMatchingPortedTest.cpp (14 cases x N backends)
  * tests/MatcherReplacementPortedTest.cpp (11 cases x N backends)

Fixes a real bug in JavaRegex along the way:
  * splitUnicodeDelimiters surfaced that Java's Matcher.region() takes
    UTF-16 char offsets, not UTF-8 bytes, and Matcher.start()/end()
    returns char offsets too.  Added javaCharOffsetToByteOffset and
    byteOffsetToJavaCharOffset helpers in JavaRegex.cpp and routed
    region() / start() / end() through them so the backend correctly
    handles non-ASCII input.
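The offset mismatch is easy to see in a small sketch. The functions below are hypothetical illustrations of the conversion, not the helpers added in JavaRegex.cpp, and they assume well-formed UTF-8 input and offsets that fall on code-point boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Length in bytes of the UTF-8 sequence starting with lead byte b.
// Assumes b is a valid lead byte.
inline std::size_t utf8SeqLen(unsigned char b) {
  if (b < 0x80) return 1;
  if ((b >> 5) == 0x6) return 2;
  if ((b >> 4) == 0xE) return 3;
  return 4;
}

// Converts a UTF-16 code-unit offset (what java.util.regex.Matcher uses)
// into a byte offset in the UTF-8 input.
inline std::size_t utf16OffsetToByteOffset(std::string_view utf8,
                                           std::size_t charOffset) {
  std::size_t byte = 0;
  std::size_t units = 0;
  while (byte < utf8.size() && units < charOffset) {
    const std::size_t len = utf8SeqLen(static_cast<unsigned char>(utf8[byte]));
    units += (len == 4) ? 2 : 1;  // supplementary chars are surrogate pairs
    byte += len;
  }
  return byte;
}

// Converts a byte offset in the UTF-8 input into a UTF-16 code-unit offset.
inline std::size_t byteOffsetToUtf16Offset(std::string_view utf8,
                                           std::size_t byteOffset) {
  assert(byteOffset <= utf8.size());
  std::size_t byte = 0;
  std::size_t units = 0;
  while (byte < byteOffset) {
    const std::size_t len = utf8SeqLen(static_cast<unsigned char>(utf8[byte]));
    units += (len == 4) ? 2 : 1;
    byte += len;
  }
  return units;
}
```

For an input made of a 1-byte, a 2-byte, a 3-byte, and a 4-byte character followed by `z`, the 4-byte character sits at UTF-16 offsets 3 to 5 but at bytes 6 to 10, so the two coordinate systems drift apart as soon as the input leaves ASCII.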

Test result (in-tree, JNI ON):
  [==========] 189 tests from 15 test suites ran. (97 ms total)
  [  PASSED  ] 189 tests.

  Breakdown:
    - 11 Re2RegexTest        (RE2 backend specifics)
    - 12 Pcre2RegexTest      (PCRE2 specifics incl. lookahead/backref)
    - 13 JavaRegexTest       (Java specifics incl. \p{InGreek})
    - 39 BackendTest         (13 typed x 3 backends, core API)
    - 39 PatternPortedTest   (13 typed x 3 backends, pcre4j Pattern)
    - 42 MatchingPortedTest  (14 typed x 3 backends, pcre4j Matcher.find)
    - 33 ReplacementPortedTest (11 typed x 3 backends, pcre4j replace*)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ally

Ports the 15 remaining cases from pcre4j MatcherMatchingTests.java
(region-bounded find/matches/lookingAt, multiple-find-in-region,
empty/unmatched group, lookaround, zero-width edges).  Also extends
JavaMatcherAdapter with region(int,int) and honors region bounds in
find/matches/lookingAt.

Test methodology change per user requirement: tests now assert
Java-canonical behaviour with no `if constexpr` per-backend
branching.  Backends that diverge from Java just fail; failures are
the *data*, not bugs.

New TestMain reports a per-backend compatibility rate at end of run:

  ========== Per-backend compatibility rate ==========
    JavaRegex (typed)  67/67  (100%)         ← ground truth
    Pcre2Regex (typed) 66/67  (98.5%)        ← braceQuantifierIncomplete
    Re2Regex  (typed)  63/67  (94.0%)        ← lookaround x2 + brace + zw-region
  ====================================================

The 5 FAIL cases are real engine differences:
  * braceQuantifierIncomplete — Java rejects 'a{'; PCRE2/RE2 accept literally
  * positiveLookaround / positiveUnmatchedLookaround — RE2 lacks lookaround
  * findWithZeroWidthMatchExhaustsRegion — RE2/PCRE2 $ ignores region end

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add appendReplacement/appendTail/quoteReplacement to JavaMatcherAdapter.
- Port 24 cases from pcre4j MatcherReplacementTests.java covering
  quoteReplacement, replaceAll, replaceFirst, appendReplacement walks,
  named groups, escaped chars, error cases.
- Fix JavaRegex.toJString/fromJString to transcode through UTF-16 instead
  of using NewStringUTF/GetStringUTFChars (modified-UTF-8), which
  mis-encoded 4-byte UTF-8 sequences (supplementary chars). Round-trips
  emoji like U+1F310 / U+1F30D correctly now.
- Add Java=100% guard in TestMain that loudly warns on stderr if the
  Java backend ever drops below 100%.
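The transcoding step behind the toJString fix can be sketched as a plain UTF-8 to UTF-16 decoder. `utf8ToUtf16` is a hypothetical name and the sketch does no validation (it assumes well-formed UTF-8); the resulting code units are the kind of buffer that JNI's `NewString` accepts, in contrast to `NewStringUTF`, which expects modified UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Decodes well-formed UTF-8 into UTF-16 code units.
// Supplementary code points (4-byte UTF-8) become surrogate pairs.
inline std::u16string utf8ToUtf16(std::string_view in) {
  std::u16string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char b0 = static_cast<unsigned char>(in[i]);
    char32_t cp = 0;
    std::size_t len = 1;
    if (b0 < 0x80) {
      cp = b0;
    } else if ((b0 >> 5) == 0x6) {
      cp = b0 & 0x1F;
      len = 2;
    } else if ((b0 >> 4) == 0xE) {
      cp = b0 & 0x0F;
      len = 3;
    } else {
      cp = b0 & 0x07;
      len = 4;
    }
    // Fold in the continuation bytes, 6 payload bits each.
    for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
      cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
    }
    i += len;
    if (cp >= 0x10000) {
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    } else {
      out.push_back(static_cast<char16_t>(cp));
    }
  }
  return out;
}
```

U+1F310, encoded in UTF-8 as F0 9F 8C 90, comes out as the surrogate pair D83C DF10, which is what a Java String stores for that character.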

Current matrix:
  JavaRegex  94/94 (100%)
  Pcre2Regex 93/94 (98.9%) — brace strictness
  Re2Regex   90/94 (95.7%) — lookaround + brace + zero-width region

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Java's Matcher.results() returns a Stream<MatchResult>; modelled here as
a find()-loop that snapshots (start, end, group) per match.  Skipped 1
case (resultsStreamOperations) that only exercises Java's stream API.

Matrix:
  Java  106/106 (100%)
  Pcre2 105/106 (99.06%)
  Re2   101/106 (95.28%) — +1 fail on resultsZeroWidthMatches (lookahead)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…f 9)

Most MatcherMatchResultTests.java cases exercise Java's MatchResult
snapshot semantics (immutability, IllegalStateException/IOOBE/IAE
contracts, namedGroups() map equality, hasMatch() flag) — pure
Java API-contract tests, no regex-engine signal.  Ported only the
two cases that exercise engine behavior not already covered:

  * matchResultByGroupNumber       — 3 sibling captures, full
                                     groupCount/group(i)/start(i)/end(i)
                                     sweep.
  * matchResultNamedGroupAccessors — 3 named groups in a date pattern.

Adapter: add start(name)/end(name) overloads to mirror group(name).

Matrix:
  Java  108/108 (100%)
  Pcre2 107/108 (99.07%)
  Re2   103/108 (95.37%)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Java Pattern.split implemented as a free helper that drives the
backend's find() loop via JavaMatcherAdapter, so any engine differences
in find/match propagate naturally into split output.
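A find()-loop split in the style of Java's Pattern.split(input) with the default limit can be sketched as follows. This uses std::regex as a stand-in engine purely for illustration (the PR drives its own backends through JavaMatcherAdapter), and `javaStyleSplit` is a hypothetical name. std::regex and Java can differ on how iteration proceeds after zero-width matches, so the sketch is only meant for ordinary delimiters.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Splits input around matches of re, following Java's limit == 0 rules:
// no leading empty string for a zero-width match at position 0, and
// trailing empty strings removed.
inline std::vector<std::string> javaStyleSplit(const std::string& input,
                                               const std::regex& re) {
  std::vector<std::string> parts;
  std::size_t index = 0;  // start of the piece currently being collected
  for (std::sregex_iterator it(input.begin(), input.end(), re), end;
       it != end; ++it) {
    const std::size_t start = static_cast<std::size_t>(it->position(0));
    const std::size_t len = static_cast<std::size_t>(it->length(0));
    if (index == 0 && start == 0 && len == 0) continue;
    parts.push_back(input.substr(index, start - index));
    index = start + len;
  }
  if (index == 0) return {input};  // no usable match: whole input
  parts.push_back(input.substr(index));
  while (!parts.empty() && parts.back().empty()) parts.pop_back();
  return parts;
}
```

Because the pieces are cut at whatever positions the engine's find loop reports, any difference between engines in find behaviour shows up directly in the split result.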

Skipped:
  * splitWithDelimiters* (3 cases) — Java 21+ API, not in our embedded
    JDK 17.

Matrix:
  Java  121/121 (100%)
  Pcre2 120/121 (99.17%)
  Re2   116/121 (95.87%)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Multi-byte UTF-8 literals (1/2/3/4-byte chars) + combined + region.
Offsets translated from Java UTF-16 char positions to UTF-8 byte
positions (e.g., region(3,5) over the surrogate pair for 🌍 →
byte region(7,11)).

Matrix:
  Java  127/127 (100%)
  Pcre2 126/127 (99.21%)
  Re2   122/127 (96.06%)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aining)

Added 6 PatternTests.java cases:
  * quote                              — Pattern.quote + \Q...\E round-trip
  * commentsWhitespaceIgnored          — (?x) ignores unescaped spaces
  * commentsHashComments               — (?x) # to end-of-line is a comment
  * commentsEscapedWhitespace          — \ <space> matched literally
  * commentsWhitespaceInCharacterClass — [\ ] matches literal space
  * commentsEmbeddedFlag               — (?x) at start of pattern

PatternTests.java cases intentionally NOT ported (out of engine-compat scope):
  * toStringReturnsPattern, asPredicate*/asMatchPredicate*/splitAsStream
    — Java functional/utility API only, no engine behavior.
  * canonEq* (24 cases)
    — CANON_EQ flag explicitly out of scope per spec section 1.2.
  * unicodeCase*/flag-introspection (~7)
    — Java's UNICODE_CASE/UNICODE_CHARACTER_CLASS/UNIX_LINES need
      backend-specific flag plumbing (RE2/PCRE2 don't have inline-flag
      equivalents with the same semantics); skip rather than build a
      partial framework that can't represent the divergence faithfully.
  * matchesStatic* — duplicate of existing FullMatch coverage.
  * translatorThrownPatternSyntaxException — pcre4j internal.

Matrix:
  Java  133/133 (100%)
  Pcre2 132/133 (99.25%) — brace strictness
  Re2   123/133 (92.48%) — lookaround, brace, zero-width region, 4 (?x) sub-cases

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fetches OpenJDK 17 (jdk-17.0.13-ga) regex test corpus at CMake configure
time (SHA256 pinned), parses it mirror-faithfully to OpenJDK's own
RegExTest.processFile / grabLine, and reports per-backend compat rate.

Comparison approach: rebuild the result string in the exact shape
OpenJDK's RegExTest produces ("true <g0> <gc> <g1> <g2>..." or
"false <gc>") and string-equal it against the corpus expected line.
This sidesteps ambiguous whitespace-in-match-text parsing.

processEscapes mirrors OpenJDK grabLine: only \n and \uXXXX are
processed (everything else passes through as a regex meta-escape).
Surrogate pairs \uD8##\uDC## are combined into the supplementary
code point and encoded as 4-byte UTF-8, so RE2/PCRE2 see valid UTF-8
and JavaRegex's JNI bridge re-splits to a surrogate pair.
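The escape handling described above can be sketched as a single pass over the line. This is an illustrative reimplementation, not the harness code: the function names are hypothetical, and the sketch assumes surrogate escapes arrive in well-formed pairs (a lone surrogate is encoded as-is, without validation).

```cpp
#include <cstddef>
#include <string>
#include <string_view>

namespace detail {
// Appends code point cp to out as UTF-8.
inline void appendUtf8(std::string& out, char32_t cp) {
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Parses "\uXXXX" starting at pos; returns false if it is not there.
inline bool parseUnicodeEscape(std::string_view s, std::size_t pos,
                               char32_t& value) {
  if (pos + 6 > s.size() || s[pos] != '\\' || s[pos + 1] != 'u') return false;
  value = 0;
  for (std::size_t k = 2; k < 6; ++k) {
    const char c = s[pos + k];
    int digit;
    if (c >= '0' && c <= '9') digit = c - '0';
    else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
    else return false;
    value = (value << 4) | static_cast<char32_t>(digit);
  }
  return true;
}
}  // namespace detail

// Processes only \n and \uXXXX; every other escape passes through so the
// regex engine sees it unchanged.
inline std::string processEscapes(std::string_view line) {
  std::string out;
  std::size_t i = 0;
  while (i < line.size()) {
    if (line[i] == '\\' && i + 1 < line.size() && line[i + 1] == 'n') {
      out.push_back('\n');
      i += 2;
      continue;
    }
    char32_t cp = 0;
    if (detail::parseUnicodeEscape(line, i, cp)) {
      i += 6;
      char32_t low = 0;
      // Combine a high + low surrogate pair into one supplementary code point.
      if (cp >= 0xD800 && cp <= 0xDBFF &&
          detail::parseUnicodeEscape(line, i, low) && low >= 0xDC00 &&
          low <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        i += 6;
      }
      detail::appendUtf8(out, cp);
      continue;
    }
    out.push_back(line[i]);
    ++i;
  }
  return out;
}
```

A line containing `\uD83C\uDF0D` therefore reaches RE2 and PCRE2 as the four UTF-8 bytes F0 9F 8C 8D, while something like `\d+` is left untouched.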

Patterns whose expected line begins with "error" map to compile-error
in our backends → counted as PASS (Java agrees they're invalid).

OpenJDK corpus matrix:
  Java   299/299 (100%)
  Pcre2  217/299 (72.58%)  [36 compile-err on Java-specific syntax]
  Re2    170/299 (56.86%)  [83 compile-err: lookaround, backrefs, ...]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OpenJDK's RegExTest.processFile runs three corpus files; we were only
loading TestCases.txt.  Now loads & reports BMPTestCases.txt and
SupplementaryTestCases.txt as well (per-file rates + per-backend
aggregate).

Per-file breakdown:
  Java|TestCases.txt               299/299 (100.00%)
  Java|BMPTestCases.txt            222/222 (100.00%)
  Java|SupplementaryTestCases.txt  339/339 (100.00%)
  Pcre2|TestCases.txt              217/299 (72.58%)
  Pcre2|BMPTestCases.txt           154/222 (69.37%)
  Pcre2|SupplementaryTestCases.txt 176/339 (51.92%)
  Re2|TestCases.txt                170/299 (56.86%)
  Re2|BMPTestCases.txt             126/222 (56.76%)
  Re2|SupplementaryTestCases.txt   232/339 (68.44%)

Aggregate (all 860 cases × 3 backends):
  Java  860/860 (100.00%)
  Pcre2 547/860 (63.60%)
  Re2   528/860 (61.40%)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New library velox/functions/lib/java_pcre2_translator/ (will host the
C++ port of pcre4j PR alexey-pelykh/pcre4j#606's org.pcre4j.regex.translate module).
This commit adds:

  * CMakeLists.txt registering velox_java_pcre2_translator
  * JavaRegexTranslator.{h,cpp} — public surface declaration plus a
    Phase-1 identity-passthrough implementation of toPcre2Pattern
  * EvaluationFailedException.h
  * LICENSE-NOTICE.md documenting the pcre4j → Velox re-licensing
  * tests/JavaRegexTranslatorTest.cpp — smoke test (identity + ctor)
  * Wires the new subdir into velox/functions/lib/CMakeLists.txt

No behavior change: regex-compat corpus rates unchanged
  Java 860/860 (100.00%) | Pcre2 547/860 (63.60%) | Re2 528/860 (61.40%)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1:1 port of org.pcre4j.regex.translate.RangeSet (257 LOC Java →
~210 LOC C++).  Inlined a private emitLiteralInClass helper to avoid
a circular dep on the (yet to be ported) ClassRenderer module; the
duplication will go away (or be intentional) when Phase 4 lands.

Tests ported 1:1 from RangeSetTest.java (212 LOC → 23 GTest cases).
All 23 pass.
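The range-set algebra that makes Java's character-class intersection translatable can be sketched in a few lines. This is not the ported RangeSet class: the type aliases and function names below are hypothetical, and ranges are assumed to be sorted, non-overlapping, and inclusive.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Inclusive code-point range [first, second].
using Range = std::pair<char32_t, char32_t>;
// Sorted, non-overlapping list of ranges.
using Ranges = std::vector<Range>;

// Intersection of two range lists with a two-pointer sweep.
inline Ranges intersect(const Ranges& a, const Ranges& b) {
  Ranges out;
  std::size_t i = 0;
  std::size_t j = 0;
  while (i < a.size() && j < b.size()) {
    const char32_t lo = std::max(a[i].first, b[j].first);
    const char32_t hi = std::min(a[i].second, b[j].second);
    if (lo <= hi) out.emplace_back(lo, hi);
    // Advance whichever range ends first.
    if (a[i].second < b[j].second) {
      ++i;
    } else {
      ++j;
    }
  }
  return out;
}

// Complement within [0, maxCp].
inline Ranges complement(const Ranges& a, char32_t maxCp = 0x10FFFF) {
  Ranges out;
  char32_t next = 0;  // first code point not yet covered
  for (const Range& r : a) {
    if (r.first > next) out.emplace_back(next, r.first - 1);
    if (r.second >= maxCp) return out;
    next = r.second + 1;
  }
  out.emplace_back(next, maxCp);
  return out;
}
```

For example, Java's `[a-z&&[^aeiou]]` is `intersect({a-z}, complement({a, e, i, o, u}))`, which resolves to the five plain ranges b-d, f-h, j-n, p-t, and v-z that an engine without intersection syntax can accept as an ordinary class.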

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1:1 port of org.pcre4j.regex.translate.PropertyMap (185 LOC Java →
~190 LOC C++).  Static lookup table from Java property names to PCRE2
equivalents (\p{javaXxx}, \p{InGreek}, \p{IsL}, posix Lower/Upper/...).

Tests ported 1:1 from PropertyMapTest.java (68 LOC → 9 GTest cases).
All pass.

NOTE: JdkPropertyExpander (the other Phase-3 file in the original plan,
277 LOC) requires a full-Unicode codepoint scan via JDK Character API.
It's deferred to Phase 4 where it plugs into the Evaluator.  Either an
ICU-based implementation or a pre-generated static table will land
together with the AST/parser/evaluator work.

Suite totals: 34 tests (2 scaffolding + 23 RangeSet + 9 PropertyMap).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…+ JdkPropertyExpander

Port the char-class AST, recursive parser, evaluator, renderer, and JdkPropertyExpander from pcre4j Java to Velox C++ without wiring to toPcre2Pattern.

JdkPropertyExpander uses Velox's existing ICU dependency (u_charType/uscript_getScript/ublock_getCode/binary properties) rather than generated tables or a stub.

Validation: velox_java_pcre2_translator_test passes 121/121; regex compat Java corpus remains 860/860 (100.00%).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port the final top-level Java regex translation pipeline: property rewrites, character-class parsing/rendering, Java inline-flag normalization, escape handling, and conservative EvaluationFailedException paths for cases PCRE2 cannot safely express.

Add translator pipeline coverage and regressions for block-qualified properties, comments-mode braces, flag scoping, cased-property folding, counted closures, long backreferences, and surrogate-pair unicode escapes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pcre2Regex constructor now translates the incoming Java regex pattern
to PCRE2 syntax via java_pcre2_translator::toPcre2Pattern before
calling pcre2_compile.  EvaluationFailedException is caught and the
message surfaced verbatim via error_ (ok()=false).

OpenJDK corpus impact:
  Pcre2  547/860 (63.60%) → 762/860 (88.60%)   +25 pp
  Pcre2 compile-errs 165 → 76                  -89

Java backend unchanged at 860/860 (100%); Re2 unchanged at 528/860
(61.40%) — RE2 wiring is Phase 7.

The 76 remaining Pcre2 compile-errors are corpus patterns using
features the translator still passes through verbatim (e.g.,
character classes with intersection operands that include script
ranges PCRE2 doesn't recognise, or edge cases in posix class
expansions).  These will be addressed incrementally.
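
The translate-then-compile flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `SketchRegex`, `translateForSketch`, and the `<unsupported>` marker are hypothetical stand-ins, and the real wrapper would call `pcre2_compile` where the comment indicates.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the translator's exception type.
struct EvaluationFailedException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for java_pcre2_translator::toPcre2Pattern. It rejects
// an artificial marker so the failure path can be exercised without PCRE2.
inline std::string translateForSketch(const std::string& javaPattern) {
  if (javaPattern.find("<unsupported>") != std::string::npos) {
    throw EvaluationFailedException("cannot translate: <unsupported>");
  }
  return javaPattern;  // pass-through for everything else
}

class SketchRegex {
 public:
  explicit SketchRegex(const std::string& javaPattern) {
    try {
      translated_ = translateForSketch(javaPattern);
      // A real wrapper would hand translated_ to pcre2_compile here.
      ok_ = true;
    } catch (const EvaluationFailedException& e) {
      error_ = e.what();  // message kept verbatim; ok() stays false
    }
  }

  bool ok() const { return ok_; }
  const std::string& error() const { return error_; }
  const std::string& translated() const { return translated_; }

 private:
  bool ok_ = false;
  std::string translated_;
  std::string error_;
};
```

Keeping the failure in a member rather than rethrowing means callers can treat a translator rejection the same way as a compile error: check `ok()` and read `error()`.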

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add toRe2Pattern using the Java regex translator pipeline for Unicode property and character-class normalization, Java named-group rewriting, Java COMMENTS-mode translation, and RE2 octal escape rendering. Detect unsupported RE2 features up front, including lookaround, backreferences, possessive quantifiers, atomic groups, and Java flags without RE2 equivalents.

OpenJDK corpus:
  Java   860/860 (100.00%)
  Pcre2  762/860 (88.60%)  (unchanged from Phase 6)
  Re2    755/860 (87.79%)  (was 528/860 / 61.40%)
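
As an illustration of what "detect unsupported RE2 features up front" could look like, here is a deliberately simplified scanner. It is a sketch under stated assumptions, not the PR's `toRe2Pattern`: the function name and reason strings are hypothetical, and it ignores `\Q...\E` quoting, nested character classes, and Java inline flags.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Returns a short reason if the pattern uses a construct RE2 lacks, or
// std::nullopt if none of the checked constructs appear.
// Known gap: a literal '}' followed by '+' is misreported as possessive.
std::optional<std::string> findUnsupportedForRe2(std::string_view p) {
  bool inClass = false;
  bool prevQuantifier = false;
  for (std::size_t i = 0; i < p.size(); ++i) {
    const char c = p[i];
    const bool afterQuantifier = prevQuantifier;
    prevQuantifier = false;

    if (c == '\\') {
      if (!inClass && i + 1 < p.size()) {
        const char next = p[i + 1];
        const bool numbered = next >= '1' && next <= '9';
        const bool named = next == 'k' && i + 2 < p.size() && p[i + 2] == '<';
        if (numbered || named) return std::string("backreference");
      }
      ++i;  // skip the escaped character
      continue;
    }
    if (inClass) {
      if (c == ']') inClass = false;
      continue;
    }
    if (c == '[') {
      inClass = true;
      continue;
    }
    if (c == '(' && i + 1 < p.size() && p[i + 1] == '?') {
      const std::string_view rest = p.substr(i + 2);
      if (rest.substr(0, 1) == "=" || rest.substr(0, 1) == "!" ||
          rest.substr(0, 2) == "<=" || rest.substr(0, 2) == "<!") {
        return std::string("lookaround");
      }
      if (rest.substr(0, 1) == ">") return std::string("atomic group");
      ++i;  // the '?' after '(' is group syntax, not a quantifier
      continue;
    }
    if (c == '+' && afterQuantifier) {
      return std::string("possessive quantifier");
    }
    prevQuantifier = (c == '*' || c == '+' || c == '?' || c == '}');
  }
  return std::nullopt;
}
```

Rejecting these constructs before compilation lets the backend report a translator-level reason instead of whatever parse error the engine would produce.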

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
copilot and others added 10 commits June 1, 2026 17:25
Expose a translator side channel for raw surrogate byte mode and rewrite surrogate escape aliases to byte-sequence PCRE2 before compiling without PCRE2_UTF.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port Java property and Unicode block materialization from the pcre4j fork, including surrogate block sentinel handling and class parser/renderer updates. Adjust the PCRE2 backend for Java default shorthand semantics and raw surrogate UTF-8 matching used by the OpenJDK corpus.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gate failures

Brings PCRE2 OpenJDK corpus to 860/860 (100.00%), matching Java.

Key changes:
* translator: lone-surrogate property tokens (\p{InHIGH_SURROGATES},
  \p{InLOW_SURROGATES}, surrogate codepoints in \x{...}) are now translated to
  raw-byte CESU-8 sequences in PCRE2 raw-byte mode
* Pcre2Regex: when raw-byte mode is selected, char-class ranges that include
  surrogates are rewritten to a byte-sequence regex matching the CESU-8 encoding
* PropertyMap / ClassRenderer / RangeSet: minor tweaks to support the
  surrogate handling path
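
For readers unfamiliar with CESU-8, the byte sequences mentioned above follow from applying the ordinary three-byte UTF-8 layout to each UTF-16 surrogate code unit. The sketch below shows that arithmetic only; the helper names are hypothetical, and whether the PR emits `\xHH` escapes or literal bytes is not stated in the commit message.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Encodes one UTF-16 surrogate code unit (U+D800..U+DFFF) using the generic
// three-byte layout 1110xxxx 10xxxxxx 10xxxxxx, which is what CESU-8 does.
std::array<std::uint8_t, 3> encodeSurrogate(std::uint16_t unit) {
  assert(unit >= 0xD800 && unit <= 0xDFFF);
  return {static_cast<std::uint8_t>(0xE0 | (unit >> 12)),
          static_cast<std::uint8_t>(0x80 | ((unit >> 6) & 0x3F)),
          static_cast<std::uint8_t>(0x80 | (unit & 0x3F))};
}

// Renders the three bytes as \xHH escapes, one possible way to spell a
// byte sequence in a pattern compiled without UTF mode.
std::string surrogateAsHexEscapes(std::uint16_t unit) {
  static const char kHex[] = "0123456789ABCDEF";
  std::string out;
  for (std::uint8_t byte : encodeSurrogate(unit)) {
    out += "\\x";
    out += kHex[byte >> 4];
    out += kHex[byte & 0x0F];
  }
  return out;
}
```

For example, U+D800 encodes to the bytes ED A0 80, and the surrogate pair D83D DE00 (U+1F600) encodes to ED A0 BD ED B8 80.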

Validation:
  velox_java_pcre2_translator_test:  174/174
  velox_regex_compat_test:           445 passed, 4 expected skips
  OpenJDK corpus:
    Java   860/860 (100.00%)
    Pcre2  860/860 (100.00%)  ← was 854/860
    Re2    757/860 ( 88.02%)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port 80 active RegExTest bodies to the typed regex-compat suite and add explicit TODO skips for Java-specific or adapter-limited cases. The port records per-backend compatibility rates while preserving Java as the canonical pass requirement.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After P1.8 added 124 RegExTest cases with 28 GTEST_SKIP entries
(Java-API-only behaviour with no C++ adapter equivalent), the
per-backend tally listener counted skipped tests as not-passing, which
made the Java backend appear to drop from 100% to 89% (231/259).

Treat Skipped() as neither pass nor fail.  Skipped count is reported
separately for visibility:

    JavaRegex (typed)  231 / 231 (100%)  [skipped: 28]
    Pcre2Regex(typed)  231 / 231 (100%)  [skipped: 28]
    Re2Regex  (typed)  227 / 227 (100%)  [skipped: 32]

OpenJDK corpus + RegExTest separate-tally blocks (which use their
own reporters, not GTest pass/fail) are unaffected.
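
The counting rule above (skipped tests leave both numerator and denominator untouched) can be sketched as below. The types and the output format are hypothetical and only approximate the report lines shown; the real listener hooks into GTest events, which this sketch omits.

```cpp
#include <cstdio>
#include <string>

enum class Outcome { kPassed, kFailed, kSkipped };

// Hypothetical per-backend tally: skipped cases are tracked separately and
// never enter the pass-rate denominator.
struct BackendTally {
  int passed = 0;
  int failed = 0;
  int skipped = 0;

  void record(Outcome outcome) {
    switch (outcome) {
      case Outcome::kPassed: ++passed; break;
      case Outcome::kFailed: ++failed; break;
      case Outcome::kSkipped: ++skipped; break;
    }
  }

  int counted() const { return passed + failed; }

  double passRatePercent() const {
    return counted() == 0 ? 0.0 : 100.0 * passed / counted();
  }
};

std::string formatTally(const std::string& name, const BackendTally& t) {
  char buf[160];
  std::snprintf(buf, sizeof(buf), "%s %d / %d (%.2f%%) [skipped: %d]",
                name.c_str(), t.passed, t.counted(), t.passRatePercent(),
                t.skipped);
  return buf;
}
```

With 231 passes and 28 skips, the denominator is 231 rather than 259, so the rate stays at 100% instead of dropping to roughly 89%.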

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ssible patterns

When a corpus / RegExTest pattern uses features the target engine cannot
support natively, the translator throws EvaluationFailedException and the
backend marks the pattern as not-ok with an error prefix
"Java→RE2 translator: ...".  For RE2 these features are lookaround,
backreferences, possessive quantifiers, and atomic groups; PCRE2 has none
in this category, so its translator-rejected list is empty.
These are engine ceilings, not implementation bugs.

Both OpenJDK corpus and RegExTest reports now print an additional
"translatable subset" line that excludes such patterns:

  OpenJDK corpus:
    Re2                          757 / 860 (88.02%)  [compile-err: 99]
    Re2 (translatable subset)    757 / 771 (98.18%)  [excludes 89 translator-rejected]

  RegExTest ported:
    Re2                          84 / 96 (87.50%)
    Re2 (translatable subset)    84 / 93 (90.32%)  [excludes 3 translator-rejected]
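
The arithmetic behind the "translatable subset" line is: keep the pass count, and shrink the denominator by the number of translator-rejected patterns (860 - 89 = 771 for the corpus). A minimal sketch, with hypothetical type and function names, classifying failures by error-string prefix:

```cpp
#include <string>
#include <string_view>
#include <vector>

struct CaseResult {
  bool passed;
  std::string error;  // empty when the pattern compiled and matched
};

struct SubsetReport {
  int passed = 0;
  int total = 0;
  int translatorRejected = 0;
  int subsetTotal() const { return total - translatorRejected; }
};

inline double percent(int passed, int total) {
  return total == 0 ? 0.0 : 100.0 * passed / total;
}

// A failure counts as translator-rejected when its error starts with the
// given prefix; every other failure stays in the subset denominator.
inline bool isTranslatorRejected(std::string_view error,
                                 std::string_view prefix) {
  return error.substr(0, prefix.size()) == prefix;
}

SubsetReport summarize(const std::vector<CaseResult>& results,
                       std::string_view prefix) {
  SubsetReport report;
  for (const CaseResult& result : results) {
    ++report.total;
    if (result.passed) {
      ++report.passed;
    } else if (isTranslatorRejected(result.error, prefix)) {
      ++report.translatorRejected;
    }
  }
  return report;
}
```

Plugging in the reported corpus numbers, `percent(757, 860)` is about 88.02 and `percent(757, 771)` is about 98.18, matching the two Re2 lines above.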

Implementation:
- OpenJdkCorpusDiffTest tracks translatorRejected based on the
  Re2Regex/Pcre2Regex error_ string prefix.
- RegExTestPortedTest adds a thread_local flag tlsTranslatorRejected
  set by a notePatternStatus helper called from every find/match
  helper; the PORTED_REGEX_TEST macro resets it before each test
  body and recordCase consumes it afterwards.
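
The reset-before / consume-after pattern for the thread-local flag might look like the sketch below. Only the names `tlsTranslatorRejected` and `notePatternStatus` come from the commit message; the `bool` parameter, the sticky "once set, stays set" behaviour, and `runCaseAndConsumeFlag` are assumptions made for illustration.

```cpp
#include <utility>

// One flag per thread, so parallel test threads do not see each other's state.
thread_local bool tlsTranslatorRejected = false;

// Assumed signature: each find/match helper reports whether the pattern it
// just compiled was rejected by the translator. The flag is sticky within a
// test body, so one rejected pattern marks the whole case.
void notePatternStatus(bool translatorRejected) {
  tlsTranslatorRejected = tlsTranslatorRejected || translatorRejected;
}

// Hypothetical wrapper standing in for the macro plus recordCase: reset the
// flag, run the body, then read and clear the flag in a single step.
template <typename Body>
bool runCaseAndConsumeFlag(Body&& body) {
  tlsTranslatorRejected = false;
  std::forward<Body>(body)();
  return std::exchange(tlsTranslatorRejected, false);
}
```

Resetting before the body and clearing on read both guard against a rejected pattern in one test leaking into the next test's tally.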

Java / Pcre2 corpus remain 100.00%; Re2 raw rate unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pply pre-commit formatting

CI on PR facebookincubator#17685 was failing on:

* GCC 14 -Werror=dangling-pointer in ClassBodyParserTest:
  auto* inter = parse("...").getIf<Intersection>();   // dangling
  The parse result is now bound to a named local first.

* pre-commit: license-header / clang-format / gersemi
  Ran 'pre-commit run --files ...' over the touched files; only formatting
  changes (no behavioural diff).  Verified:
    velox_java_pcre2_translator_test:  178/178 passing
    OpenJDK corpus:  Java/Pcre2 860/860, Re2 757/860 (98.18% translatable subset)
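
To illustrate the dangling-pointer fix mentioned in the first bullet: a pointer obtained from a member of a temporary outlives the temporary, so the result has to be held in a named local. The `Node`, `Literal`, and `parse` definitions below are hypothetical stand-ins for the real parser types.

```cpp
#include <string>
#include <utility>
#include <variant>

struct Literal { char value; };
struct Intersection { int operandCount; };

// Hypothetical AST node whose getIf returns a pointer into its own storage.
class Node {
 public:
  explicit Node(std::variant<Literal, Intersection> value)
      : value_(std::move(value)) {}

  template <typename T>
  const T* getIf() const {
    return std::get_if<T>(&value_);
  }

 private:
  std::variant<Literal, Intersection> value_;
};

// Hypothetical parser: treats any body containing "&&" as an intersection.
Node parse(const std::string& body) {
  if (body.find("&&") != std::string::npos) {
    return Node(Intersection{2});
  }
  return Node(Literal{body.empty() ? '\0' : body[0]});
}

int operandCountOf(const std::string& body) {
  // Dangling form, shown only as a comment:
  //   auto* inter = parse(body).getIf<Intersection>();
  // The temporary Node is destroyed at the end of that statement, so the
  // pointer would refer to freed storage.
  const Node node = parse(body);  // named local keeps the storage alive
  const Intersection* inter = node.getIf<Intersection>();
  return inter != nullptr ? inter->operandCount : 0;
}
```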

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@baibaichen baibaichen force-pushed the feat/regex-compat-pcre2-cmake branch from 082c3b0 to d4fdfe6 Compare June 1, 2026 17:52
@mbasmanova mbasmanova requested review from amitkdutta and kgpai June 2, 2026 00:20
@mbasmanova mbasmanova left a comment

Thank you for the thorough proposal and for addressing each of the historical concerns from #9897 and #9823. The approach is reasonable — PCRE2 with built-in limits is a better answer than Oniguruma or Hyperscan, and the translator improving RE2's corpus pass-rate from 61% to 88% is valuable on its own.

However, 12K lines in a single PR is not reviewable. This contains three independent components that should be separate PRs:

  1. Java→PCRE2/RE2 translator (velox/functions/lib/java_pcre2_translator/) — builds unconditionally, benefits RE2 immediately, no new dependencies.
  2. PCRE2 engine wrapper (velox/external/regex_compat/) — the new engine, gated behind VELOX_ENABLE_PCRE2_BACKEND.
  3. Test harness (velox/external/regex_compat/tests/) — the JVM-oracle-based test framework.

Items 2 and 3 need broader discussion before we commit to them. Adding a second regex engine is a significant maintenance burden — two code paths to test, debug, and keep compatible. The opt-in CMake flag means CI either tests both paths (doubling regex test matrix) or only tests RE2 (leaving PCRE2 undertested). We'd like to discuss the long-term ownership and maintenance plan before merging the engine itself.

Please split accordingly. The translator PR alone would be a significant contribution and can be reviewed independently. The PCRE2 wrapper and test harness can follow once the translator lands.

For the translator PR: please add documentation (README or header-level docs) explaining what transformations it performs, what it doesn't handle, and the testing strategy. The 178 unit tests are there, but a reader needs to understand what coverage they provide — are these ported 1:1 from pcre4j? Do they cover the transformations that matter most for Spark/Presto patterns?

Also: before the translator PR, we'd like to understand the integration plan. How will existing Velox functions (regexp_extract, regexp_replace, regexp_like, like, Spark split) use the translator and/or PCRE2? Is this a per-function opt-in, a global config, or a per-query setting? The translator library is useful, but the review needs to understand how it connects to the rest of the system.

copilot and others added 2 commits June 2, 2026 04:04
CI 'Build with GCC / Linux release with adapters' was failing because:
* clang-tidy scans every diff-touched header in isolation; JavaRegex.h /
  JvmFixture.h unconditionally include <jni.h>, which is absent on hosts
  without a JDK.
* RegExTestPortedTest.cpp's PORTED_REGEX_TEST macro and
  OpenJdkCorpusDiffTest.cpp's Grapheme test both reference the JavaRegex
  symbol unconditionally; when JAVA_BACKEND is OFF the symbol is not
  declared and compilation fails.

Fix:
* Wrap the bodies of JavaRegex.h and JvmFixture.h in
  #if VELOX_REGEX_COMPAT_HAS_JAVA so they compile to nothing when the
  Java backend is off.
* In RegExTestPortedTest.cpp, gate the 'Java backend regression' EXPECT
  in PORTED_REGEX_TEST behind the same macro.
* In OpenJdkCorpusDiffTest.cpp, gate toJString / directJavaGraphemeBreakOffsets
  helpers and every 'is_same_v<TypeParam, JavaRegex>' branch behind the
  macro.

Verified both build paths locally:
  -DVELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=ON  -> regex_compat_test passes
                                                  (Java/Pcre2 860/860, Re2 757/860)
  -DVELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=OFF -> builds clean
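
The guarding scheme in the Fix list can be sketched as follows. The macro name comes from the commit message; the default of 0, the empty backend structs, and `backendKind` are assumptions for illustration only.

```cpp
#include <string>
#include <type_traits>

#ifndef VELOX_REGEX_COMPAT_HAS_JAVA
#define VELOX_REGEX_COMPAT_HAS_JAVA 0  // assumed default for this sketch only
#endif

#if VELOX_REGEX_COMPAT_HAS_JAVA
// Everything that needs <jni.h> lives inside the guard, so a host without a
// JDK never sees the include or the JavaRegex symbol.
#include <jni.h>
struct JavaRegex { JNIEnv* env = nullptr; };
#endif

// Stand-ins for the always-available backends.
struct Re2Regex {};
struct Pcre2Regex {};

template <typename Backend>
std::string backendKind() {
#if VELOX_REGEX_COMPAT_HAS_JAVA
  // The is_same_v branch must sit inside the guard too: naming JavaRegex
  // when it is not declared would fail to compile.
  if constexpr (std::is_same_v<Backend, JavaRegex>) {
    return "jvm-oracle";
  }
#endif
  return "native";
}
```

With the macro at 0 the guarded regions vanish during preprocessing, which is why tools that parse each file in isolation, such as clang-tidy, no longer need `<jni.h>`.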

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…_COMPAT_HAS_JAVA

Follow-up to previous commit: the .h files were guarded but the .cpp
files still unconditionally use JNI types (jint, jclass, ...) which
clang-tidy on hosts without a JDK cannot resolve, causing 'Build with
GCC / Linux release with adapters' to fail.

Wrap each .cpp body in #if VELOX_REGEX_COMPAT_HAS_JAVA / #endif so they
compile to an empty translation unit when the Java backend is off.

Verified locally:
  JAVA_BACKEND=ON: regex_compat_test passes (Java/Pcre2 860/860,
                   Re2 757/860 / 98.18% translatable subset)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

2 participants