feat: Add opt-in PCRE2 regex engine and Java-regex compatibility test suite #17685
baibaichen wants to merge 38 commits into
Adds:
* CMake/Findpcre2.cmake — system libpcre2-8 detection (CONFIG → pkg-config)
* CMake/resolve_dependency_modules/pcre2.cmake
— FetchContent fallback (PCRE2 10.45, 8-bit only, JIT on)
* VELOX_ENABLE_REGEX_COMPAT_TESTS option (default OFF)
* conditional velox_set_source/resolve_dependency(pcre2) wiring
Verified:
* default OFF config succeeds, no PCRE2 in build graph
* -DVELOX_ENABLE_REGEX_COMPAT_TESTS=ON fetches sha256
0e138387df7835d7403b8351e2226c1377da804e0737db0e071b48f07c9d12ee
and links libpcre2-8.a (965 KB)
This is the build-infrastructure layer of the design in
docs/superpowers/specs/2026-05-29-pcre2-cpp-test-suite-design.md.
The helper layer and ported tests follow in subsequent commits.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verified:
* curl + sha256sum on pcre2-10.47.tar.gz =
c08ae2388ef333e8403e670ad70c0a11f1eed021fd88308d7e02f596fcd9dc16
* FetchContent + build pcre2-8-static target succeeds
* libpcre2-8.a links cleanly (980 KB, JIT enabled)
* configure reports PACKAGE_VERSION='10.47'
Spec and plan docs updated in sync; fallback target on build
failure noted as 10.45.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…X_COMPAT_JAVA_BACKEND)
When VELOX_ENABLE_REGEX_COMPAT_TESTS=ON, also try find_package(JNI):
* found → log STATUS, enable an embedded-JVM Java backend in the
forthcoming regex-compat test suite (3rd backend / live oracle)
* not found → log WARNING and auto-disable, suite still builds with
PCRE2 + RE2 only
Default ON, fully gated under the (default-OFF) parent option, so stock
Velox builds are unaffected. This is the only place in upstream Velox
that touches JNI.
Verified:
* parent OFF → JNI probe never runs
* parent ON, JDK present → 'Regex-compat: enabling embedded-JVM Java
backend (JNI: /usr/lib/jvm/default-java/include;...)' appears
* parent ON, this option explicit OFF → probe skipped silently
Auto-degrade branch (JNI absent) is implemented per standard CMake
idiom but not exercised in this commit's verification (find_package
falls back to /usr/lib/jvm/* system paths even with bogus JAVA_HOME).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
First of three regex backends for the Java-regex compatibility test
suite (Option E: three parallel concrete classes, no virtual, names
shaped after re2::RE2 subset, semantics aligned with java.util.regex).
Files:
* velox/external/regex_compat/RegexTypes.h — shared Anchor + Options
* velox/external/regex_compat/Re2Regex.h — public class
* velox/external/regex_compat/Re2Regex.cpp — RE2 wrap, reuses Velox
prepareRegexpReplacePattern / prepareRegexpReplaceReplacement
(Re2Functions.h:402,422) for Java -> RE2 translation
* velox/external/regex_compat/CMakeLists.txt — opt-in static lib
* velox/external/regex_compat/tests/Re2RegexTest.cpp — 11 GTest cases
* velox/CMakeLists.txt — add_subdirectory under VELOX_ENABLE_REGEX_COMPAT_TESTS
Verified (logic-level): standalone compile + smoke-test outside the
full Velox build (system libre2 + stubbed prepare*) shows 17/18 PASS
on the subset that doesn't depend on Java-syntax translation; the 1
remaining case was a test-assertion bug (RE2 error text is 'invalid
perl operator: (?=' not 'lookahead'), already fixed in this commit.
Full in-tree CMake build of velox_regex_compat_test is unverified in
this session — the from-scratch Velox build (boost/folly/abseil...)
takes hours, exceeding session budget. Next session should:
cmake -S . -B build -GNinja -DVELOX_ENABLE_REGEX_COMPAT_TESTS=ON \
-DVELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=OFF
cmake --build build --target velox_regex_compat_test
ctest --test-dir build -R velox_regex_compat_test --output-on-failure
Pcre2Regex + JavaRegex backends follow in subsequent commits.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Second of three regex backends. Same Option E architecture as
Re2Regex: standalone concrete class, method names mirror re2::RE2,
input accepts Java java.util.regex syntax.
Files:
* Pcre2Regex.h / .cpp — backend using pcre2_compile_8/match_8 with JIT
* tests/Pcre2RegexTest.cpp — 12 GTest cases including lookahead +
backref demos (features PCRE2 supports but RE2 doesn't)
Key design choices:
* PCRE2_UTF + PCRE2_UCP for Unicode-aware semantics
* pcre2_jit_compile_8 with PCRE2_JIT_COMPLETE; falls back to interpreter
on platforms without JIT
* pcre2_substitute_8 with PCRE2_SUBSTITUTE_GLOBAL +
PCRE2_SUBSTITUTE_EXTENDED — natively handles Java $N / ${name}
replacement syntax, no translation layer needed
* Java pattern syntax (?<name>...) accepted natively by PCRE2; no
pre-flight scanner or translation layer in this class (Java-specific
features like \p{InGreek}, character-class intersection, or (?U)
flag will surface as test failures, documenting the need for a
future Java->PCRE2 translator — cf. pcre4j PR facebookincubator#606)
Test run (in-tree):
$ ctest --test-dir cmake-build-debug -R velox_regex_compat_test
[==========] 23 tests from 2 test suites ran. (2 ms total)
[ PASSED ] 23 tests.
- 11 Re2RegexTest (incl. javaNamedGroup + globalReplace via Velox
prepareRegexpReplacePattern / prepareRegexpReplaceReplacement)
- 12 Pcre2RegexTest (incl. lookaheadSupported + backrefSupported
demonstrating PCRE2's expanded language coverage)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Third and final regex backend. Uses JNI_CreateJavaVM to boot an
embedded JVM in the test process and drives java.util.regex.Pattern /
Matcher through JNI calls. Java is the canonical source of truth for
the Java-regex semantics that the other two backends approximate.
Files:
* JvmFixture.h / .cpp — process-singleton JVM owner; GTest
GlobalEnvironment registered via JvmFixture::Register() in main
* JavaRegex.h / .cpp — JavaRegex class; constructs by calling
Pattern.compile, runs find()/matches()/lookingAt()/replaceAll
via cached jmethodIDs
* tests/TestMain.cpp — entry point that registers the JVM fixture
when VELOX_REGEX_COMPAT_HAS_JAVA=1
* tests/JavaRegexTest.cpp — 13 GTest cases including
javaSpecificPropertyInLC (\p{InGreek}) which works natively in
Java but won't in PCRE2 without a future Java->PCRE2 translator
* CMakeLists.txt: conditional source list controlled by
VELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND (uses velox_compile_definitions
+ velox_include_directories — target_* fails on ALIAS targets)
Notable JNI details:
* Class refs (Pattern, Matcher, Map, Set, Iterator, Entry, Integer,
String) cached as NewGlobalRef
* jmethodIDs cached at first construction via std::call_once
* Pattern.namedGroups() (JDK 20+) probed optionally — falls back to
empty named map if unavailable
* Match() materialises input[0..endpos) to a Java String (Java's
Matcher needs a CharSequence; we'd need a custom CharSequence impl
to avoid the copy, not worth it for tests)
* GlobalReplace counts matches via find()-loop before calling
replaceAll — the public API doesn't expose the substitution count
Test run (in-tree, JNI ON):
$ cmake-build-debug/.../velox_regex_compat_test
[==========] 36 tests from 3 test suites ran. (193 ms total)
[ PASSED ] 36 tests.
- 11 Re2RegexTest (RE2 via Velox prepare* translation)
- 12 Pcre2RegexTest (PCRE2 native Java-syntax acceptance)
- 13 JavaRegexTest (java.util.regex via embedded JVM)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ase 7)
GTest TYPED_TEST_SUITE templated on a type list of all enabled
backends. One TYPED_TEST() declaration compiles into N test
instances at compile time (N = 2 if JNI off, 3 if JNI on), so a
single test body validates that all three engines deliver identical
behaviour for the same Java input.
Files:
* tests/BackendTestBase.h — AllBackends typelist + BackendTest fixture
* tests/BackendTypedTest.cpp — 13 typed cases covering compile, match
(unanchored/anchored), find, fullPartialMatch, globalReplace (both
$N and ${name}), case-insensitive, dotAll, multiline anchors,
empty-group sentinel
Also fixes a Re2Regex bug discovered by the new multiline typed test:
* Java MULTILINE doesn't map to any single RE2 Options bit; RE2's
`one_line=false` (the default) still requires inline `(?m)` to
make `^`/`$` match around \n. Re2Regex now prefixes `(?m)`
when `opt.oneLine == false`, mirroring Java's MULTILINE semantics
additively without affecting `.` or other metas.
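A minimal sketch of the prefixing rule described above. The `Options` struct and the `withMultilinePrefix` helper are hypothetical stand-ins for illustration, not the actual Re2Regex code; only the `(?m)` prefix and the `oneLine` field come from the commit message.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for the one Options field used here.
struct Options {
  bool oneLine = false;  // false: '^' and '$' should match around '\n'
};

// Returns the pattern text that would be handed to RE2. When oneLine is
// false, an inline (?m) flag is prepended; '.' and other metacharacters
// are left untouched because only the multiline flag is set.
std::string withMultilinePrefix(std::string_view pattern, const Options& opt) {
  std::string out;
  if (!opt.oneLine) {
    out += "(?m)";
  }
  out += pattern;
  // Postcondition: the prefix adds exactly four characters, or none.
  assert(out.size() == pattern.size() + (opt.oneLine ? 0u : 4u));
  return out;
}
```

With default options, `withMultilinePrefix("^a$", Options{})` yields `(?m)^a$`; with `oneLine = true` the pattern passes through unchanged.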
Test result:
[==========] 75 tests from 6 test suites ran. (86 ms total)
[ PASSED ] 75 tests.
- 11 Re2RegexTest (Re2-specific)
- 12 Pcre2RegexTest (PCRE2-specific incl. lookahead/backref)
- 13 JavaRegexTest (Java-specific incl. \p{InGreek})
- 39 BackendTest/* — 13 typed-cases x 3 backends, cross-checked
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds JavaMatcherAdapter<R> (test-only header) reconstructing the Java
Matcher state machine (find/group/start/end/replaceAll/replaceFirst)
on top of the backend's stateless Match() API. Then ports
representative cases from pcre4j's PatternTests.java,
MatcherMatchingTests.java, and MatcherReplacementTests.java as typed
tests that run against every enabled backend.
Files added:
* tests/JavaMatcherAdapter.h
* tests/PatternPortedTest.cpp (13 cases x N backends)
* tests/MatcherMatchingPortedTest.cpp (14 cases x N backends)
* tests/MatcherReplacementPortedTest.cpp (11 cases x N backends)
Fixes a real bug in JavaRegex along the way:
* splitUnicodeDelimiters surfaced that Java's Matcher.region() takes
UTF-16 char offsets, not UTF-8 bytes, and Matcher.start()/end()
returns char offsets too. Added javaCharOffsetToByteOffset and
byteOffsetToJavaCharOffset helpers in JavaRegex.cpp and routed
region() / start() / end() through them so the backend correctly
handles non-ASCII input.
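The offset translation can be sketched as follows. The two function names are taken from the commit message, but the bodies are an assumption written for illustration: they expect well-formed UTF-8 and offsets that fall on code point boundaries (an offset in the middle of a surrogate pair is not handled).

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Length in bytes of the UTF-8 sequence starting with lead byte b.
// Assumes well-formed UTF-8.
inline std::size_t utf8SequenceLength(unsigned char b) {
  if (b < 0x80) return 1;
  if ((b >> 5) == 0x6) return 2;
  if ((b >> 4) == 0xE) return 3;
  return 4;
}

// Java (UTF-16 code unit) offset -> UTF-8 byte offset.
std::size_t javaCharOffsetToByteOffset(std::string_view utf8, std::size_t charOffset) {
  std::size_t bytes = 0;
  std::size_t chars = 0;
  while (chars < charOffset && bytes < utf8.size()) {
    const std::size_t len = utf8SequenceLength(static_cast<unsigned char>(utf8[bytes]));
    chars += (len == 4) ? 2 : 1;  // 4-byte sequences are surrogate pairs in Java
    bytes += len;
  }
  assert(chars == charOffset && "offset must fall on a code point boundary");
  return bytes;
}

// UTF-8 byte offset -> Java (UTF-16 code unit) offset.
std::size_t byteOffsetToJavaCharOffset(std::string_view utf8, std::size_t byteOffset) {
  assert(byteOffset <= utf8.size());
  std::size_t bytes = 0;
  std::size_t chars = 0;
  while (bytes < byteOffset) {
    const std::size_t len = utf8SequenceLength(static_cast<unsigned char>(utf8[bytes]));
    chars += (len == 4) ? 2 : 1;
    bytes += len;
  }
  return chars;
}
```

For the input `a`, `é`, U+1F30D, `z` (1 + 2 + 4 + 1 bytes), Java char offset 4 (the `z`) maps to byte offset 7, and byte offset 3 (start of the 4-byte sequence) maps back to char offset 2.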
Test result (in-tree, JNI ON):
[==========] 189 tests from 15 test suites ran. (97 ms total)
[ PASSED ] 189 tests.
Breakdown:
- 11 Re2RegexTest (RE2 backend specifics)
- 12 Pcre2RegexTest (PCRE2 specifics incl. lookahead/backref)
- 13 JavaRegexTest (Java specifics incl. \p{InGreek})
- 39 BackendTest (13 typed x 3 backends, core API)
- 39 PatternPortedTest (13 typed x 3 backends, pcre4j Pattern)
- 42 MatchingPortedTest (14 typed x 3 backends, pcre4j Matcher.find)
- 33 ReplacementPortedTest (11 typed x 3 backends, pcre4j replace*)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ally
Ports the 15 remaining cases from pcre4j MatcherMatchingTests.java
(region-bounded find/matches/lookingAt, multiple-find-in-region,
empty/unmatched group, lookaround, zero-width edges). Also extends
JavaMatcherAdapter with region(int,int) and honors region bounds in
find/matches/lookingAt.
Test methodology change per user requirement: tests now assert
Java-canonical behaviour with no `if constexpr` per-backend
branching. Backends that diverge from Java just fail; failures are
the *data*, not bugs.
New TestMain reports a per-backend compatibility rate at end of run:
========== Per-backend compatibility rate ==========
JavaRegex (typed) 67/67 (100%) ← ground truth
Pcre2Regex (typed) 66/67 (98.5%) ← braceQuantifierIncomplete
Re2Regex (typed) 63/67 (94.0%) ← lookaround x2 + brace + zw-region
====================================================
The 5 FAIL cases are real engine differences:
* braceQuantifierIncomplete — Java rejects 'a{'; PCRE2/RE2 accept literally
* positiveLookaround / positiveUnmatchedLookaround — RE2 lacks lookaround
* findWithZeroWidthMatchExhaustsRegion — RE2/PCRE2 $ ignores region end
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add appendReplacement/appendTail/quoteReplacement to JavaMatcherAdapter.
- Port 24 cases from pcre4j MatcherReplacementTests.java covering
  quoteReplacement, replaceAll, replaceFirst, appendReplacement walks,
  named groups, escaped chars, error cases.
- Fix JavaRegex.toJString/fromJString to transcode through UTF-16
  instead of using NewStringUTF/GetStringUTFChars (modified-UTF-8),
  which mis-encoded 4-byte UTF-8 sequences (supplementary chars).
  Round-trips emoji like U+1F310 / U+1F30D correctly now.
- Add Java=100% guard in TestMain that loudly warns on stderr if the
  Java backend ever drops below 100%.
Current matrix:
  JavaRegex   94/94 (100%)
  Pcre2Regex  93/94 (98.9%) - brace strictness
  Re2Regex    90/94 (95.7%) - lookaround + brace + zero-width region
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Java's Matcher.results() returns a Stream<MatchResult>; modelled here
as a find()-loop that snapshots (start, end, group) per match.
Skipped 1 case (resultsStreamOperations) that only exercises Java's
stream API.
Matrix:
  Java  106/106 (100%)
  Pcre2 105/106 (99.06%)
  Re2   101/106 (95.28%) - +1 fail on resultsZeroWidthMatches (lookahead)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…f 9)
Most MatcherMatchResultTests.java cases exercise Java's MatchResult
snapshot semantics (immutability, IllegalStateException/IOOBE/IAE
contracts, namedGroups() map equality, hasMatch() flag) — pure
Java API-contract tests, no regex-engine signal. Ported only the
two cases that exercise engine behavior not already covered:
* matchResultByGroupNumber — 3 sibling captures, full
groupCount/group(i)/start(i)/end(i)
sweep.
* matchResultNamedGroupAccessors — 3 named groups in a date pattern.
Adapter: add start(name)/end(name) overloads to mirror group(name).
Matrix:
Java 108/108 (100%)
Pcre2 107/108 (99.07%)
Re2 103/108 (95.37%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Java Pattern.split implemented as a free helper that drives the
backend's find() loop via JavaMatcherAdapter, so any engine differences
in find/match propagate naturally into split output.
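A sketch of a find()-loop split with Java's default (limit 0) semantics. It uses `std::regex` purely as a stand-in engine so the example is self-contained; the suite's helper drives the backend through JavaMatcherAdapter instead, and `javaSplit` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Splits input around matches of delimiter, following Java's
// Pattern.split(input) rules: a zero-width match at the very start
// produces no leading empty piece, and trailing empty pieces are dropped.
std::vector<std::string> javaSplit(const std::string& input, const std::regex& delimiter) {
  std::vector<std::string> parts;
  std::size_t last = 0;
  const std::sregex_iterator end;
  for (std::sregex_iterator it(input.begin(), input.end(), delimiter); it != end; ++it) {
    const auto start = static_cast<std::size_t>(it->position(0));
    const auto stop = start + static_cast<std::size_t>(it->length(0));
    if (stop == 0) {
      continue;  // zero-width match at offset 0: skip, as Java does
    }
    assert(last <= start);  // matches arrive in order and do not overlap
    parts.push_back(input.substr(last, start - last));
    last = stop;
  }
  if (parts.empty()) {
    return {input};  // no usable match: the whole input is the only piece
  }
  parts.push_back(input.substr(last));
  while (!parts.empty() && parts.back().empty()) {
    parts.pop_back();  // limit 0: strip trailing empty strings
  }
  return parts;
}
```

Because every piece boundary comes from the engine's own match positions, any divergence in find behaviour shows up directly in the split output, which is the property the commit relies on.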
Skipped:
* splitWithDelimiters* (3 cases) — Java 21+ API, not in our embedded
JDK 17.
Matrix:
Java 121/121 (100%)
Pcre2 120/121 (99.17%)
Re2 116/121 (95.87%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Multi-byte UTF-8 literals (1/2/3/4-byte chars) + combined + region.
Offsets translated from Java UTF-16 char positions to UTF-8 byte
positions (e.g., region(3,5) over the surrogate pair for 🌍 → byte
region(7,11)).
Matrix:
  Java  127/127 (100%)
  Pcre2 126/127 (99.21%)
  Re2   122/127 (96.06%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aining)
Added 6 PatternTests.java cases:
* quote — Pattern.quote + \Q...\E round-trip
* commentsWhitespaceIgnored — (?x) ignores unescaped spaces
* commentsHashComments — (?x) # to end-of-line is a comment
* commentsEscapedWhitespace — \ <space> matched literally
* commentsWhitespaceInCharacterClass — [\ ] matches literal space
* commentsEmbeddedFlag — (?x) at start of pattern
PatternTests.java cases intentionally NOT ported (out of engine-compat scope):
* toStringReturnsPattern, asPredicate*/asMatchPredicate*/splitAsStream
— Java functional/utility API only, no engine behavior.
* canonEq* (24 cases)
— CANON_EQ flag explicitly out of scope per spec section 1.2.
* unicodeCase*/flag-introspection (~7)
— Java's UNICODE_CASE/UNICODE_CHARACTER_CLASS/UNIX_LINES need
backend-specific flag plumbing (RE2/PCRE2 don't have inline-flag
equivalents with the same semantics); skip rather than build a
partial framework that can't represent the divergence faithfully.
* matchesStatic* — duplicate of existing FullMatch coverage.
* translatorThrownPatternSyntaxException — pcre4j internal.
Matrix:
Java 133/133 (100%)
Pcre2 132/133 (99.25%) — brace strictness
Re2 123/133 (92.48%) — lookaround, brace, zero-width region, 4 (?x) sub-cases
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fetches OpenJDK 17 (jdk-17.0.13-ga) regex test corpus at CMake configure
time (SHA256 pinned), parses it mirror-faithfully to OpenJDK's own
RegExTest.processFile / grabLine, and reports per-backend compat rate.
Comparison approach: rebuild the result string in the exact shape
OpenJDK's RegExTest produces ("true <g0> <gc> <g1> <g2>..." or
"false <gc>") and string-equal it against the corpus expected line.
This sidesteps ambiguous whitespace-in-match-text parsing.
processEscapes mirrors OpenJDK grabLine: only \n and \uXXXX are
processed (everything else passes through as a regex meta-escape).
Surrogate pairs \uD8##\uDC## are combined into the supplementary
code point and encoded as 4-byte UTF-8, so RE2/PCRE2 see valid UTF-8
and JavaRegex's JNI bridge re-splits to a surrogate pair.
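The escape handling above might look roughly like this. It is a single-pass sketch written for illustration, not the suite's code: `appendUtf8` and `parseHex4` are hypothetical helpers, and a lone surrogate is emitted in its 3-byte form, which is an assumption about how unpaired escapes would be carried.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Appends the UTF-8 encoding of cp (lone surrogates get the 3-byte form).
void appendUtf8(std::string& out, std::uint32_t cp) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Reads exactly four hex digits at s[pos, pos + 4); false if not present.
bool parseHex4(std::string_view s, std::size_t pos, std::uint32_t& value) {
  if (pos + 4 > s.size()) return false;
  value = 0;
  for (std::size_t i = 0; i < 4; ++i) {
    const char c = s[pos + i];
    std::uint32_t digit = 0;
    if (c >= '0' && c <= '9') digit = c - '0';
    else if (c >= 'a' && c <= 'f') digit = 10 + (c - 'a');
    else if (c >= 'A' && c <= 'F') digit = 10 + (c - 'A');
    else return false;
    value = (value << 4) | digit;
  }
  return true;
}

// Processes only \n and \uXXXX; every other escape passes through verbatim.
std::string processEscapes(std::string_view line) {
  std::string out;
  std::size_t i = 0;
  while (i < line.size()) {
    if (line[i] == '\\' && i + 1 < line.size()) {
      if (line[i + 1] == 'n') {
        out += '\n';
        i += 2;
        continue;
      }
      std::uint32_t hi = 0;
      if (line[i + 1] == 'u' && parseHex4(line, i + 2, hi)) {
        i += 6;
        std::uint32_t lo = 0;
        const bool isPair = hi >= 0xD800 && hi <= 0xDBFF && i + 1 < line.size() &&
            line[i] == '\\' && line[i + 1] == 'u' && parseHex4(line, i + 2, lo) &&
            lo >= 0xDC00 && lo <= 0xDFFF;
        if (isPair) {
          // Combine the surrogate pair into one supplementary code point.
          hi = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
          i += 6;
        }
        appendUtf8(out, hi);
        continue;
      }
    }
    out += line[i++];
  }
  return out;
}
```

For example, the corpus text `\uD83C\uDF0D` becomes the four bytes F0 9F 8C 8D, while a regex escape such as `\d+` is left untouched for the engine to interpret.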
Patterns whose expected line begins with "error" map to compile-error
in our backends → counted as PASS (Java agrees they're invalid).
OpenJDK corpus matrix:
Java 299/299 (100%)
Pcre2 217/299 (72.58%) [36 compile-err on Java-specific syntax]
Re2 170/299 (56.86%) [83 compile-err: lookaround, backrefs, ...]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OpenJDK's RegExTest.processFile runs three corpus files; we were only
loading TestCases.txt. Now loads & reports BMPTestCases.txt and
SupplementaryTestCases.txt as well (per-file rates + per-backend
aggregate).
Per-file breakdown:
  Java|TestCases.txt                299/299 (100.00%)
  Java|BMPTestCases.txt             222/222 (100.00%)
  Java|SupplementaryTestCases.txt   339/339 (100.00%)
  Pcre2|TestCases.txt               217/299 (72.58%)
  Pcre2|BMPTestCases.txt            154/222 (69.37%)
  Pcre2|SupplementaryTestCases.txt  176/339 (51.92%)
  Re2|TestCases.txt                 170/299 (56.86%)
  Re2|BMPTestCases.txt              126/222 (56.76%)
  Re2|SupplementaryTestCases.txt    232/339 (68.44%)
Aggregate (all 860 cases × 3 backends):
  Java  860/860 (100.00%)
  Pcre2 547/860 (63.60%)
  Re2   528/860 (61.40%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New library velox/functions/lib/java_pcre2_translator/ (will host the
C++ port of pcre4j PR facebookincubator#606's org.pcre4j.regex.translate
module). This commit adds:
* CMakeLists.txt registering velox_java_pcre2_translator
* JavaRegexTranslator.{h,cpp}: public surface declaration plus a
  Phase-1 identity-passthrough implementation of toPcre2Pattern
* EvaluationFailedException.h
* LICENSE-NOTICE.md documenting the pcre4j → Velox re-licensing
* tests/JavaRegexTranslatorTest.cpp: smoke test (identity + ctor)
* Wires the new subdir into velox/functions/lib/CMakeLists.txt
No behavior change: regex-compat corpus rates unchanged
  Java 860/860 (100.00%) | Pcre2 547/860 (63.60%) | Re2 528/860 (61.40%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1:1 port of org.pcre4j.regex.translate.RangeSet (257 LOC Java → ~210
LOC C++). Inlined a private emitLiteralInClass helper to avoid a
circular dep on the (yet to be ported) ClassRenderer module; the
duplication will go away (or be intentional) when Phase 4 lands.
Tests ported 1:1 from RangeSetTest.java (212 LOC → 23 GTest cases).
All 23 pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1:1 port of org.pcre4j.regex.translate.PropertyMap (185 LOC Java →
~190 LOC C++). Static lookup table from Java property names to PCRE2
equivalents (\p{javaXxx}, \p{InGreek}, \p{IsL}, posix Lower/Upper/...).
Tests ported 1:1 from PropertyMapTest.java (68 LOC → 9 GTest cases).
All pass.
NOTE: JdkPropertyExpander (the other Phase-3 file in the original plan,
277 LOC) requires a full-Unicode codepoint scan via JDK Character API.
It's deferred to Phase 4 where it plugs into the Evaluator. Either an
ICU-based implementation or a pre-generated static table will land
together with the AST/parser/evaluator work.
Suite totals: 34 tests (2 scaffolding + 23 RangeSet + 9 PropertyMap).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…+ JdkPropertyExpander
Port the char-class AST, recursive parser, evaluator, renderer, and
JdkPropertyExpander from pcre4j Java to Velox C++ without wiring to
toPcre2Pattern.
JdkPropertyExpander uses Velox's existing ICU dependency
(u_charType/uscript_getScript/ublock_getCode/binary properties) rather
than generated tables or a stub.
Validation: velox_java_pcre2_translator_test passes 121/121; regex
compat Java corpus remains 860/860 (100.00%).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port the final top-level Java regex translation pipeline: property
rewrites, character-class parsing/rendering, Java inline-flag
normalization, escape handling, and conservative
EvaluationFailedException paths for cases PCRE2 cannot safely express.
Add translator pipeline coverage and regressions for block-qualified
properties, comments-mode braces, flag scoping, cased-property folding,
counted closures, long backreferences, and surrogate-pair unicode
escapes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pcre2Regex constructor now translates the incoming Java regex pattern
to PCRE2 syntax via java_pcre2_translator::toPcre2Pattern before
calling pcre2_compile. EvaluationFailedException is caught and the
message surfaced verbatim via error_ (ok()=false).
OpenJDK corpus impact:
  Pcre2               547/860 (63.60%) → 762/860 (88.60%)   +25 pp
  Pcre2 compile-errs  165 → 76                              -89
Java backend unchanged at 860/860 (100%); Re2 unchanged at 528/860
(61.40%). RE2 wiring is Phase 7.
The 76 remaining Pcre2 compile-errors are corpus patterns using
features the translator still passes through verbatim (e.g., character
classes with intersection operands that include script ranges PCRE2
doesn't recognise, edge cases in posix class expansions, etc.). Will be
incrementally addressed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add toRe2Pattern using the Java regex translator pipeline for Unicode
property and character-class normalization, Java named-group rewriting,
Java COMMENTS-mode translation, and RE2 octal escape rendering. Detect
unsupported RE2 features up front, including lookaround,
backreferences, possessive quantifiers, atomic groups, and Java flags
without RE2 equivalents.
OpenJDK corpus:
  Java  860/860 (100.00%)
  Pcre2 762/860 (88.60%) (unchanged from Phase 6)
  Re2   755/860 (87.79%) (was 528/860 / 61.40%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Expose a translator side channel for raw surrogate byte mode and
rewrite surrogate escape aliases to byte-sequence PCRE2 before
compiling without PCRE2_UTF.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port Java property and Unicode block materialization from the pcre4j
fork, including surrogate block sentinel handling and class
parser/renderer updates. Adjust the PCRE2 backend for Java default
shorthand semantics and raw surrogate UTF-8 matching used by the
OpenJDK corpus.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gate failures
Brings PCRE2 OpenJDK corpus to 860/860 (100.00%), matching Java.
Key changes:
* translator: lone-surrogate property tokens (\p{InHIGH_SURROGATES},
\p{InLOW_SURROGATES}, surrogate codepoints in \x{...}) now translated to
raw-byte CESU-8 sequences in PCRE2 raw-byte mode
* Pcre2Regex: when raw-byte mode is selected, char-class ranges that include
surrogates are rewritten to byte-sequence regex matching CESU-8 encoding
* PropertyMap / ClassRenderer / RangeSet: minor tweaks to support the
surrogate handling path
Validation:
velox_java_pcre2_translator_test: 174/174
velox_regex_compat_test: 445 passed, 4 expected skips
OpenJDK corpus:
Java 860/860 (100.00%)
Pcre2 860/860 (100.00%) ← was 854/860
Re2 757/860 ( 88.02%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port 80 active RegExTest bodies to the typed regex-compat suite and add
explicit TODO skips for Java-specific or adapter-limited cases. The
port records per-backend compatibility rates while preserving Java as
the canonical pass requirement.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After P1.8 added 124 RegExTest cases with 28 GTEST_SKIP entries
(Java-API-only behaviour with no C++ adapter equivalent), the per-
backend tally listener was counting skipped tests as not-passing,
making Java backend appear to drop from 100% to 89% (231/259).
Treat Skipped() as neither pass nor fail. Skipped count is reported
separately for visibility:
JavaRegex (typed) 231 / 231 (100%) [skipped: 28]
Pcre2Regex(typed) 231 / 231 (100%) [skipped: 28]
Re2Regex (typed) 227 / 227 (100%) [skipped: 32]
OpenJDK corpus + RegExTest separate-tally blocks (which use their
own reporters, not GTest pass/fail) are unaffected.
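The counting rule can be sketched as below. `Outcome`, `BackendTally`, and `makeTally` are hypothetical names for illustration; the actual listener hooks into GTest's per-test results rather than taking raw counts.

```cpp
#include <cassert>

enum class Outcome { kPassed, kFailed, kSkipped };

// Per-backend counters. Skipped tests are tracked for visibility but are
// excluded from the denominator of the compatibility rate.
struct BackendTally {
  int passed = 0;
  int failed = 0;
  int skipped = 0;

  void record(Outcome outcome) {
    switch (outcome) {
      case Outcome::kPassed: ++passed; break;
      case Outcome::kFailed: ++failed; break;
      case Outcome::kSkipped: ++skipped; break;
    }
  }

  // Only tests that actually ran count toward the rate.
  int counted() const { return passed + failed; }

  double ratePercent() const {
    assert(counted() > 0 && "rate is undefined when nothing was counted");
    return 100.0 * passed / counted();
  }
};

// Convenience for the example: replays raw counts through record().
BackendTally makeTally(int passed, int failed, int skipped) {
  BackendTally tally;
  for (int i = 0; i < passed; ++i) tally.record(Outcome::kPassed);
  for (int i = 0; i < failed; ++i) tally.record(Outcome::kFailed);
  for (int i = 0; i < skipped; ++i) tally.record(Outcome::kSkipped);
  return tally;
}
```

With 231 passes and 28 skips, the denominator is 231 rather than 259, so the Java backend reports 100% instead of the misleading 89%.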
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ssible patterns
When a corpus / RegExTest pattern uses features the target engine cannot
support natively (RE2 lookaround, backref, possessive, atomic; PCRE2 has
none in this category — its translator-rejected list is empty), the
translator throws EvaluationFailedException and the backend marks the
pattern as not-ok with an error prefix "Java→RE2 translator: ...".
These are engine ceilings, not implementation bugs.
Both OpenJDK corpus and RegExTest reports now print an additional
"translatable subset" line that excludes such patterns:
OpenJDK corpus:
Re2 757 / 860 (88.02%) [compile-err: 99]
Re2 (translatable subset) 757 / 771 (98.18%) [excludes 89 translator-rejected]
RegExTest ported:
Re2 84 / 96 (87.50%)
Re2 (translatable subset) 84 / 93 (90.32%) [excludes 3 translator-rejected]
Implementation:
- OpenJdkCorpusDiffTest tracks translatorRejected based on the
Re2Regex/Pcre2Regex error_ string prefix.
- RegExTestPortedTest adds a thread_local flag tlsTranslatorRejected
set by a notePatternStatus helper called from every find/match
helper; the PORTED_REGEX_TEST macro resets it before each test
body and recordCase consumes it afterwards.
Java / Pcre2 corpus remain 100.00%; Re2 raw rate unchanged.
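A sketch of the prefix-based classification and the "translatable subset" arithmetic. Only the `Java→RE2 translator:` prefix comes from the commit message; `isTranslatorRejected` and `CorpusTally` are hypothetical names, and the sketch assumes rejected patterns are never counted as passed.

```cpp
#include <cassert>
#include <string_view>

// True when the backend error originated in the translator rather than
// in the engine's own compile step, detected via the message prefix.
bool isTranslatorRejected(std::string_view error) {
  constexpr std::string_view kPrefix = "Java→RE2 translator:";
  return error.substr(0, kPrefix.size()) == kPrefix;
}

struct CorpusTally {
  int total = 0;
  int passed = 0;
  int translatorRejected = 0;

  // Denominator for the "translatable subset" line.
  int translatableTotal() const { return total - translatorRejected; }

  double translatableRatePercent() const {
    assert(translatableTotal() > 0);
    return 100.0 * passed / translatableTotal();
  }
};
```

Plugging in the reported RE2 corpus numbers (860 total, 757 passed, 89 rejected) gives a denominator of 771 and a rate of about 98.18%, matching the report line.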
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pply pre-commit formatting
CI on PR facebookincubator#17685 was failing on:
* GCC 14 -Werror=dangling-pointer in ClassBodyParserTest:
    auto* inter = parse("...").getIf<Intersection>(); // dangling
  Bound to a named local first instead.
* pre-commit: license-header / clang-format / gersemi
  Ran 'pre-commit run --files ...' over the touched files; only
  formatting changes (no behavioural diff).
Verified:
  velox_java_pcre2_translator_test: 178/178 passing
  OpenJDK corpus: Java/Pcre2 860/860, Re2 757/860 (98.18% translatable subset)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
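The dangling-pointer pattern and its fix can be illustrated with a small stand-in. `ClassNode`, `Literal`, and this toy `parse` are hypothetical and only mimic the shape of the test code (a by-value result with a `getIf<T>()` accessor); they are not the translator's real types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct Literal {
  std::string text;
};
struct Intersection {
  std::vector<std::string> operands;
};

class ClassNode {
 public:
  explicit ClassNode(std::variant<Literal, Intersection> value)
      : value_(std::move(value)) {}

  // Returns a pointer into this object, or nullptr on a type mismatch.
  template <typename T>
  const T* getIf() const {
    return std::get_if<T>(&value_);
  }

 private:
  std::variant<Literal, Intersection> value_;
};

// Toy parser: splits on the first "&&" and returns the node by value.
ClassNode parse(const std::string& body) {
  const auto pos = body.find("&&");
  if (pos == std::string::npos) {
    return ClassNode(Literal{body});
  }
  return ClassNode(Intersection{{body.substr(0, pos), body.substr(pos + 2)}});
}

std::size_t intersectionOperandCount(const std::string& body) {
  // Dangling: the temporary returned by parse() is destroyed at the end
  // of the full expression, so the pointer would outlive its target:
  //   const auto* inter = parse(body).getIf<Intersection>();
  const ClassNode node = parse(body);  // named local keeps the node alive
  const auto* inter = node.getIf<Intersection>();
  assert(inter == nullptr || !inter->operands.empty());
  return inter != nullptr ? inter->operands.size() : 0;
}
```

Binding the result to `node` extends its lifetime to the end of the enclosing scope, which is what makes the later dereference of `inter` valid.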
082c3b0 to d4fdfe6
mbasmanova left a comment
Thank you for the thorough proposal and for addressing each of the historical concerns from #9897 and #9823. The approach is reasonable — PCRE2 with built-in limits is a better answer than Oniguruma or Hyperscan, and the translator improving RE2's corpus pass-rate from 61% to 88% is valuable on its own.
However, 12K lines in a single PR is not reviewable. This contains three independent components that should be separate PRs:
1. Java→PCRE2/RE2 translator (velox/functions/lib/java_pcre2_translator/): builds unconditionally, benefits RE2 immediately, no new dependencies.
2. PCRE2 engine wrapper (velox/external/regex_compat/): the new engine, gated behind VELOX_ENABLE_PCRE2_BACKEND.
3. Test harness (velox/external/regex_compat/tests/): the JVM-oracle-based test framework.
Items 2 and 3 need broader discussion before we commit to them. Adding a second regex engine is a significant maintenance burden — two code paths to test, debug, and keep compatible. The opt-in CMake flag means CI either tests both paths (doubling regex test matrix) or only tests RE2 (leaving PCRE2 undertested). We'd like to discuss the long-term ownership and maintenance plan before merging the engine itself.
Please split accordingly. The translator PR alone would be a significant contribution and can be reviewed independently. The PCRE2 wrapper and test harness can follow once the translator lands.
For the translator PR: please add documentation (README or header-level docs) explaining what transformations it performs, what it doesn't handle, and the testing strategy. The 178 unit tests are there, but a reader needs to understand what coverage they provide — are these ported 1:1 from pcre4j? Do they cover the transformations that matter most for Spark/Presto patterns?
Also: before the translator PR, we'd like to understand the integration plan. How will existing Velox functions (regexp_extract, regexp_replace, regexp_like, like, Spark split) use the translator and/or PCRE2? Is this a per-function opt-in, a global config, or a per-query setting? The translator library is useful, but the review needs to understand how it connects to the rest of the system.
CI 'Build with GCC / Linux release with adapters' was failing because:
* clang-tidy scans every diff-touched header in isolation; JavaRegex.h /
JvmFixture.h unconditionally include <jni.h>, which is absent on hosts
without a JDK.
* RegExTestPortedTest.cpp's PORTED_REGEX_TEST macro and
OpenJdkCorpusDiffTest.cpp's Grapheme test both reference the JavaRegex
symbol unconditionally; when JAVA_BACKEND is OFF the symbol is not
declared and compilation fails.
Fix:
* Wrap the bodies of JavaRegex.h and JvmFixture.h in
#if VELOX_REGEX_COMPAT_HAS_JAVA so they compile to nothing when the
Java backend is off.
* In RegExTestPortedTest.cpp, gate the 'Java backend regression' EXPECT
in PORTED_REGEX_TEST behind the same macro.
* In OpenJdkCorpusDiffTest.cpp, gate toJString / directJavaGraphemeBreakOffsets
helpers and every 'is_same_v<TypeParam, JavaRegex>' branch behind the
macro.
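The guarding pattern can be sketched in a single file as follows. Only the `VELOX_REGEX_COMPAT_HAS_JAVA` macro name comes from the commit; the `JavaRegex` placeholder and `enabledBackends` are hypothetical, and defaulting the macro to 0 when it is undefined is an assumption made so the sketch compiles standalone.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Assumption for this sketch: treat an undefined macro as "backend off".
#ifndef VELOX_REGEX_COMPAT_HAS_JAVA
#define VELOX_REGEX_COMPAT_HAS_JAVA 0
#endif

#if VELOX_REGEX_COMPAT_HAS_JAVA
#include <jni.h>  // only reached when a JDK is available
// Placeholder for the JNI-backed backend; compiled out when the flag is 0.
struct JavaRegex {
  JNIEnv* env = nullptr;
};
#endif

// Lists the backends this build can exercise. Every reference to the
// Java backend sits behind the same macro as its declaration, so the
// translation unit still compiles on hosts without <jni.h>.
std::vector<std::string> enabledBackends() {
  std::vector<std::string> names = {"Re2Regex", "Pcre2Regex"};
#if VELOX_REGEX_COMPAT_HAS_JAVA
  names.push_back("JavaRegex");
#endif
  assert(names.size() >= 2);
  return names;
}
```

The key point is symmetry: a declaration and each of its uses are guarded by the same condition, so a tool that compiles a header in isolation never sees a use without the declaration.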
Verified both build paths locally:
-DVELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=ON -> regex_compat_test passes
(Java/Pcre2 860/860, Re2 757/860)
-DVELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=OFF -> builds clean
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…_COMPAT_HAS_JAVA
Follow-up to previous commit: the .h files were guarded but the .cpp
files still unconditionally use JNI types (jint, jclass, ...) which
clang-tidy on hosts without a JDK cannot resolve, causing 'Build with
GCC / Linux release with adapters' to fail.
Wrap each .cpp body in #if VELOX_REGEX_COMPAT_HAS_JAVA / #endif so they
compile to an empty translation unit when the Java backend is off.
Verified locally:
JAVA_BACKEND=ON: regex_compat_test passes (Java/Pcre2 860/860,
Re2 757/860 / 98.18% translatable subset)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Proposal
Add PCRE2 as a second regex engine in Velox, enabled at compile time via a new CMake option (default OFF). The existing RE2 path is unchanged; users that need fuller java.util.regex semantics (primarily Spark / Presto front-ends running through Gluten) can opt in at build time.
The PR also ships:
Why a second engine, not a replacement
Past discussion (#9897, #9823) repeatedly raised the RE2 vs java.util.regex gap and equally raised legitimate concerns about every proposed alternative. PCRE2 is qualitatively different from Oniguruma / Hyperscan, and this PR addresses each historical objection directly:

Five concerns, one by one

* Concern: ReDoS / non-linear runtime. @mbasmanova on #9823: "It is not realistic to expect users to 'check the provided patterns carefully'."
  Response: PCRE2 has built-in LIMIT_MATCH / LIMIT_DEPTH / LIMIT_HEAP that the engine itself enforces. Our Pcre2Regex wrapper sets conservative defaults and exposes them as Options fields. Hitting a limit surfaces a typed error rather than running forever. This is the same protection Java 9+ added to java.util.regex. Oniguruma and Hyperscan do not have this.
* Concern: Oniguruma trades reliability for coverage. @mbasmanova on #9897.
  Response: We do not ship Oniguruma.
* Concern: Hyperscan effectively abandoned. @philo-he on #9823.
  Response: We do not ship Hyperscan. PCRE2 by contrast is actively maintained (PHP, Apache HTTPD, nginx, git all depend on it; releases roughly every 6 months).
* Concern: Velox SQL is single-pattern per row; multi-pattern engines don't pay off. @mbasmanova on #9823.
  Response: PCRE2 is a single-pattern engine; matches that constraint.
* Concern: User must opt in if accepting any risk. @FelixYBW on #9823: "In Gluten we use a config to enable re2 offload. Gluten user needs to take the risk if they enable."
  Response: New CMake option VELOX_ENABLE_PCRE2_BACKEND=OFF by default; identical opt-in pattern. Default cmake ... adds zero new dependencies.

What's in the PR
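The match-limit idea from the ReDoS concern can be illustrated without PCRE2 itself. The sketch below is a toy backtracking matcher (literals, `.`, and postfix `*` only) with a step budget; it is not PCRE2's implementation and not the PR's Pcre2Regex wrapper, and the names `BoundedMatcher` and `MatchLimitExceeded` are hypothetical. It only shows how a budget turns runaway backtracking into a typed error.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Typed error raised when the step budget is exhausted (hypothetical name).
class MatchLimitExceeded : public std::runtime_error {
 public:
  MatchLimitExceeded() : std::runtime_error("match limit exceeded") {}
};

// Toy backtracking matcher, anchored at both ends of the text.
class BoundedMatcher {
 public:
  explicit BoundedMatcher(std::size_t matchLimit) : limit_(matchLimit) {}

  bool fullMatch(const std::string& re, const std::string& text) {
    steps_ = 0;  // the budget applies per match call
    return matchHere(re, 0, text, 0);
  }

 private:
  bool matchHere(const std::string& re, std::size_t r,
                 const std::string& text, std::size_t t) {
    // Every recursive step is charged against the budget.
    if (++steps_ > limit_) {
      throw MatchLimitExceeded();
    }
    if (r == re.size()) {
      return t == text.size();
    }
    const char c = re[r];
    if (r + 1 < re.size() && re[r + 1] == '*') {
      // Try zero repetitions first, then one more each round.
      for (;;) {
        if (matchHere(re, r + 2, text, t)) {
          return true;
        }
        if (t < text.size() && (c == '.' || text[t] == c)) {
          ++t;
        } else {
          return false;
        }
      }
    }
    if (t < text.size() && (c == '.' || text[t] == c)) {
      return matchHere(re, r + 1, text, t + 1);
    }
    return false;
  }

  std::size_t limit_;
  std::size_t steps_ = 0;
};
```

With a budget of 1000 steps, `a*b` against `aaab` matches comfortably, while `a*a*a*a*a*a*c` against thirty `a` characters throws `MatchLimitExceeded` instead of exploring every way to split the input. In PCRE2 itself the corresponding knobs are `pcre2_set_match_limit`, `pcre2_set_depth_limit`, and `pcre2_set_heap_limit` on a match context, and exceeding them makes `pcre2_match` return an error code such as `PCRE2_ERROR_MATCHLIMIT`.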
1. PCRE2 engine wrapper: velox/external/regex_compat/Pcre2Regex.{h,cpp}

Production-quality wrapper around PCRE2 with a re2::RE2-shaped surface so call sites that opt in can switch with minimal churn:

* Options.{caseSensitive, dotNl, oneLine, ...} matching Velox's existing Re2Regex options.
* … Options fields so callers can tune them. Hitting a limit surfaces a typed error, not a hang.
* … toPcre2Pattern before pcre2_compile, so Java-specific syntax (\p{InGreek}, \p{javaXxx}, (?U) flag inversion, char-class intersection, …) just works.

2. Java→PCRE2/RE2 translator: velox/functions/lib/java_pcre2_translator/

Reusable C++ library that rewrites a java.util.regex pattern into PCRE2- or RE2-compatible syntax:

* … org.pcre4j.regex.translate module from scratch and contributed it upstream); pcre4j does not require a CLA, so I retain copyright on that contribution and am dual-licensing the C++ port here under Apache-2.0. Per-file headers carry the attribution; LICENSE-NOTICE.md documents provenance.
* … toRe2Pattern that reuses the same parser/evaluator/property map and additionally rejects features RE2 fundamentally cannot represent (lookaround, backref, possessive, atomic, (?U) semantic inversion); it throws EvaluationFailedException rather than silently changing semantics.
* … functions/lib/ member); production call sites can adopt it independently of the PCRE2 backend.

Headline result for RE2: when the existing Re2Regex is wired through this translator, RE2's OpenJDK 17 corpus pass-rate jumps from 528/860 (61.40%) to 757/860 (88.02%) without changing the engine. That benefit alone, for anyone who wants to stay on RE2, is the second reason to merge.

3. Test harness: velox/external/regex_compat/tests/

Three-backend GTest framework (Re2Regex, Pcre2Regex, JavaRegex via embedded JVM as oracle), parameterised via TYPED_TEST_SUITE. Runs:

* … jdk-17.0.13-ga.
* … RegExTest.java.

A TestMain guard asserts the Java backend remains 100% and loudly warns otherwise: Java is java.util.regex, so any failure there means our JNI bridge / adapter is wrong, not an engine difference. The harness is gated behind VELOX_ENABLE_REGEX_COMPAT_TESTS=OFF with a further sub-option VELOX_ENABLE_REGEX_COMPAT_JAVA_BACKEND=OFF that auto-disables if no JDK is found.

Default-build impact
Zero. With the default cmake ... invocation:

Motivation: gap is documented, recurring, and previously unmeasured

* … prepareRegexpReplacePattern / prepareRegexpReplaceReplacement into functions/lib for Spark reuse. That helper covers one rewrite ((?<name>) → (?P<name>)); everything else (char-class intersection, \p{InGreek}, \p{javaXxx}, (?U) semantic inversion, \Q\E edges, …) silently falls on the floor.
* … regexp_replace. Root cause: the one-shot preprocessing introduced in #10981 (fix(regexp_replace): Move regex preprocessing to functions/lib for Spark reuse and fix backslash handling) had no compatibility regression set behind it. Exactly the class of bug a Java-corpus regression suite catches up front.
* … kMaxCompiledRegexes / cache-eviction issues. No shared regression corpus existed to validate fixes; this PR provides one.
* … boost::regex with RE2" migrations; each PR needed ad-hoc per-call-site verification because no shared compatibility harness existed.

Validation
* translatable subset = the test set excluding patterns the RE2 translator deliberately rejects as engine-impossible (lookaround, backref, possessive quantifiers, atomic groups, (?U) semantic inversion). RE2's algorithm guarantees linear time and intentionally does not support these constructs.

Net takeaway for users that opt in to PCRE2: effective Java parity (100% on the OpenJDK corpus, 97.92% on RegExTest), with PCRE2's own resource-cap protection enforcing predictable runtime.
Net takeaway for users that stay on RE2: the translator alone, applied to RE2, picks up 229 corpus cases (+27 pp) without touching the engine or its linear-time guarantee.
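The "translatable subset" footnote depends on rejecting constructs RE2 cannot represent instead of rewriting them. A minimal sketch of such a pre-check follows, assuming a simple left-to-right scan; it is not the PR's toRe2Pattern. `UnsupportedInRe2` is a hypothetical stand-in for EvaluationFailedException, and the scan covers only lookaround, atomic groups, and backreferences (possessive quantifiers, (?U), and \Q...\E quoting are deliberately left out).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the PR's EvaluationFailedException.
class UnsupportedInRe2 : public std::runtime_error {
 public:
  explicit UnsupportedInRe2(const std::string& what)
      : std::runtime_error(what) {}
};

// Throws if the Java pattern uses lookaround, an atomic group, or a
// backreference. Simplified: a leading ']' inside a class and
// \Q...\E quoting are not modelled.
void checkRe2Representable(const std::string& pattern) {
  int classDepth = 0;  // Java allows nested classes, e.g. [a-z&&[^b]]
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    const char c = pattern[i];
    if (c == '\\') {
      const char next = i + 1 < pattern.size() ? pattern[i + 1] : '\0';
      // \1..\9 and \k<name> are backreferences outside a class.
      if (classDepth == 0 &&
          ((next >= '1' && next <= '9') || next == 'k')) {
        throw UnsupportedInRe2("backreference");
      }
      ++i;  // skip the escaped character
      continue;
    }
    if (classDepth > 0) {
      // Inside [...] a '(' is just a literal; only track nesting.
      if (c == '[') ++classDepth;
      if (c == ']') --classDepth;
      continue;
    }
    if (c == '[') {
      classDepth = 1;
      continue;
    }
    if (c != '(') {
      continue;
    }
    if (pattern.compare(i, 3, "(?=") == 0 ||
        pattern.compare(i, 3, "(?!") == 0 ||
        pattern.compare(i, 4, "(?<=") == 0 ||
        pattern.compare(i, 4, "(?<!") == 0) {
      throw UnsupportedInRe2("lookaround");
    }
    if (pattern.compare(i, 3, "(?>") == 0) {
      throw UnsupportedInRe2("atomic group");
    }
  }
}
```

A named group such as `(?<year>\d{4})` passes, because `(?<` followed by a name is neither `(?<=` nor `(?<!`, whereas `foo(?=bar)` or `(a)\1` is rejected up front. Failing loudly here is what keeps an excluded pattern from being counted as a silent semantic change.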
Out of scope
* … RE2's defaults or its kMaxCompiledRegexes behaviour. Issues #7824 (ExpressionFuzzer fails on like: Max number of regex reached) / #8438 (kMaxCompiledRegexes leads to non-deterministic SparkSQL regexp_replace behavior) / #15953 (fix: Use EvictingCacheMap for compiled regular expressions) are out of scope; this PR's corpus is the regression baseline a future fix can use.
* \b{g} (grapheme cluster boundary) and CANON_EQ full canonical-equivalence expansion are PCRE2 engine ceilings; documented in code comments. They are the only meaningful items not at 100% in the PCRE2 column above.

File-level overview
Total: ~5.4 kLOC added; 0 lines changed in any existing source file outside CMake wiring.
Licensing
The translator port derives from pcre4j PR alexey-pelykh/pcre4j#606, which I authored (the entire org.pcre4j.regex.translate module was a from-scratch contribution by me to upstream pcre4j). pcre4j does not require a Contributor License Agreement, so copyright on my contribution remains mine, and I am dual-licensing it under Apache-2.0 for inclusion in this Apache-2.0 project.

The pcre4j upstream files carry a Copyright (C) 2024-2026 Oleksii PELYKH header; that is pcre4j's project-wide convention applied to all merged contributions, not a copyright assignment from contributors. The actual authorship of the translate module is mine and the Apache-2.0 grant here is therefore valid.

Per-file attribution headers in this PR retain the pcre4j provenance; LICENSE-NOTICE.md in the translator directory documents the full chain.