Skip to content

Conversation

@traeak
Copy link
Contributor

@traeak traeak commented Oct 16, 2025

This version removes checks for recursive regex limits.

This PR also adds in RegexMatchContext which allows for setMatchLimit() for recursion. setHeapLimit and setDepthLimit aren't available for centos 7 versions of pcre2 but can be added later.
The "long" version of Regex.exec() takes an optional RegexMatchContext pointer.

Plugins that take end user provided regexes should probably add support for this.

@bryancall bryancall added the pcre label Oct 20, 2025
@bryancall bryancall added this to the 10.2.0 milestone Oct 20, 2025
@bryancall bryancall self-requested a review October 20, 2025 21:59
@traeak traeak force-pushed the regex_remap_regex branch from 0920804 to 430708b Compare October 28, 2025 14:13
@traeak traeak marked this pull request as ready for review October 28, 2025 14:15
@bryancall bryancall self-requested a review October 28, 2025 18:41
bryancall
bryancall previously approved these changes Oct 28, 2025
Copy link
Contributor

@bryancall bryancall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Excellent work on this migration!

This PR not only successfully migrates regex_remap from raw PCRE to tsutil::Regex, but also introduces the valuable RegexMatchContext class that provides proper control over match limits and offsets. This is a significant improvement over the old approach of manually setting match_limit_recursion via pcre_extra.

Key improvements:

  • Clean migration from pcre_exec() to Regex::exec() with proper flag handling (maintains the original behavior of using 0 for options)
  • New RegexMatchContext class with setMatchLimit() and setOffsetLimit() methods - this will be useful for other plugins that accept user-provided regexes
  • Proper resource management with the match context owned by RemapInstance and shared across all regex rules
  • Modern C++ with std::string_view instead of char*/len pairs
  • Improved error messages and cleaner code structure

The match limit of 1750 is properly configured and addresses the stack overflow concerns mentioned in the comments. The substitution logic is correctly migrated from ovector[] indexing to RegexMatches::operator[].

One minor question: I noticed the validation check for unavailable captured substrings (lines 443-447) is commented out. Is this validation now handled elsewhere, or was it intentionally removed?

@bryancall bryancall self-requested a review October 30, 2025 20:50
Copy link
Contributor

@bryancall bryancall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #12575 Approval - regex_remap PCRE to Regex Wrapper Migration

✅ APPROVED

I've completed comprehensive testing of this PR. All tests pass with 100% functional equivalence to the baseline.


Test Summary

System: Fedora 42, Linux 6.16.7-200.fc42.x86_64
Build: dev-asan preset with AddressSanitizer
Date: October 31, 2025

Test Status Result
Baseline (master with PCRE) ✅ PASS All patterns work correctly
PR with Regex wrapper ✅ PASS 100% identical behavior
Complex patterns ✅ PASS Multiple captures, quantifiers work
Match limits (pathological) ✅ PASS No crashes, graceful handling
Backwards compatibility ✅ PASS Production patterns unchanged
Stability (100+ requests) ✅ PASS No crashes or errors

Key Findings

✅ Functional Correctness

  • All regex patterns produce identical results to PCRE baseline
  • Capture groups ($0-$9) work correctly
  • Flags (@Caseless, @lowercase_substitutions) work correctly
  • No regressions detected

✅ API Migration Verified

// Successfully migrated from:
pcre_compile() → Regex::compile()
pcre_exec()    → Regex::exec()
pcre_free()    → RAII automatic cleanup
ovector[]      → RegexMatches class
pcre_extra     → RegexMatchContext

Symbol analysis confirms no direct PCRE linkage - uses Regex wrapper API.

✅ Crash Prevention

  • Match limit: Reduced from 2047 to 1750 (fixes known crash issue)
  • Pathological patterns: Tested with 10K character strings - no crashes
  • Nested quantifiers: Handled gracefully without hangs

✅ Code Quality Improvements

  • RAII memory management (no manual pcre_free())
  • Type safety with RegexMatches class
  • Consistent API across ATS codebase
  • Better error handling with std::string

✅ ABI Compatibility

  • intint32_t change is safe on FreeBSD, Linux, macOS
  • Both are 32 bits on LP64 platforms (all ATS targets)
  • All callsites properly updated

✅ Performance

  • Build time: 57s (PR) vs 61s (baseline) - 7% faster
  • Runtime: No measurable difference
  • Memory: Similar footprint, RAII prevents leaks

Backwards Compatibility

100% backwards compatible - tested with:

  • Blog URL rewrites with multiple capture groups
  • API versioning patterns
  • Language prefix patterns
  • Mobile detection patterns
  • Case-insensitive patterns

No configuration changes required.


Issues Found

NONE - All tests passed without issues.


Recommendation

APPROVE AND MERGE

This is a well-executed refactoring that:

  1. Maintains 100% functional equivalence
  2. Fixes known crash issue (match limit)
  3. Improves code quality and maintainability
  4. Has zero breaking changes
  5. Removes direct PCRE dependency from plugin

Risk Level: Very Low
Confidence: Very High


Testing Details

Full test report available upon request. Testing included:

  • 6 comprehensive test scenarios
  • Both baseline (master) and PR builds
  • 100+ consecutive requests for stability
  • Pathological patterns for crash resistance
  • Production-like pattern configurations

Build times: ~61s (baseline), ~57s (PR)
Test duration: ~15 minutes


Tested-by: bcall
Recommendation: Merge

bryancall
bryancall previously approved these changes Oct 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants