moonbit | ||||||
---|---|---|---|---|---|---|
|
β οΈ API STABILITY NOTICE
This is a alpha release with a stabilizing API. While core functionality is complete and well-tested,
API changes may occur in future versions as we refine the implementation.
Regular expression engine for MoonBit β inspired by Russ Cox's regex series.
test {
// Compile once, use everywhere
let regexp = @regexp.compile("a(bc|de)f")
guard regexp.match_("xxabcf") is Some(result)
inspect(
result.results(),
content=(
#|[Some("abcf"), Some("bc")]
),
)
// Write a simple split with regexp
fn split(
regexp : @regexp.Regexp,
target : @string.View
) -> Array[@string.View] {
let result = []
loop target {
"" => ()
str => {
let res = regexp.execute(str)
result.push(res.before())
continue res.after()
}
}
result
}
let re = @regexp.compile("_+")
inspect(
split(re, "1_2__3__4__5_____6"),
content=(
#|["1", "2", "3", "4", "5", "6"]
),
)
}
compile(pattern)
β Creates anEngine
engine.execute(text)
β ReturnsMatchResult
result.matched()
βBool
result.get(index)
β Capture group contentresult.results()
β Iterator over all matches
engine.group_by_name(name)
β Find group index by nameengine.group_count()
β Total capture groupsresult.groups()
β Get named group content
Feature | Example | What it does |
---|---|---|
Literals | abc |
Match exact text |
Wildcards | a.c |
. matches any character |
Quantifiers | a+ , b* , c? |
One or more, zero or more, optional |
Ranges | a{2,5} |
Between 2-5 repetitions |
Classes | [a-z] , [^0-9] |
Character sets, negated sets |
Groups | (abc) , (?:xyz) |
Capturing, non-capturing |
Named | (?<word>abc) |
Named capture groups |
Choice | cat|dog |
Match either option |
Anchors | ^start , end$ |
Line boundaries |
Escapes | \\u{41} , \\u0041 |
Unicode escapes, standard escapes |
Unicode Props | \\p{L} , \\p{Nd} |
Unicode general categories |
Backrefs |
(.)\\1 |
Reference previous captures |
Match characters by their Unicode general categories:
test "unicode properties" {
// Matching gc=L
let regex = @regexp.compile("\\p{Letter}+")
inspect(
regex.execute("Hello δΈη").results(),
content=(
#|[Some("Hello")]
),
)
// Matching gc=N
let regex = @regexp.compile("\\p{Number}+")
inspect(
regex.execute("123 and 456").results(),
content=(
#|[Some("123")]
),
)
}
Supported Propertes:
β οΈ Performance Warning: Backreferences can cause exponential time complexity in worst cases!
test "backreferences" {
// Palindrome detection (simple)
let palindrome = @regexp.compile("^(.)(.)\\2\\1")
inspect(
palindrome.execute("abba").results(),
content=(
#|[Some("abba"), Some("a"), Some("b")]
),
)
// HTML tag matching
let html_regex = @regexp.compile("<([a-zA-Z]+)[^>]*>(.*?)</\\1>")
let result = html_regex.execute("<div class='test'>content</div>")
inspect(
result.results(),
content=(
#|[Some("<div class='test'>content</div>"), Some("div"), Some("content")]
),
)
}
test "character classes" {
// Email validation (simplified)
let email = @regexp.compile(
(
#|[\w-]+@[\w-]+\.\w+
),
)
let email_result = email.execute("[email protected]").results()
inspect(
email_result,
content=(
#|[Some("[email protected]")]
),
)
// Extract numbers
let numbers = @regexp.compile(
(
#|\d+\.\d{2}
),
)
let result = numbers.execute("Price: $42.99").results()
inspect(
result,
content=(
#|[Some("42.99")]
),
)
// Named captures for parsing
let parser = @regexp.compile(
(
#|(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})
),
)
let date_result = parser.execute("2024-03-15")
inspect(
date_result.groups(),
content=(
#|{"year": "2024", "month": "03", "day": "15"}
),
)
}
test {
try {
let _ = @regexp.compile("a(b") // Oops! Missing )
} catch {
RegexpError(err=MissingParenthesis, source_fragment=_) => println("Fix your regex! π§")
_ => ()
}
}
- Predictable complexity β Designed to avoid catastrophic backtracking (except with backreferences)
- VM-based β Structured interpreter design
- Unicode support β Character set and property support
Built with reliability and correctness as primary goals.
This implementation has some behavior differences compared to other popular regex engines:
-
Empty Character Class Handling:
- In JavaScript:
[][]
is parsed as two character classes with no characters - In Golang:
[][]
is parsed as one character class containing]
and[
- In MoonBit: we follow the JavaScript interpretation
- In JavaScript:
-
Empty Alternatives Behavior:
- Expressions like
(|a)*
and(|a)+
have specific behavior that may differ from other implementations - See Golang issue #46123 for related discussion
- Expressions like
-
Backreferences:
- Backreferences are supported but may impact the complexity guarantees of the engine