Support nested regex and strings in expressions #508

andersmurphy · 2025-01-21T17:55:47Z

This regex allows Datastar expressions to support nested regex and strings that contain ; and/or \n without breaking.

Full regex for testing on regex101.com

((?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])*)([;\n]+)

Only valid ; and \n should be in the second capture group. Each of these regexes ignores a block type:

regex            \/(?:\\\/|[^\/])*\/
double quotes     "(?:\\"|[^\"])*"
single quotes     '(?:\\'|[^'])*'
ticks             `(?:\\`|[^`])*`

We want to ignore the non delimiter part of statements too:

[^;\n]

Once all the blocks we want to ignore are captured we then find the statement delimiters in the second capture group:

([;\n]+)
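To see the two capture groups in action, here is a minimal sketch that can be pasted into the browser console. The input string and the variable names are made up for illustration; only the regex comes from this description.

```javascript
// The full regex from above: group 1 is the statement, group 2 the delimiters.
const fullRe = /((?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])*)([;\n]+)/g

// Hypothetical input: two delimited statements, each hiding a ; inside a block.
const input = 'f("a;b");\ng(/c;d/); h'

const matches = [...input.matchAll(fullRe)]
const statements = matches.map((m) => m[1])
const delimiters = matches.map((m) => m[2])

console.log(statements) // [ 'f("a;b")', 'g(/c;d/)' ]
console.log(delimiters) // [ ';\n', ';' ]
```

The ; inside the string and inside the regex literal never reach the second group. Note that the trailing ` h` has no delimiter after it, so this pattern on its own does not match it.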

Note: the reason why this regex isn't broken down into strings is that doing so adds an extra layer of escapes which makes it even harder to reason about, harder to check in external regex apps and leads to worse compression.

The ignore block model allows us to extend this and validate each component separately.

Effectively, an ignore block is a regex that skips the content of the block. For example, if we want to skip the contents of a '' block, we need to match the open quote and the close quote and ignore the contents, including escaped quotes.
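As a rough sketch of validating each component separately, the four ignore blocks can be checked on their own. The `blocks` object, the `firstMatch` helper and the sample inputs are assumptions made for this example, not part of the PR.

```javascript
// Each ignore block as its own regex, copied from the table above.
const blocks = {
  regex: /\/(?:\\\/|[^\/])*\//,
  doubleQuotes: /"(?:\\"|[^\"])*"/,
  singleQuotes: /'(?:\\'|[^'])*'/,
  ticks: /`(?:\\`|[^`])*`/,
}

// Hypothetical helper: returns the first block found in `input`, or null.
const firstMatch = (re, input) => {
  const m = input.match(re)
  return m ? m[0] : null
}

// A ; inside each block type should be swallowed by the block.
const regexSample = firstMatch(blocks.regex, 'x.match(/a;b/); y')
const doubleSample = firstMatch(blocks.doubleQuotes, 'f("a;b"); y')
const singleSample = firstMatch(blocks.singleQuotes, "f('a;b'); y")
const tickSample = firstMatch(blocks.ticks, 'f(`a;b`); y')

// An escaped quote should not end the block early.
const escapedSample = firstMatch(blocks.singleQuotes, "f('it\\'s; ok'); y")

console.log(regexSample, doubleSample, singleSample, tickSample, escapedSample)
```

Each sample should come back as the whole block, delimiters included, with the inner ; untouched.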

That's the theory anyway. You can go mad testing this, just remember that the test string needs to be valid JS! A few times I thought I'd broken it, but the string I was testing would have broken the JS parser too.

I don't expect this to be perfect, but I do expect it to be easy to fix any edge cases we find by design (what I'm aiming for with this).

There might be a combination of nested strings that breaks this.

I've run the test (sort is still failing but it was before) and I've built and tested the todo app.


andersmurphy · 2025-01-21T17:57:24Z

I've made a build of this and am playing around with it in my own app to make sure I haven't missed anything.

Definitely, give this one a few days. I need to sleep on this and test it some more. But hopefully on the right track.

bencroker · 2025-01-21T18:34:14Z

Thanks for this valiant effort! Can you please provide a test string that validates the edge-cases that you’ve tested this against? Better yet would be a Codepen so we can test and share edge-cases.

andersmurphy · 2025-01-21T19:04:41Z

> Thanks for this valiant effort! Can you please provide a test string that validates the edge-cases that you’ve tested this against? Better yet would be a Codepen so we can test and share edge-cases.

Actually thinking of adding some traditional unit tests.

andersmurphy · 2025-01-22T12:46:24Z

I simplified the regex and updated the comment. Split adds all capture groups to the result so we just need one capture group and then filter out '' and ; after trimming. We don't need to filter \n as they get trimmed.
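A small sketch of what split returns before the trim and filter steps, using a made-up input (the variable names `input`, `raw` and `statements` are only for this example):

```javascript
const statementSplitRe = /((?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])*)/gm

// Hypothetical expression with a ; hidden inside a string.
const input = 'a = "x;y"; b = 2'

// split keeps the captured statements and leaves the delimiters between them.
const raw = input.split(statementSplitRe)
console.log(raw) // [ '', 'a = "x;y"', ';', ' b = 2', '' ]

// Trimming and filtering leaves only the statements.
const statements = raw
  .map((s) => s.trim())
  .filter((s) => s !== '' && s !== ';')
console.log(statements) // [ 'a = "x;y"', 'b = 2' ]
```

The captured statements and the ; delimiters alternate in the raw result, which is why the filter only has to drop '' and ;.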

Some test code that can be copied into codepen or the browser console:

var statementSplitRe = /((?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])*)/gm

// Remember: to represent an escaped " inside a '' string you need to use \\" not \"

var testStringSingle = '["fo; o", "\n", \';\', `"fo ; \' \\\' \\" bam"`]map((x) => x.match(/regex;/); "g\noat;" + y'

testStringSingle.split(statementSplitRe)
  .map((s) => s.trim())
  .filter((s) => s !== '' && s !== ';')

// Remember: to represent an escaped ' inside a "" string you need to use \\' not \'

var testStringDouble = "['fo; o', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'g\noat;' + y"

testStringDouble.split(statementSplitRe)
  .map((s) => s.trim())
  .filter((s) => s !== '' && s !== ';')

The key things to remember:

- You are either embedded in a ' string or a " string, but JavaScript regex handles them the same. HTML doesn't support backticks, but under the hood they get represented as either ' or " strings.
- Your string needs to be valid JavaScript.

Update: Made the test cases clearer.

Update: my app is batched and so far so good.

bencroker · 2025-01-22T14:30:36Z

This is looking great!! I believe we might be able to save some bytes using + at the end and using .match instead of .split, as well as removing the non-capturing groups.

This appears to have the same result, although my testing setup is surely more primitive than yours, at this stage.

    const statementSplitRe = /((\/(\\\/|[^\/])*\/|"(\\"|[^\"])*"|'(\\'|[^'])*'|`(\\`|[^`])*`|[^;\n])+)/gm

    const stmts = ctx.value.match(statementSplitRe)
    const lastIdx = stmts.length - 1
    const last = stmts[lastIdx]
    if (!last.startsWith('return')) {
      stmts[lastIdx] = `return (${stmts[lastIdx]});`
    }
    let userExpression = stmts.join(';\n').trim()

andersmurphy · 2025-01-22T14:37:08Z

> This is looking great!! I believe we might be able to save some bytes using + at the end and using .match instead of .split, as well as removing the non-capturing groups.
>
> This appears to have the same result, although my testing setup is surely more primitive than yours, at this stage.
>
>     const statementSplitRe = /((\/(\\\/|[^\/])*\/|"(\\"|[^\"])*"|'(\\'|[^'])*'|`(\\`|[^`])*`|[^;\n])+)/gm
>
>     const stmts = ctx.value.match(statementSplitRe)
>     const lastIdx = stmts.length - 1
>     const last = stmts[lastIdx]
>     if (!last.startsWith('return')) {
>       stmts[lastIdx] = `return (${stmts[lastIdx]});`
>     }
>     let userExpression = stmts.join(';\n').trim()

Ha, that's brilliant! Completely missed that the way capture groups work in split is basically the same as match.

The current regex will still capture white space before a ;\n so I'll need to test if that breaks the subsequent engine code. That would just mean adding the map/trim step back though, so it's still a net improvement over split.

I'll give that a go.
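A quick sketch of the whitespace behaviour in question, using the non-capturing form of the regex and a made-up input:

```javascript
const statementRe = /(?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])+/gm

// match keeps the white space on either side of the ; delimiter.
const raw = '$foo = 1 ; $bar = 2'.match(statementRe)
console.log(raw) // [ '$foo = 1 ', ' $bar = 2' ]

// Adding the map/trim step back removes it, if the engine turns out to need that.
const trimmed = raw.map((s) => s.trim())
console.log(trimmed) // [ '$foo = 1', '$bar = 2' ]
```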

bencroker · 2025-01-22T14:47:03Z

> The current regex will still capture white space before a ;\n so I'll need to test if that breaks the subsequent engine code.

$foo = 1 ; is a valid statement.


andersmurphy · 2025-01-22T19:32:38Z

So I've found one more edge case:

we need to trim before we match if we want to handle users putting white space after the last ; in the string

I removed code that the engine doesn't care about.

The new code (a slight modification of what Ben posted):

const stmts = ctx.value.trim().match(statementRe)
const lastIdx = stmts.length - 1
const last = stmts[lastIdx]
if (!last.startsWith('return')) {
  stmts[lastIdx] = `return (${last});`
}
let userExpression = stmts.join(';')
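To try the snippet above outside the engine, here is a minimal standalone sketch. The `toUserExpression` wrapper, its `value` parameter (standing in for ctx.value) and the guard for a null match result are assumptions added for this example; they are not in the PR.

```javascript
const statementRe = /(?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])+/gm

// Hypothetical wrapper; `value` stands in for ctx.value.
function toUserExpression(value) {
  // match returns null when nothing matches (e.g. an empty string), so guard.
  const stmts = value.trim().match(statementRe) || []
  if (stmts.length === 0) return ''
  const lastIdx = stmts.length - 1
  const last = stmts[lastIdx]
  if (!last.startsWith('return')) {
    stmts[lastIdx] = `return (${last});`
  }
  return stmts.join(';')
}

// Without trim, white space after the last ; becomes its own "statement".
const untrimmed = '$foo = 1; $foo + 1;   '.match(statementRe)
console.log(untrimmed) // [ '$foo = 1', ' $foo + 1', '   ' ]

const simple = toUserExpression('$foo = 1; $foo + 1')
const trailing = toUserExpression('$foo = 1; $foo + 1;   ')
const nested = toUserExpression('x.match(/a;b/)')
const empty = toUserExpression('')

console.log(simple)   // $foo = 1;return ( $foo + 1);
console.log(trailing) // $foo = 1;return ( $foo + 1);
console.log(nested)   // return (x.match(/a;b/));
```

With the trim in place, the trailing-space input produces the same output as the clean one, and the ; inside the regex literal is left alone.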

Not sure if you squash merge, but if you do the last commit has the latest info in the commit message for use in blames (if that's something you care about). Copy of that message here if we just use the GitHub PR ref.

Support nested regex and strings in expressions #508

/(?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])+/gm

This regex allows Datastar expressions to support nested
regex and strings that contain ; and/or \n without breaking.

Each of these regex defines a block type we want to capture:

regex            \/(?:\\\/|[^\/])*\/
double quotes     "(?:\\"|[^\"])*"
single quotes     '(?:\\'|[^'])*'
ticks             `(?:\\`|[^`])*`

We want to capture the non delimiter part of statements too:

[^;\n]

The regex above will not work in regex101.com as javascript
regex handles single and double quotes for us.

The test cases below can be pasted into the developer console to
check the regex.

Note we need the trim, before we match.

var statementRe = /(?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])+/gm

var testStringSingle = '["foo", "\n", \';\', `"fo ; \' \\\' \\" bam"`]map((x) => x.match(/regex;/); "goat;" + y'

testStringSingle.match(statementRe)

var testStringDouble = "['foo', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'goat;' + y"

testStringDouble.match(statementRe)

// trailing space after a semicolon breaks this so we need to add trim
var testStringDoubleTrailingSpace = "['foo', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'goat;' + y;   "

testStringDoubleTrailingSpace.match(statementRe)

testStringDoubleTrailingSpace.trim().match(statementRe)

bencroker · 2025-01-22T22:23:18Z

So are the non-capturing groups required, after all?


andersmurphy · 2025-01-23T11:07:48Z

@bencroker So removing the non-capture groups works and makes the code easier to read, but it compresses to a larger bundle:

Green numbers are with the non-capture groups; red numbers are without. I imagine gzip already has a token for (?: or there's some other sharing.

Still I think the readability improvement is a win.

delaneyj · 2025-01-22T23:28:29Z

library/src/engine/engine.ts

    }
-    let userExpression = stmts.join(';\n').trim()
+    let userExpression = stmts.join(';')


Can you ELI5 why we don't put the newlines in? Especially for debugger errors from expanded expression
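For illustration only, this is what the two separators produce for a made-up list of statements. JavaScript engines generally report a line and column with errors, so one statement per line can make a reported line easier to map back to a statement; that is a general observation and was not measured against the engine here.

```javascript
// Hypothetical statements, already split and with the final return added.
const stmts = ['$foo = 1', '$bar = $foo + 1', 'return ($bar);']

const oneLine = stmts.join(';')
const multiLine = stmts.join(';\n')

console.log(oneLine)
// $foo = 1;$bar = $foo + 1;return ($bar);

console.log(multiLine)
// $foo = 1;
// $bar = $foo + 1;
// return ($bar);

console.log(oneLine.split('\n').length)   // 1
console.log(multiLine.split('\n').length) // 3
```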

bencroker · 2025-01-23T14:12:09Z

Merged, amazing work @andersmurphy, thanks!

andersmurphy requested review from delaneyj and bencroker as code owners January 21, 2025 17:55

Simplify regex

4f06f85

Fix typo

ec05899

delaneyj and others added 3 commits January 22, 2025 16:54

build the bundle

ec262d0

Removed the ignore capture groups as they are not needed with match

6ffe3c7

Merge branch 'develop' into support-nested-regex-and-strings-in-expre…

d568632

…ssions

Use newline as separator for more readable error messages

74798d3

delaneyj approved these changes Jan 23, 2025

bencroker merged commit 451b62e into starfederation:develop Jan 23, 2025
1 check failed
