Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support nested regex and strings in expressions #508

Conversation

andersmurphy
Copy link
Contributor

This regex allows Datastar expressions to support nested regex and strings that contain ; and/or \n without breaking.

Full regex for testing on regex101.com

((?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])*)([;\n]+)

Only valid ; and \n should be in the second capture group Each of these regex ignore a block type:

regex            \/(?:\\\/|[^\/])*\/
double quotes     "(?:\\"|[^\"])*"
single quotes     '(?:\\'|[^'])*'
ticks             `(?:\\`|[^`])*`

We want to ignore the non delimiter part of statements too:

[^;\n]

Once all the blocks we want to ignore are captured we then find the statement delimiters in the second capture group:

([;\n]+)

Note: the reason why this regex isn't broken down into strings is that doing so adds an extra layer of escapes which makes it even harder to reason about, harder to check in external regex apps and leads to worse compression.

The ignore block model allows us to extend this and validate each component separately.

Effectively, an ignore block is a regex that skips the content of the block. For example we want to skip the contents of a '' block then we need to match the open quote and the close quote and ignore the contents including escaped quotes.

That's the theory anyway. You can go mad testing this, just remember the it needs to be valid JS! A few times I thought I'd broken it and it would have broken the JS parser.

I don't expect this to be perfect, but I do expect it to be easy to fix any edge cases we find by design (what I'm aiming for with this).

There might be a combination of nested strings that break this.

I've run the test (sort is still failing but it was before) and I've built and tested the todo app.

This regex allows Datastar expressions to support nested
regex and strings that contain ; and/or \n without breaking.

Full regex for testing on regex101.com

```
((?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])*)([;\n]+)
```

Only valid ; and \n should be in the second capture group Each of these regex ignore a block type:

```
regex            \/(?:\\\/|[^\/])*\/
double quotes     "(?:\\"|[^\"])*"
single quotes     '(?:\\'|[^'])*'
ticks             `(?:\\`|[^`])*`
```
We want to ignore the non delimiter part of statements too:

`[^;\n]`

Once all the blocks we want to ignore are captured we then find
the statement delimiters in the second capture group:

`([;\n]+)`

Note: the reason why this regex isn't broken down into strings is that
doing so adds an extra layer of escapes which makes it even harder to
reason about, harder to check in external regex apps and leads to
worse compression.

The ignore block model allows us to extend this and validate each
component separately.

Effectively, an ignore block is a regex that skips the content of the
block. For example we want to skip the contents of a `''` block then
we need to match the open quote and the close quote and ignore the
contents including escaped quotes.

That's the theory anyway. You can go mad testing this, just remember
the it needs to be valid JS! A few times I thought I'd broken
it and it would have broken the JS parser.

I don't expect this to be perfect, but I do expect it to be easy to
fix any edge cases we find by design (what I'm aiming for with this).

There might be a combination of nested strings that break this.

I've run the test (sort is still failing but it was before) and
I've built and tested the todo app.
@andersmurphy
Copy link
Contributor Author

andersmurphy commented Jan 21, 2025

I've made a build of this and playing around with is in my own app. To make sure I haven't missed anything.

Definitely, give this one a few days. I need to sleep on this and test it some more. But hopefully on the right track.

@bencroker
Copy link
Collaborator

Thanks for this valiant effort! Can you please provide a test string that validates the edge-cases that you’ve tested this against? Better yet would be a Codepen so we can test and share edge-cases.

@andersmurphy
Copy link
Contributor Author

Thanks for this valiant effort! Can you please provide a test string that validates the edge-cases that you’ve tested this against? Better yet would be a Codepen so we can test and share edge-cases.

Actually thinking of adding some traditional unit tests

@andersmurphy
Copy link
Contributor Author

andersmurphy commented Jan 22, 2025

I simplified the regex and updated the comment. Split adds all capture groups to the result so we just need one capture group and then filter out '' and ; after trimming. We don't need to filter \n as they get trimmed.

Some test code that can be copied into codepen or the browser console

var statementSplitRe = /((?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])*)/gm

// Remember to represent an escaped " in a '' you need to use \\" not \"

var testStringSingle = '["fo; o", "\n", \';\', `"fo ; \' \\\' \\" bam"`]map((x) => x.match(/regex;/); "g\noat;" + y'

testStringSingle.split(statementSplitRe)
          .map((s) => s.trim())
          .filter((s) => s !== '' && s !== ';')

// Remember to represent an escaped ' in a "" you need to use \\' not \'

var testStringDouble = "['fo; o', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'g\noat;' + y"

testStringDouble.split(statementSplitRe)
          .map((s) => s.trim())
  .filter((s) => s !== '' && s !== ';') 

The key thing is to remember:

  • you are either embedded in a ' string or a " string but javascript regex handles them the same. Html doesn't support backtick but under the hood they get represented as either ' or " strings.
  • your string needs to be valid Javascript

Update: Made the test cases clearer.

Update: my app is batched and so far so good.

@bencroker
Copy link
Collaborator

This is looking great!! I believe we might be able to save some bytes using + at the end and using .match instead of .split, as well as removing the non-capturing groups.

This appears to have the same result, although my testing setup is surely more primitive than yours, at this stage.

    const statementSplitRe = /((\/(\\\/|[^\/])*\/|"(\\"|[^\"])*"|'(\\'|[^'])*'|`(\\`|[^`])*`|[^;\n])+)/gm

    const stmts = ctx.value.match(statementSplitRe)
    const lastIdx = stmts.length - 1
    const last = stmts[lastIdx]
    if (!last.startsWith('return')) {
      stmts[lastIdx] = `return (${stmts[lastIdx]});`
    }
    let userExpression = stmts.join(';\n').trim()

@andersmurphy
Copy link
Contributor Author

This is looking great!! I believe we might be able to save some bytes using + at the end and using .match instead of .split, as well as removing the non-capturing groups.

This appears to have the same result, although my testing setup is surely more primitive than yours, at this stage.

    const statementSplitRe = /((\/(\\\/|[^\/])*\/|"(\\"|[^\"])*"|'(\\'|[^'])*'|`(\\`|[^`])*`|[^;\n])+)/gm

    const stmts = ctx.value.match(statementSplitRe)
    const lastIdx = stmts.length - 1
    const last = stmts[lastIdx]
    if (!last.startsWith('return')) {
      stmts[lastIdx] = `return (${stmts[lastIdx]});`
    }
    let userExpression = stmts.join(';\n').trim()

Ha, that's brilliant! Completely missed that the way capture groups work in split is basically the same as match.

The current regex will still capture white space before a ;\n so I'll need to test if that breaks the subsequent engine code. That would just mean adding the maptrim back though so still a net improvement over split.

I'll give that a go.

@bencroker
Copy link
Collaborator

The current regex will still capture white space before a ;\n so I'll need to test if that breaks the subsequent engine code.

$foo = 1 ; is a valid statement.

/(?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])+/gm

This regex allows Datastar expressions to support nested
regex and strings that contain ; and/or \n without breaking.

Each of these regex defines a block type we want to capture:

regex            \/(?:\\\/|[^\/])*\/
double quotes     "(?:\\"|[^\"])*"
single quotes     '(?:\\'|[^'])*'
ticks             `(?:\\`|[^`])*`

We want to capture the non delimiter part of statements too:

[^;\n]

The regex above will not work in regex101.com as javascript
regex handles single and double quotes for us.

The test cases bellow can be pasted into the developer console to
check the regex.

Note we need the `trim`, before we `match`.

```
var statementRe = /(?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])+/gm

var testStringSingle = '["foo", "\n", \';\', `"fo ; \' \\\' \\" bam"`]map((x) => x.match(/regex;/); "goat;" + y'

testStringSingle.match(statementRe)

var testStringDouble = "['foo', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'goat;' + y"

testStringDouble.match(statementRe)

// trailing space after a colon breaks this so we need to add trim
var testStringDoubleTrailingSpace = "['foo', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'goat;' + y;   "

testStringDoubleTrailingSpace.match(statementRe)

testStringDoubleTrailingSpace.trim().match(statementRe)
```

p
@andersmurphy
Copy link
Contributor Author

andersmurphy commented Jan 22, 2025

So I've found one more edge case:

  • we need to trim before we match if we want to handle users putting white space after the last ; in the string

I removed code that the engine doesn't care about.

What the new code is (slight modification of what ben posted):

const stmts = ctx.value.trim().match(statementRe)
const lastIdx = stmts.length - 1
const last = stmts[lastIdx]
if (!last.startsWith('return')) {
  stmts[lastIdx] = `return (${last});`
}
let userExpression = stmts.join(';')

Not sure if you squash merge, but if you do the last commit has the latest info in the commit message for use in blames (if that's something you care about). Copy of that message here if we just use the GitHub PR ref.

Support nested regex and strings in expressions #508

/(?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])+/gm

This regex allows Datastar expressions to support nested
regex and strings that contain ; and/or \n without breaking.

Each of these regex defines a block type we want to capture:

regex            \/(?:\\\/|[^\/])*\/
double quotes     "(?:\\"|[^\"])*"
single quotes     '(?:\\'|[^'])*'
ticks             `(?:\\`|[^`])*`

We want to capture the non delimiter part of statements too:

[^;\n]

The regex above will not work in regex101.com as javascript
regex handles single and double quotes for us.

The test cases bellow can be pasted into the developer console to
check the regex.

Note we need the trim, before we match.

var statementRe = /(?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])+/gm

var testStringSingle = '["foo", "\n", \';\', `"fo ; \' \\\' \\" bam"`]map((x) => x.match(/regex;/); "goat;" + y'

testStringSingle.match(statementRe)

var testStringDouble = "['foo', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'goat;' + y"

testStringDouble.match(statementRe)

// trailing space after a colon breaks this so we need to add trim
var testStringDoubleTrailingSpace = "['foo', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'goat;' + y;   "

testStringDoubleTrailingSpace.match(statementRe)

testStringDoubleTrailingSpace.trim().match(statementRe)

@bencroker
Copy link
Collaborator

So are the non-capturing groups required, after all?

@andersmurphy
Copy link
Contributor Author

andersmurphy commented Jan 23, 2025

@bencroker So removing the non capture group works. Makes the code easier to read. But compressed to a larger bundle:

Screenshot 2025-01-23 at 11 04 37

Green numbers are with the non capture groups red numbers are without. I imagine gzip already has a token for (?: or there's some other sharing.

Still I think the readability improvement is a win.

}
let userExpression = stmts.join(';\n').trim()
let userExpression = stmts.join(';')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you ELI5 why we don't put the newlines in? Especially for debugger errors from expanded expression

@bencroker bencroker merged commit 451b62e into starfederation:develop Jan 23, 2025
1 check failed
@bencroker
Copy link
Collaborator

Merged, amazing work @andersmurphy, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants