-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support nested regex and strings in expressions #508
Support nested regex and strings in expressions #508
Conversation
This regex allows Datastar expressions to support nested regex and strings that contain ; and/or \n without breaking. Full regex for testing on regex101.com ``` ((?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])*)([;\n]+) ``` Only valid ; and \n should be in the second capture group Each of these regex ignore a block type: ``` regex \/(?:\\\/|[^\/])*\/ double quotes "(?:\\"|[^\"])*" single quotes '(?:\\'|[^'])*' ticks `(?:\\`|[^`])*` ``` We want to ignore the non delimiter part of statements too: `[^;\n]` Once all the blocks we want to ignore are captured we then find the statement delimiters in the second capture group: `([;\n]+)` Note: the reason why this regex isn't broken down into strings is that doing so adds an extra layer of escapes which makes it even harder to reason about, harder to check in external regex apps and leads to worse compression. The ignore block model allows us to extend this and validate each component separately. Effectively, an ignore block is a regex that skips the content of the block. For example we want to skip the contents of a `''` block then we need to match the open quote and the close quote and ignore the contents including escaped quotes. That's the theory anyway. You can go mad testing this, just remember the it needs to be valid JS! A few times I thought I'd broken it and it would have broken the JS parser. I don't expect this to be perfect, but I do expect it to be easy to fix any edge cases we find by design (what I'm aiming for with this). There might be a combination of nested strings that break this. I've run the test (sort is still failing but it was before) and I've built and tested the todo app.
I've made a build of this and playing around with is in my own app. To make sure I haven't missed anything. Definitely, give this one a few days. I need to sleep on this and test it some more. But hopefully on the right track. |
Thanks for this valiant effort! Can you please provide a test string that validates the edge-cases that you’ve tested this against? Better yet would be a Codepen so we can test and share edge-cases. |
Actually thinking of adding some traditional unit tests |
I simplified the regex and updated the comment. Split adds all capture groups to the result so we just need one capture group and then filter out '' and ; after trimming. We don't need to filter \n as they get trimmed. Some test code that can be copied into codepen or the browser console var statementSplitRe = /((?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])*)/gm
// Remember to represent an escaped " in a '' you need to use \\" not \"
var testStringSingle = '["fo; o", "\n", \';\', `"fo ; \' \\\' \\" bam"`]map((x) => x.match(/regex;/); "g\noat;" + y'
testStringSingle.split(statementSplitRe)
.map((s) => s.trim())
.filter((s) => s !== '' && s !== ';')
// Remember to represent an escaped ' in a "" you need to use \\' not \'
var testStringDouble = "['fo; o', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'g\noat;' + y"
testStringDouble.split(statementSplitRe)
.map((s) => s.trim())
.filter((s) => s !== '' && s !== ';') The key thing is to remember:
Update: Made the test cases clearer. Update: my app is batched and so far so good. |
This is looking great!! I believe we might be able to save some bytes using This appears to have the same result, although my testing setup is surely more primitive than yours, at this stage. const statementSplitRe = /((\/(\\\/|[^\/])*\/|"(\\"|[^\"])*"|'(\\'|[^'])*'|`(\\`|[^`])*`|[^;\n])+)/gm
const stmts = ctx.value.match(statementSplitRe)
const lastIdx = stmts.length - 1
const last = stmts[lastIdx]
if (!last.startsWith('return')) {
stmts[lastIdx] = `return (${stmts[lastIdx]});`
}
let userExpression = stmts.join(';\n').trim() |
Ha, that's brilliant! Completely missed that the way capture groups work in split is basically the same as match. The current regex will still capture white space before a I'll give that a go. |
|
/(?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])+/gm This regex allows Datastar expressions to support nested regex and strings that contain ; and/or \n without breaking. Each of these regex defines a block type we want to capture: regex \/(?:\\\/|[^\/])*\/ double quotes "(?:\\"|[^\"])*" single quotes '(?:\\'|[^'])*' ticks `(?:\\`|[^`])*` We want to capture the non delimiter part of statements too: [^;\n] The regex above will not work in regex101.com as javascript regex handles single and double quotes for us. The test cases bellow can be pasted into the developer console to check the regex. Note we need the `trim`, before we `match`. ``` var statementRe = /(?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])+/gm var testStringSingle = '["foo", "\n", \';\', `"fo ; \' \\\' \\" bam"`]map((x) => x.match(/regex;/); "goat;" + y' testStringSingle.match(statementRe) var testStringDouble = "['foo', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'goat;' + y" testStringDouble.match(statementRe) // trailing space after a colon breaks this so we need to add trim var testStringDoubleTrailingSpace = "['foo', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'goat;' + y; " testStringDoubleTrailingSpace.match(statementRe) testStringDoubleTrailingSpace.trim().match(statementRe) ``` p
So I've found one more edge case:
I removed code that the engine doesn't care about. What the new code is (slight modification of what ben posted): const stmts = ctx.value.trim().match(statementRe)
const lastIdx = stmts.length - 1
const last = stmts[lastIdx]
if (!last.startsWith('return')) {
stmts[lastIdx] = `return (${last});`
}
let userExpression = stmts.join(';') Not sure if you squash merge, but if you do the last commit has the latest info in the commit message for use in blames (if that's something you care about). Copy of that message here if we just use the GitHub PR ref. Support nested regex and strings in expressions #508
This regex allows Datastar expressions to support nested Each of these regex defines a block type we want to capture:
We want to capture the non delimiter part of statements too:
The regex above will not work in regex101.com as javascript The test cases bellow can be pasted into the developer console to Note we need the var statementRe = /(?:\/(?:\\\/|[^\/])*\/|"(?:\\"|[^\"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`|[^;\n])+/gm
var testStringSingle = '["foo", "\n", \';\', `"fo ; \' \\\' \\" bam"`]map((x) => x.match(/regex;/); "goat;" + y'
testStringSingle.match(statementRe)
var testStringDouble = "['foo', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'goat;' + y"
testStringDouble.match(statementRe)
// trailing space after a colon breaks this so we need to add trim
var testStringDoubleTrailingSpace = "['foo', \"\\n\", ';',`'fo ; \" \\\" \\' bam'`]map((x) => x.match(/regex;/)); 'goat;' + y; "
testStringDoubleTrailingSpace.match(statementRe)
testStringDoubleTrailingSpace.trim().match(statementRe) |
So are the non-capturing groups required, after all? |
@bencroker So removing the non capture group works. Makes the code easier to read. But compressed to a larger bundle: ![]() Green numbers are with the non capture groups red numbers are without. I imagine gzip already has a token for Still I think the readability improvement is a win. |
library/src/engine/engine.ts
Outdated
} | ||
let userExpression = stmts.join(';\n').trim() | ||
let userExpression = stmts.join(';') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you ELI5 why we don't put the newlines in? Especially for debugger errors from expanded expression
Merged, amazing work @andersmurphy, thanks! |
This regex allows Datastar expressions to support nested regex and strings that contain ; and/or \n without breaking.
Full regex for testing on regex101.com
Only valid ; and \n should be in the second capture group Each of these regex ignore a block type:
We want to ignore the non delimiter part of statements too:
[^;\n]
Once all the blocks we want to ignore are captured we then find the statement delimiters in the second capture group:
([;\n]+)
Note: the reason why this regex isn't broken down into strings is that doing so adds an extra layer of escapes which makes it even harder to reason about, harder to check in external regex apps and leads to worse compression.
The ignore block model allows us to extend this and validate each component separately.
Effectively, an ignore block is a regex that skips the content of the block. For example we want to skip the contents of a
''
block then we need to match the open quote and the close quote and ignore the contents including escaped quotes.That's the theory anyway. You can go mad testing this, just remember the it needs to be valid JS! A few times I thought I'd broken it and it would have broken the JS parser.
I don't expect this to be perfect, but I do expect it to be easy to fix any edge cases we find by design (what I'm aiming for with this).
There might be a combination of nested strings that break this.
I've run the test (sort is still failing but it was before) and I've built and tested the todo app.