fix: update dotted_statement definition, pattern_expression, decimal_literal, adjust grammar rules, logging, scanner updates, test updates #13
This PR contains changes made by both @davem-intersys and me to fix the following:
ISSUES FIXED:
• decimal_literal logic led to infinite looping, which made the core grammar not work in the tree-sitter playground. This is now fixed.
• dotted statements were allowed to match from any point in the line; now they can only match at the start of a line.
CHANGES:
• Dotted Statements: added logic so these can only be matched at the start of a line
◦ Since these should only be matched at the start of a line, I moved this logic to the external scanner. I used an external token that fires at BOL and consumes any leading dots, then hands control back to the normal statement grammar. This also simplifies things: because the external scanner eats the leading . characters, . can safely belong to numeric literals elsewhere without colliding with dotted indentation.
• decimal_literal/numeric_literal/integer_literal: fixed the infinite loop in decimal_literal
◦ Removed decimal_literal and integer_literal, and defined the whole thing as a numeric_literal. This is probably not strictly necessary, but it was done to lower the number of lexer states.
◦ The old decimal_literal broke because it created a zero-width token: when there was no exponent, token.immediate() matched the empty string, so it succeeded without advancing, causing an infinite loop.
• pattern_operator/pattern_expression: adjusting this led to the largest decrease in lexer states and made tree-sitter build --wasm work again (it was previously running into memory issues)
◦ Switched from token() to token.immediate(). Adjusted the regex so whitespace is still allowed before/after the ? (I just allowed \t \n); this seems to work (tested with spaces and newlines in between).
• scanner.c (serialize/deserialize):
◦ added logic to serialize so the scanner is restorable, so incremental parsing works as expected (will be used in language server impl)
◦ added logic to deserialize that protects against corrupt/older serialized blobs and ensures the scanner never reads past the provided buffer.
• Dave added debug logging and added logic to handle EOF after a single space. Additionally, he added logic to treat EOF as argumentless
• Combined common regex into constants (IDENT_SEG, DOTTED_ID_STRICT, DOTTED_ID_RELAXED) that could be referenced from common/identifiers.js for easier readability. Then, made rules that could be reused (and are referenced many times)
• added a new token to the scanner so the spaces parsed correctly for if statements. It was matching the wrong token before, leading to errors.
• Lastly, I changed a few definitions in the grammars, mostly to make the rules stricter.
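For reviewers, the scanner.c serialize/deserialize item above can be pictured roughly as follows. This is only a minimal sketch, not the code in this PR: the ScannerState struct, its two fields, the version byte, and the helper names state_serialize/state_deserialize are all hypothetical. In a real scanner, helpers like these would be called from tree-sitter's external scanner serialize and deserialize entry points.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scanner state; the real fields in scanner.c may differ. */
typedef struct {
    bool    at_line_start; /* assumed flag: next token would begin a line */
    uint8_t dot_depth;     /* assumed counter: leading dots consumed on this line */
} ScannerState;

/* Assumed blob layout: [version][at_line_start][dot_depth]. */
enum { STATE_VERSION = 1, STATE_SIZE = 3 };

/* Write the state into buffer. Returns the number of bytes written,
 * or 0 if the buffer is too small to hold the blob. */
static unsigned state_serialize(const ScannerState *s, char *buffer, unsigned capacity) {
    if (capacity < STATE_SIZE) return 0;
    buffer[0] = (char)STATE_VERSION;
    buffer[1] = (char)(s->at_line_start ? 1 : 0);
    buffer[2] = (char)s->dot_depth;
    return STATE_SIZE;
}

/* Restore the state. Always resets to defaults first, then only reads the
 * blob if its length and version match, so it never reads past `length`
 * and silently ignores corrupt or older blobs. */
static void state_deserialize(ScannerState *s, const char *buffer, unsigned length) {
    s->at_line_start = true;
    s->dot_depth = 0;
    if (buffer == NULL || length != STATE_SIZE) return;
    if ((uint8_t)buffer[0] != STATE_VERSION) return;
    s->at_line_start = buffer[1] != 0;
    s->dot_depth = (uint8_t)buffer[2];
}
```

Resetting to defaults before validating means a bad blob degrades to a fresh scanner rather than undefined state, which is the behaviour incremental parsing needs.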
TESTING
• tree-sitter build --wasm and tree-sitter playground
• tree-sitter test: