@hkimura-intersys hkimura-intersys commented Oct 21, 2025

This PR contains changes made by both @davem-intersys and me to fix the following:

ISSUES FIXED:

  1. The decimal_literal logic caused an infinite loop, which broke the core grammar in tree-sitter playground. This is now fixed.
  2. Dotted statements were allowed to match at any point in a line; now they can only match at the start of a line.
  3. After decimal_literal was fixed, the lexer's DFA grew sharply. This led to an out-of-memory error on tree-sitter build --wasm for the udl grammar, which was fixed by some of the changes detailed below.
  4. Added debug logging and logic to handle EOF in the scanner.
  5. The if statements weren't parsing correctly in the core grammar.

CHANGES:

Dotted Statements: added logic so these can only be matched at the start of a line
◦ Since these should only be matched at the start of a line, I moved this logic to the external scanner. I used an external token that fires at BOL, consumes any leading dots, and then hands control back to the normal statement grammar. This also simplifies things: because the external scanner eats the leading . characters, . can safely belong to numeric literals elsewhere without colliding with dotted indentation.
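To illustrate the beginning-of-line rule, here is a minimal sketch in plain JavaScript. It is not the actual scanner.c code: the function name scanDottedPrefix, the returned fields, and the assumption that dots may be separated by spaces or tabs are all hypothetical.

```javascript
// Hypothetical model of the external token: it only fires at the beginning
// of a line and consumes leading dots (plus the blanks in front of them).
function scanDottedPrefix(text, pos) {
  // Refuse to match unless pos is at the start of the input or just after a newline.
  const atLineStart = pos === 0 || text[pos - 1] === "\n";
  if (!atLineStart) return null;

  let depth = 0; // number of dots consumed
  let end = pos; // index just past the last consumed dot
  let i = pos;
  for (;;) {
    let j = i;
    // Assumption: spaces and tabs may precede each dot.
    while (j < text.length && (text[j] === " " || text[j] === "\t")) j++;
    if (j < text.length && text[j] === ".") {
      depth += 1;
      i = j + 1;
      end = i;
    } else {
      break;
    }
  }
  // No dots means this line is not a dotted statement.
  return depth === 0 ? null : { depth, end };
}
```

Because the check on pos rejects anything that is not at a line start, a . in the middle of a line (for example in a numeric literal) never reaches this code path in the sketch.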
decimal_literal/numeric_literal/integer_literal: fix infinite loop in decimal_literal
◦ Removed decimal_literal and integer_literal, and defined the whole thing as a single numeric_literal. This is probably not necessary, but it was done in an effort to lower the number of lexer states.
◦ The old decimal_literal broke because it created a zero-width token: when there was no exponent, token.immediate() matched the empty string, so it succeeded without advancing, causing an infinite loop.
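The zero-width failure mode can be reproduced outside tree-sitter. The sketch below uses plain JavaScript sticky regexes rather than tree-sitter's lexer, and the pattern names and helper functions are hypothetical stand-ins, not the grammar's actual rules. It shows why a pattern whose only content is optional can "succeed" without consuming input.

```javascript
// Hypothetical stand-ins for illustration only.
// The whole group is optional, so this pattern also matches the empty string.
const OPTIONAL_EXPONENT = /(?:E[+-]?\d+)?/y;
// This variant must consume at least two characters to match.
const REQUIRED_EXPONENT = /E[+-]?\d+/y;

// Length of the match anchored at pos, or -1 if there is no match.
function matchLengthAt(regex, input, pos) {
  regex.lastIndex = pos;
  const match = regex.exec(input);
  return match === null ? -1 : match[0].length;
}

// Toy lexer loop: keeps applying one pattern and reports why it stopped.
// A real loop without the zero-length check would spin forever on "stalled".
function lexAll(regex, input) {
  let pos = 0;
  let tokens = 0;
  while (pos < input.length) {
    const len = matchLengthAt(regex, input, pos);
    if (len < 0) return { tokens, pos, status: "no-match" };
    if (len === 0) return { tokens, pos, status: "stalled" };
    pos += len;
    tokens += 1;
  }
  return { tokens, pos, status: "done" };
}
```

With the optional pattern, lexAll stops with status "stalled" as soon as the next character is not an exponent, because the match succeeds with length zero. With the required pattern the same input yields "no-match", which lets a lexer try a different token instead of looping.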
pattern_operator/pattern_expression: adjusting this led to the largest decrease in lexer states, and made tree-sitter build --wasm work again (previously it was running into memory issues)
◦ Switched from token() to token.immediate(). Adjusted the regex so whitespace is still allowed before/after the ? (I just allowed \t \n); this seems to work (tested with spaces and newlines in between).
scanner.c (serialize/deserialize):
◦ Added logic to serialize so the scanner state is restorable and incremental parsing works as expected (this will be used in the language server implementation).
◦ Added logic to deserialize that protects against corrupt or older serialized blobs and ensures the scanner never reads past the provided buffer.
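The defensive pattern described for serialize/deserialize can be sketched as follows. The real hooks are C functions in scanner.c; this JavaScript version only mirrors the logic, and the state fields (atLineStart, dotDepth), the version byte, and the three-byte layout are assumptions made for illustration.

```javascript
// Hypothetical blob layout: [version, atLineStart flag, dotDepth].
const FORMAT_VERSION = 1;
const BLOB_SIZE = 3;

function defaultState() {
  return { atLineStart: true, dotDepth: 0 };
}

// Writes the state into buffer and returns the number of bytes written.
// Returns 0 (nothing stored) if the buffer cannot hold the whole blob.
function serialize(state, buffer) {
  if (buffer.length < BLOB_SIZE) return 0;
  buffer[0] = FORMAT_VERSION;
  buffer[1] = state.atLineStart ? 1 : 0;
  buffer[2] = Math.min(state.dotDepth, 255); // keep the value within one byte
  return BLOB_SIZE;
}

// Restores state from the first `length` bytes of buffer.
// Any blob that is too short or has an unknown version leaves the defaults in place.
function deserialize(state, buffer, length) {
  Object.assign(state, defaultState());
  const usable = Math.min(length, buffer.length); // never read past the buffer
  if (usable < BLOB_SIZE) return;
  if (buffer[0] !== FORMAT_VERSION) return;
  state.atLineStart = buffer[1] === 1;
  state.dotDepth = buffer[2];
}
```

Resetting to defaults before reading anything means a truncated or stale blob degrades to a fresh scanner rather than to a partially restored one.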
• Dave added debug logging and logic to handle EOF after a single space. Additionally, he added logic to treat EOF as argumentless.
• Combined common regexes into constants (IDENT_SEG, DOTTED_ID_STRICT, DOTTED_ID_RELAXED) that can be referenced from common/identifiers.js for easier readability. Then made reusable rules from them (these are referenced many times).
• Added a new token to the scanner so that spaces are parsed correctly for if statements. It was matching the wrong token before, leading to errors.
• Lastly, I changed a few definitions in the grammars, mostly to make rules stricter.
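As a rough illustration of how shared regex constants such as IDENT_SEG and DOTTED_ID_STRICT can be composed, here is a sketch. The pattern bodies are guesses for demonstration, not the definitions in common/identifiers.js, and fullMatch is a hypothetical helper. DOTTED_ID_RELAXED is left out because the description does not say how it differs.

```javascript
// Assumed shape of one identifier segment: a letter or %, then letters/digits.
const IDENT_SEG = "[A-Za-z%][A-Za-z0-9]*";

// Strict dotted identifier: segments joined by single dots, with no leading,
// trailing, or doubled dots.
const DOTTED_ID_STRICT = `${IDENT_SEG}(?:\\.${IDENT_SEG})*`;

// Hypothetical helper: true only if the whole text matches the pattern source.
function fullMatch(source, text) {
  return new RegExp(`^(?:${source})$`).test(text);
}
```

Keeping the segment pattern in one constant means every rule built from it changes together when the segment definition changes.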

TESTING

  1. I tested that, with these changes, both the core and udl grammars work with tree-sitter build --wasm and tree-sitter playground.
  2. I tested that pattern_expression still works with whitespace (spaces and newlines around the ?):
image
  3. I tested that dotted_statement still works:
image
  4. I tested each grammar with tree-sitter test:
image

@hkimura-intersys hkimura-intersys self-assigned this Oct 21, 2025

gjsjohnmurray commented Oct 22, 2025

I'm hoping this will resolve #8 and #9

hkimura-intersys commented

> I'm hoping this will resolve #8 and #9

With this change, running tree-sitter test no longer gets stuck, and tree-sitter playground also works for the core grammar.

@hkimura-intersys hkimura-intersys changed the title fix: fix dotted_statement definition, pattern_expression, decimal_literal, tighten grammar rules, add logging and EOF logic to scanner fix: fix dotted_statement definition, pattern_expression, decimal_literal, adjust grammar rules, logging, scanner updates, test updates Oct 23, 2025
@hkimura-intersys hkimura-intersys changed the title fix: fix dotted_statement definition, pattern_expression, decimal_literal, adjust grammar rules, logging, scanner updates, test updates fix: update dotted_statement definition, pattern_expression, decimal_literal, adjust grammar rules, logging, scanner updates, test updates Oct 23, 2025
@davem-intersys davem-intersys merged commit 68005c6 into intersystems:main Oct 24, 2025