Lua lexer #3

Open
jisbruzzi opened this issue Jan 17, 2022 · 1 comment
Labels: example (User-provided demo)

Comments

@jisbruzzi

jisbruzzi commented Jan 17, 2022

Thank you so much for leac; its API is very elegant and easy to use.

I'm creating the new assignments for the compilers class at the College of Engineering of the University of Buenos Aires (UBA). The students might use the following Lua lexer, which I want to share with you in case you find it useful. If I find the time, I'll write some more tests and add an example in a PR.

import { createLexer, Rule, Rules } from "https://deno.land/x/leac/leac.ts";

const symbols = [
  ";",
  "=",
  ",",
  "::",
  ".",
  "[",
  "]",
  "...",
  "(",
  ")",
  ":",
  "{",
  "}",
];

const keywords = [
  "break",
  "goto",
  "do",
  "end",
  "while",
  "do",
  "repeat",
  "until",
  "if",
  "then",
  "elseif",
  "else",
  "for",
  "in",
  "function",
  "local",
  "return",
  "nil",
  "false",
  "true",
];

const ops = [
  "+",
  "-",
  "*",
  "/",
  "//",
  "^",
  "%",
  "&",
  "~",
  "|",
  ">>",
  "<<",
  "..",
  "<",
  "<=",
  ">",
  ">=",
  "==",
  "~=",
  "and",
  "or",
  "#",
  "not",
];
// Inner lexers for quoted string content; they are pushed after an opening
// quote is matched and popped when the matching closing quote is found.
const doublequoteStringLexer = createLexer([
  {
    name: "stringContent",
    regex: /(?:\\["abfnrtv/\\nz]|\\u[a-fA-F0-9]{4}|[^"\\\n])*/,
  },
  {
    name: "LiteralStringEnd",
    str: '"',
    pop: true,
    discard: true,
  },
]);
const singlequoteStringLexer = createLexer([
  {
    name: "stringContent",
    regex: /(?:\\['abfnrtv/\\nz]|\\u[a-fA-F0-9]{4}|[^'\\\n])*/,
  },
  {
    name: "LiteralStringEnd",
    str: "'",
    pop: true,
    discard: true,
  },
]);
// Part of regexNoLongBrackets: closing brackets of a smaller level are
// allowed inside the content of a long bracket of the given level.
function smallerCaseRegexpPart(level: number) {
  if (level == 0) {
    return "";
  }
  if (level == 1) {
    return "|\\]\\]";
  }
  return `|\\]={0,${level - 1}}(?=\\])`;
}
// Builds a regex that matches content inside a long bracket of the given
// level without consuming that level's own closing bracket (`]` + `level`
// equal signs + `]`); closing brackets of other levels count as content.
function regexNoLongBrackets(level: number) {
  const smallerCase = smallerCaseRegexpPart(level);
  const largerCase = `|\\]={${level + 1},}(?=\\])`;
  const doesntEndWithBracketCase = `|\\]=*[^\\]=]`;
  return new RegExp(
    `([^\\]]${smallerCase}${largerCase}${doesntEndWithBracketCase})+`,
    "m",
  );
}
// Rule for a long literal string of the given level, e.g. level 1 is [=[ ... ]=].
function longLiteralStringRule(level: number) {
  const equalSigns = Array(level).fill("=").join("");
  const lexer = createLexer([
    {
      name: "LiteralStringEnd",
      str: "]" + equalSigns + "]",
      discard: true,
      pop: true,
    },
    {
      name: "stringContent",
      regex: regexNoLongBrackets(level),
      discard: true,
    },
  ]);
  return {
    name: "LiteralStringBegin",
    str: "[" + equalSigns + "[",
    push: lexer,
    discard: true,
  };
}
// One string rule per operator/keyword/symbol; with no `str` or `regex`,
// the rule name itself is the string to match.
const simpleRules = [...ops, ...keywords, ...symbols].map((v) => ({ name: v }));
// Rule for a long comment of the given level, e.g. level 0 is --[[ ... ]].
function longCommentRule(level: number) {
  const equalSigns = Array(level).fill("=").join("");
  const lexer = createLexer([
    {
      name: "longCommentEnd",
      str: "]" + equalSigns + "]",
      discard: true,
      pop: true,
    },
    {
      name: "commentContent",
      regex: regexNoLongBrackets(level),
      discard: true,
    },
  ]);
  return {
    name: "longCommentBegin",
    str: "--[" + equalSigns + "[",
    push: lexer,
    discard: true,
  };
}
export const lex = createLexer(
  [
    {
      name: "ws",
      regex: /\s+/,
      discard: true,
    },
    // Long comment rules for bracket levels 0 through 99.
    ...Array(100).fill(0).map((_value, index) => longCommentRule(index)),
    {
      name: "shortComment",
      regex: /--.*\n/m,
      discard: true,
    },
    ...simpleRules,
    {
      name: "Name",
      regex: /[a-zA-Z_][a-zA-Z_0-9]*/,
    },
    {
      name: "Numeral",
      regex: /[0-9]*\.?[0-9]+/,
    },
    {
      name: "LiteralStringBegin",
      str: '"',
      push: doublequoteStringLexer,
      discard: true,
    },
    {
      name: "LiteralStringBegin",
      str: '"',
      push: singlequoteStringLexer,
      discard: true,
    },
    // Long literal string rules for bracket levels 0 through 99.
    ...Array(100).fill(0).map((_value, index) => longLiteralStringRule(index)),
  ],
);
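
A quick way to try it out (a minimal sketch; I'm assuming the lexer function returned by createLexer yields an object with a tokens array):

const { tokens } = lex(`
local x = 42
-- a short comment
if x > 10 then
  print(x)
end
`);
console.log(tokens);
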
@KillyMXI
Member

KillyMXI commented Jan 17, 2022

Wow! I'm happy this package is finding its use outside of my own projects already!


Thank you for this example.
I think it illustrates one caveat I should've paid more attention to myself. I will update the docs and the json example in a few days to raise awareness (Upd 2022-01-28: docs updated).

In short, try to tokenize some code that contains empty strings.
It is not obvious, but emitting a token for an empty regex match is tricky for a tokenizer. My solution is to not emit any token for an empty match (otherwise infinite loops would have to be mitigated somehow).
In combination with the discarded quotes, this can lead to a bad stream of tokens.
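
For example, with the lexer from the first comment (a minimal sketch; the file name is hypothetical and I'm assuming the result object has a tokens array):

import { lex } from "./lua-lexer.ts"; // hypothetical module holding the lexer above

const { tokens } = lex('local s = "" .. "x"');
console.log(tokens);
// With the quotes discarded and an empty stringContent match emitting no
// token, the empty string "" leaves no trace in the stream, and "x" shows
// up only as a bare stringContent token.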

In my opinion, it is only safe to discard things that won't be missed: if you glue the token array back together, it should still be valid code. Otherwise, be ready to think about edge cases.

  • When composing lexers, it is safer to keep the quotes. You can then identify an empty string in the parser as two quote tokens in a row;
  • Another option might be to introduce an empty-string rule in the outer scope, before the opening quote rule;
  • For simple strings it might be feasible to keep the string content and the quotes in a single regex, using the replace property to extract the content;
  • Single regex as above, but leave the quotes in place; this can actually lead to a cleaner design (see the sketch after this list).
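
A minimal sketch of that last option (rule names and the exact escape handling are just an illustration; I'm assuming the lexer result has a tokens array):

import { createLexer } from "https://deno.land/x/leac/leac.ts";

const lexWholeStrings = createLexer([
  { name: "ws", regex: /\s+/, discard: true },
  { name: "Name", regex: /[a-zA-Z_][a-zA-Z_0-9]*/ },
  { name: "=" },
  {
    // The whole literal, quotes included, becomes one non-empty token, so an
    // empty string "" can never produce an empty match that gets dropped.
    name: "LiteralString",
    regex: /"(?:\\["abfnrtv/\\nz]|\\u[a-fA-F0-9]{4}|[^"\\\n])*"/,
  },
]);

console.log(lexWholeStrings('s = ""').tokens);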

In theory, it might be possible to keep track of rules that returned an empty match and return at most a single token at the current offset. But this would complicate the implementation, and I'm not sure it's a good idea.

Related note: you produce stringContent tokens in multiple places with slightly different regexes. This means the strings might contain differently escaped quotes, and there is not enough information in the token to distinguish them and decide on a proper unescape function later in the parser. Using unique names might be a good idea in this case.
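
For instance, the two inner lexers could simply use distinct names (a sketch; any naming scheme works as long as the parser can tell which unescape rules apply):

import { createLexer } from "https://deno.land/x/leac/leac.ts";

const doublequoteStringLexer = createLexer([
  { name: "dqStringContent", regex: /(?:\\["abfnrtv/\\nz]|\\u[a-fA-F0-9]{4}|[^"\\\n])*/ },
  { name: "LiteralStringEnd", str: '"', pop: true, discard: true },
]);
const singlequoteStringLexer = createLexer([
  { name: "sqStringContent", regex: /(?:\\['abfnrtv/\\nz]|\\u[a-fA-F0-9]{4}|[^'\\\n])*/ },
  { name: "LiteralStringEnd", str: "'", pop: true, discard: true },
]);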

Another note:

const simpleRules = [...ops, ...keywords, ...symbols].map((v) => ({ name: v }));

This is ultimately a designer's choice, but if you've put in the effort to keep things separate up to this point, I'd consider whether keeping this information in the rules and tokens could be beneficial for the parser as well, like this:

const opRules = ops.map((v) => ({ name: 'Operation', str: v }));
const keywordRules = keywords.map((v) => ({ name: 'Keyword', str: v }));
const symbolRules = symbols.map((v) => ({ name: 'Symbol', str: v }));

Rule names don't have to be unique. This is very similar to making a single regex rule that matches any of the listed values.
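
For comparison, a sketch of that single-regex alternative, built from the same keywords array (the ops and symbols lists would need regex escaping before being joined like this):

const keywordRule = {
  name: "Keyword",
  // One rule matching any of the listed keywords.
  regex: new RegExp(keywords.join("|")),
};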


For a PR, the example should be self-contained, like the existing ones. Just call the lexer with minimal sample(s) that illustrate the variety of supported grammar and log the result to the console. The project tests run all examples and store the output as a snapshot.

I'm more concerned about comments. Examples must be just that: examples for those who are learning how to use the package. A link to the official grammar is a must for an existing language, along with any comments that describe the lexer design. The bigger the grammar, the more explanation it might need.

(I was quite lazy about samples and comments because my examples are essentially excerpts from the peberminta examples. Big samples would also lead to unnecessarily big snapshots.)

KillyMXI added the example (User-provided demo) label on Dec 5, 2024