YAPG sounds hilarious, but yes, this is yet another parse-tables generator.
YAPG (if you'd call it that) takes a specification for an LR(1) grammar as input and generates a C++ header file containing action and goto tables, alongside helper structs for writing a parser for the grammar from the tables.
While working on a compiler to LLVM IR, I had to deal with a pretty common compiler enginnering dilemma: hand-written or generated parsers. Arguing for the former, the parsers for a lot of the more successful languages are hand-written (clang, rust, gcc). The flexibility it provides for handling complex grammars is a significant advantage. Conversely? Well, generated parsers are insanely cool to me. I think that's more than enough reason.
Now, I'm a sucker for "figuring out how things work" so of course I decided to write my own parser generator. Following three weeks of studying as much theory on parsing as I could find, alongside pretty consistent programming, I was thankfully successful.
I don't expect this to be a big thing (whatever that means). It's just something I had a lot of fun working on. Plus, I'd include a less-than formal description of table-driven LR(1) parsing and table generation in a seperate blog post. So much of the existing material exploring the topic can be "tough" looking.
The following commands clone the repository and compile the parser-generator:
$ git clone [email protected]:wldfngrs/parser-generator.git
$ cd parser-generator
$ mkdir build
$ cd build
$ cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=clang++
$ ninja parsegen
If the build is successful, you should have an executable; parsegen
, in the build/
directory.
$ ./parsegen <path/to/grammar> [OPTIONAL] <path/to/output/file>
$ ./parsegen -h or ./parsegen -H for help information
The parser generator expects two arguments:
- path to file containing specification of the LR(1) grammar,
- path to file where the tables should be written to.
If an output file is not provided explicitly, the parser-generator defaults to "output.h"
within the build/
directory.
Either of the help options; -h
, -H
, print the grammar specification syntax to standard output.
Note: RHS
('Right Hand Side'), LHS
('Left Hand Side')
The input grammar specification should follow these base rules:
- Terminals must be prefixed by 't_'. Each terminal definition line starting this way.
- The first line in the terminal definition block should be the goal production lookahead terminal.
- Terminals must be defined first, in a contiguous block without interleaving empty lines between individual definitions.
- The first occuring empty line signifies the end of the terminal definition block, and the start of the rules/productions block.
- The first line in the rules/productions block should be the goal production.
- Terminals (symbols prefixed by 't_') cannot appear on the RHS of a rule/production.
- The RHS and LHS of a production must be delimited by ' > '. Note: [SPACE] [GREATER_THAN] [SPACE]
Of course, you'd receive helpful error and diagnostic messages if any of these rules are ignored.
For a simple, yet robust example, take a 'parentheses' grammar that expects a right-parenthesis symbol )
for every left-parenthesis symbol (
(in that order). For example, ()
is valid but )(
is not.
Here's a correct grammar specification for such a 'parentheses' grammar:
t_EOF
t_LP
t_RP
Goal > List
List > List Pair
List > Pair
Pair > t_LP Pair t_RP
Pair > t_LP t_RP
And here's the output file generated by the parser generator, for the parentheses grammar:
parentheses/parse-tables.h
#pragma once
#include <unordered_map>
#include <unordered_set>
#include <string>
#include <string_view>
#include <vector>
std::unordered_set<std::string> strings {
"Goal", "Pair", "List"
};
enum TokenType {
t_RP, t_EOF, t_LP
};
std::vector<std::pair<std::string_view, size_t>> reduce_info {
{ *strings.find("Goal"), 1 }, { *strings.find("Pair"), 3 },
{ *strings.find("List"), 2 }, { *strings.find("List"), 1 },
{ *strings.find("Pair"), 2 }
};
struct PairHash {
size_t operator()(const std::pair<size_t, std::string_view>& pair) const {
return std::hash<size_t>{}(pair.first) ^ std::hash<std::string_view>{}(pair.second);
}
size_t operator()(const std::pair<size_t, TokenType>& pair) const {
return std::hash<size_t>{}(pair.first) ^ pair.second;
}
};
enum ActionType {
SHIFT,
REDUCE,
ACCEPT
};
struct Action {
ActionType type;
size_t value;
};
std::unordered_map<std::pair<size_t, TokenType>, Action, PairHash> actionTable {
{{ 6, t_EOF }, {REDUCE, 4 }}, {{ 0, t_LP }, {SHIFT, 2 }},
{{ 8, t_RP }, {SHIFT, 10 }}, {{ 6, t_LP }, {REDUCE, 4 }},
{{ 11, t_LP }, {REDUCE, 1 }}, {{ 5, t_RP }, {SHIFT, 11 }},
{{ 11, t_EOF }, {REDUCE, 1 }}, {{ 4, t_EOF }, {REDUCE, 2 }},
{{ 10, t_RP }, {REDUCE, 1 }}, {{ 4, t_LP }, {REDUCE, 2 }},
{{ 7, t_LP }, {SHIFT, 7 }}, {{ 9, t_RP }, {REDUCE, 4 }},
{{ 3, t_EOF }, {REDUCE, 3 }}, {{ 3, t_LP }, {REDUCE, 3 }},
{{ 1, t_EOF }, {ACCEPT, 0 }}, {{ 1, t_LP }, {SHIFT, 2 }},
{{ 2, t_RP }, {SHIFT, 6 }}, {{ 2, t_LP }, {SHIFT, 7 }},
{{ 7, t_RP }, {SHIFT, 9 }}
};
std::unordered_map<std::pair<size_t, std::string_view>, size_t, PairHash> gotoTable {
{{ 0, *strings.find("List") }, {1}}, {{ 0, *strings.find("Pair") }, {3}},
{{ 1, *strings.find("Pair") }, {4}}, {{ 2, *strings.find("Pair") }, {5}},
{{ 7, *strings.find("Pair") }, {8}}
};
The generated actionTable is an std::unordered_map
that takes as 'key' an std::pair
of the parse function's current state and the next token, to return as 'value' an Action
struct object.
It defines the action to be taken; SHIFT
, REDUCE
, ACCEPT
, given a current state and a next token.
The Action
struct contains two fields: type (SHIFT
, REDUCE
, ACCEPT
) and value. For each type, the value field is interpreted differently:
Note: PF
('parse function'), NT
('next token')
-
SHIFT
: Given thePF
current state, andNT
, value represents thePF
next state. -
REDUCE
: Given thePF
current state, andNT
, value represents an index to reduce_info.reduce_info is an
std::vector
ofstd::pair
elements. For eachstd::pair
element, the first field is the non-terminal symbol to reduce to, and the second field is the number of states to pop off thePF
state stack.This non-terminal symbol information can be used for creating Abstract Syntax Tree nodes during the parse.
-
ACCEPT
: Given thePF
current state, andNT
, value is redundant. This is an accepting state, signifiying a validated input.
The generated gotoTable is an std::unordered_map
that takes as 'key' an std::pair
of the parse function's current state and the LHS
production symbol to be reduced to, to return as 'value' the PF
next state.
It defines the next state given a current reduce non-terminal symbol and a next token.
These tables and helper structs can then be referenced within the parser's parse()
function.
Here's the parse()
function for the above specified parentheses grammar:
parentheses/parentheses.h
static bool parse(std::vector<TokenType> tokens) {
std::stack<size_t> states;
states.push(0);
auto state = static_cast<size_t>(0);
for (auto i = 0; i < tokens.size();) {
state = states.top();
if (actionTable.find(std::make_pair(state, tokens[i])) == actionTable.end()) {
return false;
}
Action& action = actionTable[std::make_pair(state, tokens[i])];
if (action.type == ActionType::REDUCE) {
auto index = actionTable[std::make_pair(state, tokens[i])].value;
auto pop_count = reduce_info[index].second;
while (pop_count--) {
states.pop();
}
state = states.top();
auto reduce_symbol = reduce_info[index].first;
auto next_state = gotoTable[std::make_pair(state, reduce_symbol)];
states.push(next_state);
}
else if (action.type == ActionType::SHIFT) {
auto next_state = actionTable[std::make_pair(state, tokens[i])].value;
states.push(next_state);
// Only SHIFT actions proceed to the next token
++i;
}
else if (action.type == ActionType::ACCEPT) {
return true;
}
}
}
For those familiar with table-generated parsers, the algorithm is easy to follow. For those who aren't, the parse function takes a list of tokens ordered by their chronological position in the input source file, iterating through them one by one.
Implicitly assuming an initial state of 0
, the parse function progresses to subsequent states based on the 'action' mapped to the current state and the next token in the chronologically-ordered list of tokens.
If the pair of current state and next token do not exist in the generated actionTable, the parse function has entered an invalid state, and should return false signifiying a failed parse.
Else, the parse function retrieves the 'action' mapped to that pair, and executes the corresponding behaviour tied to that 'action':
SHIFT
action: push the next state to the state stack. Move to the next token.REDUCE
action: pop x states off the state stack. Where x isreduce_info[reduce-action-value].second
. Refer to parentheses parse function.ACCEPT
action: return true.
The grammar specification syntax allows you explicitly specify precedence and associativity behaviour of terminals. This is commonly useful in evaluating mathematical expressions.
As an example, here's the grammar specification of a simple 'mathematical expressions' grammar:
t_EOF
t_PLUS 1 l
t_MINUS 1 l
t_TIMES 2 l
t_DIVIDE 2 l
t_NUMBER 4
t_LP
t_RP
Statement > Expression
Expression > t_NUMBER
Expression > Grouping
Expression > Add
Expression > Sub
Expression > Mul
Expression > Div
Expression > Unary
Grouping > t_LP Expression t_RP 2
Add > Expression t_PLUS Expression
Sub > Expression t_MINUS Expression
Mul > Expression t_TIMES Expression
Div > Expression t_DIVIDE Expression
Unary > t_MINUS Expression 3
In the terminal definition block,
t_EOF
t_PLUS 1 l
t_MINUS 1 l
t_TIMES 2 l
t_DIVIDE 2 l
t_NUMBER 4
t_LP
t_RP
Following the terminal name definition, on the same line, the terminal's precedence value and associativity behavior can be explicitly set.
Precedence values must be integers (positive or negative). If not explicitly set, the terminal precedence value defaults to 0
. The higher the precedence value set, the higher the terminal precedence.
Associativity behaviour must be one of l
(left-associative), r
(right-associative), or n
(non-associative). If not explicitly set, the associativity behavior defaults to n
(non-associative).
In the above terminal definition block, the t_PLUS
(+
) and t_MINUS
(-
) terminals (binary operators for addition and subtraction respectively) have the same precedence value; 1
, and associativity behavior; l
(left-associative), specified.
However, t_TIMES
(*
) and t_DIVIDE
(/
) (binary operators for multiplication and division) both have a higher precedence value; 2
. Multiplication and division, as a result, would have a higher precedence than addition and substraction (as expected).
Numbers, t_NUMBER
, the most important node of any mathematical expression, has the highest precedence value set 4
.
The precedence of a rule/production can be set by specifying an integer value at the end of a rule/production. For example,
Unary > t_MINUS Expression 3
Note that the Unary
rule expects the same terminal as a Sub
rule; t_MINUS
. However, it's standard mathematical understanding that Unary operations are evaluated before any other binary operations.
Explicit rule precedence forces the parser to somewhat override the already set precedence value of t_MINUS
, treating it as an implicit Unary terminal instead (within the Unary
rule/production) with the explicitly set precedence value.
With table-driven parsers, two kinds of conflicts potentially arise in the table-generation process: SHIFT-REDUCE
and REDUCE-REDUCE
conflicts.
Consider an expression, 2 + 2 * 4
. This could be evaluated in two different ways.
On the one hand, the addition can be evaluated first, 2 + 2
, and the result evaluated against the * 4
, like so; (2 + 2) * 4
. This would produce 16
as output, which is incorrect.
Fortunately, we understand that the precedence of mathematical operators demands that multiplication operations be evaluated before addition operations. That is, 2 + (2 * 4)
, to produce 10
as output.
How to encode this behavior in the parser-generator? Precedence and associativity rules, of course.
Simply put, a SHIFT-REDUCE
conflict occurs when both a SHIFT
and REDUCE
action would produce a valid parse function next state. These conflicts are resolved by applying these conditions in order:
Let rule
be a rule subject to reduce and term
be a terminal/token that is encountered on input.
- If explicit
rule
precedence is bigger thanterm
precedence, perform aREDUCE
. - If precedence of last terminal in
rule
is bigger thanterm
precedence, perform aREDUCE
. - If precedence of last terminal in
rule
is equal toterm
precedence and last terminal inrule
is left-associative, perform aREDUCE
- Otherwise, perform a
SHIFT
.
In some cases, the language is ill-formed and the grammar specification on REDUCE
actions is unclear. That is, more than one rules/productions have the same LHS
.
This is a fatal error that results in the parser generator terminating early with an error message identifying the rules/productions leading to ambiguous REDUCE
actions.
Rather than a bunch of carefully hand-picked test-cases, this repository includes two REPL interpreters for both the 'mathematical expressions' grammar and the 'parentheses' grammar.
Run these commands within the build/
directory to build and then run the 'mathematical expressions' interpreter:
$ ninja expressions
$ ./expressions
Math Expressions Evaluator ('q' or CTRL-C to exit)
>
You can now verify the correctness of the parser's expression evaluation order of fundamental (*
, +
, -
, \
, Unary, Brackets) mathematical operations.
Run these commands within the build/
directory to build and then run the 'parentheses' interpreter:
$ ninja parentheses
$ ./parentheses
Parentheses Grammar Interpreter (enter 'q' or CTRL-C to exit)
>
Similarly, you can verify the correctness of a string of parentheses; that is, every left parenthesis (
has a matching right parenthesis )
in the correct order (i.e. balanced and well-formed).