Yet Another Parser-Generator (YAPG)

YAPG sounds hilarious, but yes, this is yet another parse-tables generator.

YAPG (if you'd call it that) takes a specification for an LR(1) grammar as input and generates a C++ header file containing action and goto tables, alongside helper structs for writing a parser for the grammar from the tables.

Why?

While working on a compiler to LLVM IR, I had to deal with a pretty common compiler enginnering dilemma: hand-written or generated parsers. Arguing for the former, the parsers for a lot of the more successful languages are hand-written (clang, rust, gcc). The flexibility it provides for handling complex grammars is a significant advantage. Conversely? Well, generated parsers are insanely cool to me. I think that's more than enough reason.

Now, I'm a sucker for "figuring out how things work" so of course I decided to write my own parser generator. Following three weeks of studying as much theory on parsing as I could find, alongside pretty consistent programming, I was thankfully successful.

I don't expect this to be a big thing (whatever that means). It's just something I had a lot of fun working on. Plus, I'd include a less-than formal description of table-driven LR(1) parsing and table generation in a seperate blog post. So much of the existing material exploring the topic can be "tough" looking.

Installation

The following commands clone the repository and compile the parser-generator:

$ git clone git@github.com:wldfngrs/parser-generator.git
$ cd parser-generator
$ mkdir build
$ cd build
$ cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=clang++
$ ninja parsegen

If the build is successful, you should have an executable; parsegen, in the build/ directory.

Usage

$ ./parsegen <path/to/grammar> [OPTIONAL] <path/to/output/file>
$ ./parsegen -h or ./parsegen -H for help information

The parser generator expects two arguments:

path to file containing specification of the LR(1) grammar,
path to file where the tables should be written to.

If an output file is not provided explicitly, the parser-generator defaults to "output.h" within the build/ directory.

Either of the help options; -h, -H, print the grammar specification syntax to standard output.

LR(1) Grammar Specification Syntax

Note: RHS ('Right Hand Side'), LHS ('Left Hand Side')

The input grammar specification should follow these base rules:

Terminals must be prefixed by 't_'. Each terminal definition line starting this way.
The first line in the terminal definition block should be the goal production lookahead terminal.
Terminals must be defined first, in a contiguous block without interleaving empty lines between individual definitions.
The first occuring empty line signifies the end of the terminal definition block, and the start of the rules/productions block.
The first line in the rules/productions block should be the goal production.
Terminals (symbols prefixed by 't_') cannot appear on the RHS of a rule/production.
The RHS and LHS of a production must be delimited by ' > '. Note: [SPACE] [GREATER_THAN] [SPACE]

Of course, you'd receive helpful error and diagnostic messages if any of these rules are ignored.

Parentheses grammar

For a simple, yet robust example, take a 'parentheses' grammar that expects a right-parenthesis symbol ) for every left-parenthesis symbol ( (in that order). For example, () is valid but )( is not.

Here's a correct grammar specification for such a 'parentheses' grammar:

t_EOF
t_LP
t_RP

Goal > List
List > List Pair
List > Pair
Pair > t_LP Pair t_RP
Pair > t_LP t_RP

And here's the output file generated by the parser generator, for the parentheses grammar:

parentheses/parse-tables.h

#pragma once

#include <unordered_map>
#include <unordered_set>
#include <string>
#include <string_view>
#include <vector>

std::unordered_set<std::string> strings {
	"Goal", "Pair", "List"
};

enum TokenType {
	t_RP, t_EOF, t_LP
};

std::vector<std::pair<std::string_view, size_t>> reduce_info {
	{ *strings.find("Goal"), 1 }, { *strings.find("Pair"), 3 },
	{ *strings.find("List"), 2 }, { *strings.find("List"), 1 },
	{ *strings.find("Pair"), 2 }
};

struct PairHash {
	size_t operator()(const std::pair<size_t, std::string_view>& pair) const {
		return std::hash<size_t>{}(pair.first) ^ std::hash<std::string_view>{}(pair.second);
	}

	size_t operator()(const std::pair<size_t, TokenType>& pair) const {
		return std::hash<size_t>{}(pair.first) ^ pair.second;
	}
};

enum ActionType {
	SHIFT,
	REDUCE,
	ACCEPT
};

struct Action {
	ActionType type;
	size_t value;
};

std::unordered_map<std::pair<size_t, TokenType>, Action, PairHash> actionTable {
	{{ 6, t_EOF }, {REDUCE, 4 }}, {{ 0, t_LP }, {SHIFT, 2 }},
	{{ 8, t_RP }, {SHIFT, 10 }}, {{ 6, t_LP }, {REDUCE, 4 }},
	{{ 11, t_LP }, {REDUCE, 1 }}, {{ 5, t_RP }, {SHIFT, 11 }},
	{{ 11, t_EOF }, {REDUCE, 1 }}, {{ 4, t_EOF }, {REDUCE, 2 }},
	{{ 10, t_RP }, {REDUCE, 1 }}, {{ 4, t_LP }, {REDUCE, 2 }},
	{{ 7, t_LP }, {SHIFT, 7 }}, {{ 9, t_RP }, {REDUCE, 4 }},
	{{ 3, t_EOF }, {REDUCE, 3 }}, {{ 3, t_LP }, {REDUCE, 3 }},
	{{ 1, t_EOF }, {ACCEPT, 0 }}, {{ 1, t_LP }, {SHIFT, 2 }},
	{{ 2, t_RP }, {SHIFT, 6 }}, {{ 2, t_LP }, {SHIFT, 7 }},
	{{ 7, t_RP }, {SHIFT, 9 }}
};

std::unordered_map<std::pair<size_t, std::string_view>, size_t, PairHash> gotoTable {
	{{ 0, *strings.find("List") }, {1}}, {{ 0, *strings.find("Pair") }, {3}},
	{{ 1, *strings.find("Pair") }, {4}}, {{ 2, *strings.find("Pair") }, {5}},
	{{ 7, *strings.find("Pair") }, {8}}
};

Action Table, Goto Table

The generated actionTable is an std::unordered_map that takes as 'key' an std::pair of the parse function's current state and the next token, to return as 'value' an Action struct object.

It defines the action to be taken; SHIFT, REDUCE, ACCEPT, given a current state and a next token.

The Action struct contains two fields: type (SHIFT, REDUCE, ACCEPT) and value. For each type, the value field is interpreted differently:

Note: PF ('parse function'), NT ('next token')

SHIFT: Given the PF current state, and NT, value represents the PF next state.
REDUCE: Given the PF current state, and NT, value represents an index to reduce_info.

reduce_info is an std::vector of std::pair elements. For each std::pair element, the first field is the non-terminal symbol to reduce to, and the second field is the number of states to pop off the PF state stack.

This non-terminal symbol information can be used for creating Abstract Syntax Tree nodes during the parse.
ACCEPT: Given the PF current state, and NT, value is redundant. This is an accepting state, signifiying a validated input.

The generated gotoTable is an std::unordered_map that takes as 'key' an std::pair of the parse function's current state and the LHS production symbol to be reduced to, to return as 'value' the PF next state.

It defines the next state given a current reduce non-terminal symbol and a next token.

These tables and helper structs can then be referenced within the parser's parse() function.

Parentheses parse function

Here's the parse() function for the above specified parentheses grammar:

parentheses/parentheses.h

static bool parse(std::vector<TokenType> tokens) {
	std::stack<size_t> states;
	states.push(0);
	auto state = static_cast<size_t>(0);

	for (auto i = 0; i < tokens.size();) {
		state = states.top();
		if (actionTable.find(std::make_pair(state, tokens[i])) == actionTable.end()) {
			return false;
		}
		
		Action& action = actionTable[std::make_pair(state, tokens[i])];

		if (action.type == ActionType::REDUCE) {
			auto index = actionTable[std::make_pair(state, tokens[i])].value;
			auto pop_count = reduce_info[index].second;
			while (pop_count--) {
				states.pop();
			}
			state = states.top();
			auto reduce_symbol = reduce_info[index].first;
			auto next_state = gotoTable[std::make_pair(state, reduce_symbol)];
			states.push(next_state);
		}
		else if (action.type == ActionType::SHIFT) {
			auto next_state = actionTable[std::make_pair(state, tokens[i])].value;
			states.push(next_state);
			// Only SHIFT actions proceed to the next token
			++i;
		}
		else if (action.type == ActionType::ACCEPT) {
			return true;
		}
	}
}

For those familiar with table-generated parsers, the algorithm is easy to follow. For those who aren't, the parse function takes a list of tokens ordered by their chronological position in the input source file, iterating through them one by one.

Implicitly assuming an initial state of 0, the parse function progresses to subsequent states based on the 'action' mapped to the current state and the next token in the chronologically-ordered list of tokens.

If the pair of current state and next token do not exist in the generated actionTable, the parse function has entered an invalid state, and should return false signifiying a failed parse.

Else, the parse function retrieves the 'action' mapped to that pair, and executes the corresponding behaviour tied to that 'action':

SHIFT action: push the next state to the state stack. Move to the next token.
REDUCE action: pop x states off the state stack. Where x isreduce_info[reduce-action-value].second. Refer to parentheses parse function.
ACCEPT action: return true.

Precedence and associativity

The grammar specification syntax allows you explicitly specify precedence and associativity behaviour of terminals. This is commonly useful in evaluating mathematical expressions.

As an example, here's the grammar specification of a simple 'mathematical expressions' grammar:

t_EOF
t_PLUS 1 l
t_MINUS 1 l
t_TIMES 2 l
t_DIVIDE 2 l
t_NUMBER 4
t_LP
t_RP

Statement > Expression
Expression > t_NUMBER
Expression > Grouping
Expression > Add
Expression > Sub
Expression > Mul
Expression > Div
Expression > Unary
Grouping > t_LP Expression t_RP 2
Add > Expression t_PLUS Expression
Sub > Expression t_MINUS Expression
Mul > Expression t_TIMES Expression
Div > Expression t_DIVIDE Expression
Unary > t_MINUS Expression 3

Terminal precedence

In the terminal definition block,

t_EOF
t_PLUS 1 l
t_MINUS 1 l
t_TIMES 2 l
t_DIVIDE 2 l
t_NUMBER 4
t_LP
t_RP

Following the terminal name definition, on the same line, the terminal's precedence value and associativity behavior can be explicitly set.

Precedence values must be integers (positive or negative). If not explicitly set, the terminal precedence value defaults to 0. The higher the precedence value set, the higher the terminal precedence.

Associativity behaviour must be one of l (left-associative), r (right-associative), or n (non-associative). If not explicitly set, the associativity behavior defaults to n (non-associative).

In the above terminal definition block, the t_PLUS (+) and t_MINUS (-) terminals (binary operators for addition and subtraction respectively) have the same precedence value; 1, and associativity behavior; l (left-associative), specified.

However, t_TIMES (*) and t_DIVIDE (/) (binary operators for multiplication and division) both have a higher precedence value; 2. Multiplication and division, as a result, would have a higher precedence than addition and substraction (as expected).

Numbers, t_NUMBER, the most important node of any mathematical expression, has the highest precedence value set 4.

Explicit rule/production precedence

The precedence of a rule/production can be set by specifying an integer value at the end of a rule/production. For example,

Unary > t_MINUS Expression 3

Note that the Unary rule expects the same terminal as a Sub rule; t_MINUS. However, it's standard mathematical understanding that Unary operations are evaluated before any other binary operations.

Explicit rule precedence forces the parser to somewhat override the already set precedence value of t_MINUS, treating it as an implicit Unary terminal instead (within the Unary rule/production) with the explicitly set precedence value.

Conflicts

With table-driven parsers, two kinds of conflicts potentially arise in the table-generation process: SHIFT-REDUCE and REDUCE-REDUCE conflicts.

SHIFT-REDUCE conflicts

Consider an expression, 2 + 2 * 4. This could be evaluated in two different ways.

On the one hand, the addition can be evaluated first, 2 + 2, and the result evaluated against the * 4, like so; (2 + 2) * 4. This would produce 16 as output, which is incorrect.

Fortunately, we understand that the precedence of mathematical operators demands that multiplication operations be evaluated before addition operations. That is, 2 + (2 * 4), to produce 10 as output.

How to encode this behavior in the parser-generator? Precedence and associativity rules, of course.

Simply put, a SHIFT-REDUCE conflict occurs when both a SHIFT and REDUCE action would produce a valid parse function next state. These conflicts are resolved by applying these conditions in order:

Let rule be a rule subject to reduce and term be a terminal/token that is encountered on input.

If explicit rule precedence is bigger than term precedence, perform a REDUCE.
If precedence of last terminal in rule is bigger than term precedence, perform a REDUCE.
If precedence of last terminal in rule is equal to term precedence and last terminal in rule is left-associative, perform a REDUCE
Otherwise, perform a SHIFT.

REDUCE-REDUCE conflicts

In some cases, the language is ill-formed and the grammar specification on REDUCE actions is unclear. That is, more than one rules/productions have the same LHS.

This is a fatal error that results in the parser generator terminating early with an error message identifying the rules/productions leading to ambiguous REDUCE actions.

Testing

Rather than a bunch of carefully hand-picked test-cases, this repository includes two REPL interpreters for both the 'mathematical expressions' grammar and the 'parentheses' grammar.

Mathematical Expressions Interpreter

Run these commands within the build/ directory to build and then run the 'mathematical expressions' interpreter:

$ ninja expressions
$ ./expressions
Math Expressions Evaluator ('q' or CTRL-C to exit)
>

You can now verify the correctness of the parser's expression evaluation order of fundamental (*, +, -, \, Unary, Brackets) mathematical operations.

Parentheses Interpreter

Run these commands within the build/ directory to build and then run the 'parentheses' interpreter:

$ ninja parentheses
$ ./parentheses
Parentheses Grammar Interpreter (enter 'q' or CTRL-C to exit)
>

Similarly, you can verify the correctness of a string of parentheses; that is, every left parenthesis ( has a matching right parenthesis ) in the correct order (i.e. balanced and well-formed).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Yet Another Parser-Generator (YAPG)

Contents

Why?

Installation

Usage

LR(1) Grammar Specification Syntax

Parentheses grammar

Action Table, Goto Table

Parentheses parse function

Precedence and associativity

Terminal precedence

Explicit rule/production precedence

Conflicts

SHIFT-REDUCE conflicts

REDUCE-REDUCE conflicts

Testing

Mathematical Expressions Interpreter

Parentheses Interpreter

Files

README.md

Latest commit

History

README.md

File metadata and controls

Yet Another Parser-Generator (YAPG)

Contents

Why?

Installation

Usage

LR(1) Grammar Specification Syntax

Parentheses grammar

Action Table, Goto Table

Parentheses parse function

Precedence and associativity

Terminal precedence

Explicit rule/production precedence

Conflicts

SHIFT-REDUCE conflicts

REDUCE-REDUCE conflicts

Testing

Mathematical Expressions Interpreter

Parentheses Interpreter