-
Notifications
You must be signed in to change notification settings - Fork 2
Parser Rules
See the cmd/pie
graphical editor and existing example Pie project in langs/golang/go.pip
for a full set of parser rules for parsing Go. You should always start any new project by copying the closest existing example -- all programming languages share many common elements.
The .pig
grammar output file summarizes all the parser rules for a given language, and is the best way to get an overview of the overall structure. It isn't quite in standard BNF but more a JSON-esque version thereof.
The core of each Rule is the Rule
field, which is a sequence of space-separated elements, that are either names of other rules or literal tokens to match. The essential idea is that you are specifying the sequence of input that this rule matches. Here are the special syntax rules for this field. You can also look at the Rules
field to see the compiled version of the Rule field to verify it got everything right.
- Tokens must be surrounded by regular single quotes
' '
. Keywords are:'key:keyword'
. - All tokens are matched at the same nesting depth as the start of the scope of this rule, unless they have a +D relative depth value differential before the token. This means that the following
Block
definition will only match brackets at the same level of nesting depth, with this rule -- this is essential for getting the correct matching bracket among the many typically possible (and that happens very quickly due to the precomputed and optimized nesting depth computation done in PassTwo):
'{' ?StmtList '}' 'EOS'
-
@
prefix for a sub-rule to require that rule to match -- by default explicit tokens are used if available, and then only the first sub-rule failing that. -
?
prefix for optional sub-rule -- not used for matching but if rest of rule matches and is executed, then it checks to see if that sub-rule matches and uses it if so. -
!
by itself (not as a prefix) to define start of an exclusionary rule -- doesn't match when those rule elements DO match. -
:
prefix for a special group node that matches a single token at start of scope, and then defers to the child rules to perform full match -- this is used for FirstTokMap when there are multiple versions of a given keyword rule. -
-
prefix for tokens anchored by the end (next token) instead of the previous one -- typically just for token prior to 'EOS' but also a block of tokens that need to go backward in the middle of a sequence to avoid ambiguity can be marked with -.
Each Rule name must be unique throughout entire grammar. There is a global RuleMap that has all the rule names that is used for quick lookup. Any rule can refer to any other rule, including children etc.
Rules with Children (i.e., Group
rules) just iterate over the children and attempt to match any of them, in order. Except for one special case (the :
prefix for FirstTokMap), a rule should either have Children or a specific Rule
it matches, but not both.
To keep things better organized, we use top-level groups such as TypeRules
FuncRules
etc to organize all the rules relevant to a given category. These top-level groups are never used in other rules, and would be nonsensical.
The first top-level group is special: it is typically called File
and contains the entire set of rules that can match at the top-level of a file -- only these rules are iterated over to process anything when the parser is called. The parser is called repeatedly until it reaches the end of the file, or it stops making any progress, and so these File rules must handle anything that can be found at the top level. This also includes when the parser is called on a single line from a file -- typically this requires a statement-level rule, even in languages such as Go that do not allow generic statements at the top-level.
For a top-level function, for example, the entire function will be processed by a single top-level rule, which will grab the overall depth-delimited scope for that function, and then progressively chop that large chunk of text into smaller and smaller elements (statements typically). Likewise, an if
statement encompasses the entire if
, including any and all else
, else if
clauses, etc.
A large amount of work is done by the ExprRules
, which handle any kind of expression -- this is where a lot of complexity and power of the language happens. The top-level entry point is typically Expr
which then decomposes into BinaryExpr
and UnaryExpr
(using child aliases for greater clarity).
-
BinaryExpr
is organized in reverse precedence order, and largely uses special Reverse Logic to achieve the proper associativity of mathematical operations, as mentioned in the README, and needed for top-down recursive-descent parsing. E.g.,AddExpr
is as follows, with the initial minus sign triggering the reverse logic:
-Expr '+' Expr
-
UnaryExpr
are all the unary expressions (-UnaryExpr
forNegExpr
etc) -- the last entry here is then thePrimaryExpr
which are the basic elements of expressions, including various types of literal numbers, but also a lot of other complex elements.. -
PrimaryExpr
elements include various important rules such asSelector
which parsesstruct.field
expressions (andpackage.el
too -- it can't usually tell the difference in parsing) -- these can be chained along with lots of other possible expressions includingSlice
that takes the[index]
array / slice value of an expression -- to allow these to compose properly, Slice has optional?
elements:
?PrimaryExpr '[' SliceExpr ']' ?PrimaryExpr
- Overall, the key to getting all this working correctly is to ensure that each rule has a unique set of concrete tokens that distinguish this rule from any others. Because rules are tested in order, you can put more specific rules first, so the more general case rule will only run if the special case didn't match. All the different possible options for a given level of expression should be available within the children at a given level (or referred off to another such group using a simple alias).
The Trace
settings for the parser provide the most useful form of debugging info for figuring out why a rule is not firing when it should, or firing when it shouldn't.. In the Pie editor, just click the Edit Trace
button on the toolbar and select the steps you want to get info about. The NoMatch
case generates a lot of output so it is often better combined with a filter in the Rules
field which selects a subset of (space separated) rules to report trace info on. You should generally be able to narrow it down to a few relevant rules and enter their names to get more useful output.
The trace results are shown in the Output
tab.
Use the OptTokMap
flag to create a map of all the tokens within the current scope, for faster rejection of non-matching items. Typically this is set for the Expr
, ExprList
and Type
rules, which are where the bulk of the parsing time and "entropy" are (i.e., the highest uncertainty about what kind of tokens might appear next).
Use FirstTokMap
on a Group rule (i.e., a rule with children) for cases where the children each have distinctive starting tokens -- in this case, the Group can read the next token, and do a quick map-lookup to select the rule with that starting token. If there are multiple child rules with the same starting token, you need to put those in a sub-group with a rule that just has the starting token using the :
prefix at start -- this is a special kind of group just for this case to allow that token match and then the group selects which of those child rules within it match.
This makes a big difference in performance (100's of msec on big files) -- see the Go UnaryExpr
, PrimaryExpr
and LiteralType
and Stmt
rules for examples.
For rules that match large blocks of code surrounded by { }
, with an EOS
at the end, you can specify a minus sign for the final bracket: -'}' 'EOS'
which will then anchor the match of that bracket to be just before that final EOS -- EOS finding is highly optimized by depth etc, and so this makes finding that final bracket much faster. Typically this actually doesn't run very often and makes little difference in overall performance but it feels better to be more efficient ;)
Some expressions are mis-parsed if read in a strictly forward manner (as is the default) -- for example:
switch apv := aps.Value.(type) {
The .(type)
expression is key to disambiguating this from other kinds of "path" expressions that are not type-switches, but if it reads it forward, it is difficult to get that match to work properly -- it works great however if you anchor the match on that {
and then match backward from there for the type
keyword etc.
This can be achieved by marking a block of tokens all with -
prefixes and it will re-order the processing of those and anchor them "from next" instead of "from previous". See the Go SwitchTypeName
and SwitchTypeAnon
rules.
Before figuring out a better solution, there was a basic confusion about the pointer vs. multiplication meaning of * that seemed fundamentally ambiguous, without knowing what is a type name vs. a var name etc:
var SliceAry = [2]*Rule{} // an empty array of *Rule pointers
var MultSlice = p[2]*Rule // multiplication of element times variable
In an attempt to disambiguate this, we put an exclusionary condition on the MultExpr
rule:
-Expr '*' Expr ! ?'key:map' '[' ? ']' '*' 'Name' ?'.' ?'Name'
Where everything after the !
is used to exclude the multiplication operator, which has to come first in the parsing by virtue of it needing to organize the expression into larger top-down chunks, relative to the unary * interpretation. The exclusion rule works by aligning the '' in the exclusion condition with the '' in the main rule, and then going forward from that point and then backward -- if everything matches (that is not optional) then it is excluded.
While this was not useful in the end for this problem, it may yet prove handy. Btw, standard Unary * works because the binary interpretation does not match when * is preceded by another operator, which is typically the case for ambiguous binary operator / unary cases.
We ended up solving the above ambiguity by adding a var
decl case that first checks for CompositeLit expression before falling back on generic Expr -- this gives that a special precedence for this case while allowing regular binary * to operate otherwise for generic expressions.