Add initial lossless syntax tree implementation #443
base: main
Conversation
I've created this PR at this stage for feedback on what exists so far. I would suggest the following types should be given the most attention:
The codebase intends to emulate the Microsoft.CodeAnalysis concepts for familiarity, using the same general concepts of tokens, trivia and syntax nodes. The public API should not be considered complete (there are many obvious additions that could be made) but should be considered representative of the core behaviour desired by the code generators. The current implementation is able to parse a feature file with a feature declaration, but no scenarios. I've added some support for handling errors, but there's a lot more work needed in that space to handle every kind of failure. Through creating the structures for representing a feature file and the single feature declaration, I can see a lot of opportunity to generate code. Once I get a sense that we're happy with this structure and would like to proceed, I'll investigate how we might generate all the desired structures following the same patterns.
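To give a flavour of the intended consumption, here is a hypothetical usage sketch modelled on Roslyn's CSharpSyntaxTree.ParseText; the GherkinSyntaxTree entry point and the accessors below are assumptions, not the PR's confirmed API.

using System;
using Reqnroll.CodeAnalysis.Gherkin;

// Hypothetical usage sketch; names are illustrative only.
var tree = GherkinSyntaxTree.ParseText(
    "Feature: Calculator\n" +
    "\n" +
    "  Adds two numbers together.\n");

var root = tree.GetRoot();

// Lossless round-trip: the full text of the root node, trivia included,
// should reproduce the original source exactly.
Console.WriteLine(root.ToFullString());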
Thank you. I like the approach of making small steps towards our generation goals. 🤘 To get a design overview of the change, do I see it correctly that this change basically uses the Gherkin parser, but provides an alternative AST builder and token scanner/matcher that fits the purpose? Also, if I understand correctly, this is now an alternative parser that is not being used yet, but would be used for Roslyn code gen and potentially for the legacy CodeDom-based generation as well. How do you see that? I'm very bad at reviewing things that do not yet have a real usage. 😎 Would it be possible to add some simplistic tool (maybe a basic linter or whatever is easiest) that could demonstrate the use of the new parser? It does not have to be something we release. That would probably give a few good inputs for the design (at least for me).
That's right: there's an implementation of
It's intended to be used for Roslyn-based generators, but could be used in any generation space. It could also be very valuable in all IDE plug-ins, where performing code refactoring would be possible by modifying the syntax tree and writing the result back to the source file: because we preserve whitespace, changes respect the source document format and can be performed with surgical precision.
Naturally, I could create a Roslyn-based generator that consumes the parsed result, but that would only read the tree and project it into another format. If you want something that performs more complex analysis or modification, I could dream something up, or would be happy to be given some concrete tasks.
I see. The goal would be to get an impression of how easy it is to work with and manipulate the AST. But this sample usage is only throw-away code, so we should not put much effort into it. Maybe we could make one that transforms the feature file to a markdown document. It doesn't have to be wrapped in a Roslyn-based generator, but you have mentioned "incremental parsing" and I have no idea how we could verify that (I mean whether the design really supports incremental parsing). But just guessing here. We can even drop this "sample usage" idea if it is too overwhelming.
Also fixes a bug iterating child nodes and tokens
I've dropped a single example into the associated test project which demonstrates a hypothetical analysis: detecting a lack of a blank line between the feature headline and its associated description. It doesn't go into detail about how you could provide a "fix" for this error, but one could be provided if useful.
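A rough sketch of the kind of check described above; the FeatureDeclarationSyntax accessors and SyntaxKind names are assumed to mirror Roslyn conventions and may not match the PR's exact API.

using System.Linq;

static bool HasBlankLineBeforeDescription(FeatureDeclarationSyntax feature)
{
    // Following the usual Roslyn convention, the end-of-line that terminates the
    // "Feature:" headline is trailing trivia of the headline itself, so any
    // end-of-line trivia leading the description represents an extra, blank line.
    return feature.Description.GetLeadingTrivia()
        .Any(trivia => trivia.IsKind(SyntaxKind.EndOfLineTrivia));
}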
Regarding incremental parsing support: there are two things the design of this model does to support this future desire. Firstly, the raw nodes do not have any concept of "ownership"; they can be aggregated by multiple parent nodes and so appear in multiple valid syntax trees simultaneously. Combined with their immutability, this makes them safe to re-use between parses. Secondly, equality checks in the public syntax tree are made fast by doing reference-based checks on the raw nodes. This means that if we were to re-use nodes between incremental parses, a consumer is able to spot just the new nodes very cheaply. This is the primary benefit of incremental parsing for the consumer: an ability to detect changes and only process a subset of the total tree. Additionally, because the tree includes all input text, it is trivial to understand which portion of the tree is directly affected by a change in source text: we just take the span that has been modified and find which nodes that span crosses; all nodes touched have to be parsed again and untouched nodes are safe to re-use.
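A hedged sketch of the consumer-side benefit described above, assuming that public node equality resolves to a reference check on the shared raw nodes; the method names are illustrative, not the final API.

void Regenerate(SyntaxNode node) { /* produce output for this subtree */ }

void Visit(SyntaxNode node, SyntaxNode? previous)
{
    // If the re-parse reused the same raw node, the entire subtree is unchanged
    // and can be skipped without walking it.
    if (previous is not null && node.Equals(previous))
        return;

    Regenerate(node);
}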
Maybe it's not relevant, but how is this related to line positions? What happens if an extra line is inserted at the top of the file (so everything is shifted)? Will everything else below become "invalid"?
Great question! This scenario invalidates all touched positions, but explicit positions exist only in the public model: they are calculated from the widths of raw tokens and trivia. This means we can look at a change like this, process a content insertion, and re-use the raw nodes which have been "pushed around". The model can re-use individual tokens or whole subtrees whilst accepting surrounding content changes.
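An illustrative pseudo-implementation of that idea: raw nodes store widths rather than positions, so an absolute position lives purely in the public layer and is derived by accumulating the full widths of everything to the left (only a single level is shown; the RawNode members used here are assumed shapes).

int StartOf(RawNode parent, RawNode target, int parentStart)
{
    var position = parentStart;
    foreach (var child in parent.Children())
    {
        if (ReferenceEquals(child, target))
            return position;          // widths of earlier siblings give the offset

        position += child.FullWidth;  // width includes the node's own trivia
    }
    throw new System.ArgumentException("target is not a child of parent");
}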
If possible, I'd like to know if there's any feedback on the overall design at this stage. My next step is to implement diagnostics support, and then I want to code-generate the syntax classes (since the code is all boilerplate), so it'd be great to understand if there's anything to correct or alternatives to present.
Thx for the prototype implementation. It is really impressive, and I haven't found any conceptual issue with that.
The analyzer example is useful. It demonstrates the usage well, and in general it is easy to understand. Nevertheless it also shows (which should not be surprising) that you need to know the Gherkin syntax and the parser terminology pretty well: I mean the part where you need to separate the different trivia and try to find a new line between potential comments... Also you need to know the difference between whitespace and end-of-line trivia. This is not a problem, just a fact.
I have made a couple of in-file comments. They are mostly smaller or code-style related issues. Nothing serious.
I have also captured some general notes:
General Impression
It is clear right from the beginning that the code follows the Roslyn parsing school. Microsoft has invested enough effort into that, so I don't think there is much to debate about whether this is good or not - anyway, our goal was to make things easier to use for Roslyn generators/analyzers, so this is fine.
From the example, my first impression was that this AST could be used just as well as the AST we have now, so in principle it could replace our AST to reduce code duplication. This is good news.
With all that being said, for me the codebase was surprisingly huge. 83 files for parsing a feature file header... (OK, we have skeletons for other nodes too, but still.) So for me the message is that to have all the abstractions we stated in the goals, we have to work a lot. All my respect to you. 🙏 Probably we cannot do much about this.
Another thing I noticed is that since it follows the Roslyn terminology, it is pretty hard to understand the concepts without clarifying the terminology first. I have a CS degree, attended parser generator lectures and worked with lex/yacc, antlr, etc. (but never worked with Roslyn) - so I thought it would be easy for me, but it was not. I saw your effort documenting the most important classes; that was a plus. After spending ~3h with the code, this is the list of terms that I noted for myself to understand better (you don't have to answer these at this point, but maybe they could be explained in a concept document for later contributors, or just point to some Roslyn guide that is a must-read first):
- trivia - I understand this term, but I have not used it in this context before
- structured trivia - ???
- syntax trivia - ???
- structured trivia syntax - ??? (you have lost me at this point)
- syntax node vs raw node - i think I understand this, but still not confident
- *Syntax vs Internal.*Syntax - I have some rough feelings, but this is still unclear. I think we should find a better name for one of these, because in the current form, where they only differ in namespace, it is very hard to track
- *Syntax - these are the AST nodes and derive from SyntaxNode, I got that. But why they are named like this and not *SyntaxNode is a mystery for me. For example "scenario declaration syntax" in English means a set of grammar rules that describe how a scenario should look. But in Roslyn it is actually a node in the AST. Mind blowing.
- syntax token - it's a token in AST (although not deriving from Syntax Node)
- raw syntax token - ... 10 seconds ... ah. yeah.
- slot - ??? (sometimes I saw "index" being used)
- leading / trailing trivia - it is still unclear to me whether the leading trivia of one node is the same as the trailing trivia of the previous one and, if not, how do we split? In general I was not sure which node a trivia is attached to (e.g. where a comment between two steps belongs)
Testing
It is clear that for the big amount of code, we also need a big amount of tests. The conceptual question here is how you can really test this without testing the Gherkin parser itself.
In the Gherkin project, our approach was approval testing: we have a standard set of "test files" and expected AST and tokens. As that set of test files has proven to protect the Gherkin parser implementations in the past, I would consider reusing it for this one too.
Basically you would need to parse all sample files and write a "printer" that prints the result in the expected format so that you can compare it with the approved results. (See an example of that here.) With that you could really concentrate your coded tests on the trivia handling.
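A sketch of how that approval-testing approach could look with xUnit; the GherkinSyntaxTree entry point, the SyntaxTreePrinter helper and the file layout are all hypothetical.

[Theory]
[MemberData(nameof(GoodGherkinTestDataFiles))]
public void Parsing_good_testdata_matches_approved_output(string featurePath)
{
    var source = File.ReadAllText(featurePath);
    var tree = GherkinSyntaxTree.ParseText(source);

    // Print the tree in the agreed textual shape so it can be diffed against the
    // approved output checked into the repository.
    var printed = SyntaxTreePrinter.Print(tree.GetRoot());

    Assert.Equal(File.ReadAllText(featurePath + ".approved.txt"), printed);
}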
/// Gets the culture to use when parsing text. This determines the language that will be used for the Gherkin keywords.
/// If the document contains a language directive, the directive takes precedence.
/// </summary>
public CultureInfo Culture { get; private set; }
We need to double check if CultureInfo is a valid representation of the default Gherkin language. Most Gherkin language dialects match CultureInfo names, but not all.
I think this has been an ongoing question for this project.
My gut feeling is that you won't be able to easily represent cultures like en-pirate this way. There are a few things here that are a little bit challenging, since we probably do want some kind of culture support: it's not just the language choice for keywords, but also how we interpret numerical values, etc. (especially if we don't want to be using the process's culture information as a fallback).
I'm open to replacing this with something else, but I'd like us to decide whether we mean "language", "culture" or "language and an implied culture", because there's already enough ambiguity going around.
I agree. We do need the culture for running the step definitions. So far the concept was (not sure we always followed it) that the "language" is the Gherkin language notation (en-pirate & co), which is used for parsing the file, and the "binding culture" is a .NET culture info used for running the step definitions. While you can set the binding culture explicitly, we have a mechanism to automatically derive the binding culture from the feature language (for most cases).
Like everything, this can also be questioned, because it clearly affects only a small portion of the feature file languages, but we cannot do that here in this PR because it would impact the other parser versions as well. So my suggestion is that in this PR we should make a "compatible" parser with the new Roslyn style and apply conceptual changes in a separate step, otherwise it will be very hard to keep things in sync.
So if we were to reflect the current situation, the value is an identifier of the grammar language to be used in parsing and sets a default for the execution culture (with the usual fallbacks). It's not required to be an ISO culture code supported by the host system, so using CultureInfo here is inappropriate. I need to choose an alternative data type, and I'm always hesitant to just use a string when there's an expected subset of values. In this case, could any string be a language identifier (with spaces and emojis), or does it have to follow the ISO culture-code structure?
I don't think you can find anything better than string. As you said, it is an identifier for the Gherkin language setting listed here: https://github.com/cucumber/gherkin/blob/main/gherkin-languages.json . I don't think there is any restriction on the value itself, but so far it has been something built up from Latin letters and -, though I don't think this was ever written down (see https://github.com/cucumber/gherkin/blob/main/CONTRIBUTING.md#adding-or-updating-an-i18n-language).
But I would like to note that the names you see in CultureInfo are not an ISO standard either. They are just a fancy combination of different standards, but the way they are combined was invented by Microsoft. https://learn.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo.name?view=net-9.0 (highlight from me)
The culture name in the format languagecode2-country/regioncode2 [...]. languagecode2 is a lowercase two-letter code as defined in ISO 639-1, or, if no two-letter code is available, a three-letter code as defined in ISO 639-3. country/regioncode2 contains a value defined in ISO 3166 and usually consists of two uppercase letters, or a BCP-47 language tag.
internal GherkinParseOptions(CultureInfo? culture = null)
{
    Culture = culture ?? CultureInfo.CurrentCulture;
The compilation should never depend on the user's culture setting, because that causes strange behaviour for teams with people from different countries and is also problematic on CI. This is why in SpecFlow/Reqnroll the overall default is en-US.
I find this a little odd and wasn't aware of this behaviour in SpecFlow and Reqnroll - the default fallback to the current culture is pretty much standard across .NET libraries. If the culture needs to be set for the purposes of running in a CI build, I think there's an expectation to do it through appropriate configuration, whether that's at the CI level or via a tool-specific configuration file like ours.
I don't mind changing this to match the existing behaviour, but it surprises me that it is the existing behaviour and being surprised makes me want to question it.
Your analogy does not work fully. The analogy would be if you needed to write your floating point numbers in C# depending on the culture setting of the machine where you compile. So the C# code var f = 1.23 that compiles on yours would not compile on my machine (Hungarian culture), as you would need to write var f = 1,23. This is obviously not the case, and this is why the C# language spec deterministically defines how floats are written. The C# compiler still has some options, e.g. code that compiles successfully with nullable disabled might cause warnings with nullable enabled, but it does not depend on environmental settings.
The feature file handling is similar to that. The parser needs to deterministically decide if a Feature keyword is valid or not at compilation time. If we used the default culture then I could not compile your Reqnroll project, as I have the Hungarian culture set, where the feature keyword is Jellemző. Also, in our case the "binding culture" has to be deterministic, because you might write Then the result is 1.23 into the feature file and bind it to a step definition with a double parameter. Since the feature files represent the specification, they have to be unambiguous, so it cannot happen that the test fails on my machine (because Hungarian treats . as a thousands separator) but passes on yours (because in English it is the decimal separator).
We started SpecFlow at a company who had offices in Hungary and Austria so we faced this issue right from the beginning. (Our first implementation used the current culture, but it didn't work in cross-country projects.)
So all in all, this does not behave like a library; it's much more like a programming language.
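Just to make the determinism argument concrete with standard System.Globalization behaviour (this is plain .NET, not project code):

using System;
using System.Globalization;

var value = 1.23;
Console.WriteLine(value.ToString(CultureInfo.GetCultureInfo("en-US"))); // "1.23"
Console.WriteLine(value.ToString(CultureInfo.GetCultureInfo("hu-HU"))); // "1,23"

// Only a fixed binding culture gives "Then the result is 1.23" a single meaning,
// regardless of which machine happens to run the tests.
var parsed = double.Parse("1.23", CultureInfo.GetCultureInfo("en-US")); // parsed == 1.23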
I think Gherkin is special in this regard, because most (if not all) other programming languages are specified using a culture-agnostic system; you don't get keywords in your language of choice for C#, or parsing floating point values with your local number-system.
For Gherkin that is the expectation: use the culture that works best to write a specification and collaborate with your team.
So the surprise for me is just that the default isn't "your local culture", as it would be for other culture-sensitive tools. I appreciate that didn't work for the original authoring project, which obviously steered the choice to an explicit default.
use the culture that works best to write a specification and collaborate with your team [...] So the surprise for me is just that the default isn't "your local culture"
Welcome to my world. It's hard to make statistics, but in most (non-English speaking) countries, many (the majority of?) devs work in a domain culture that is not the same as their local culture. In the last 20 years I have never worked on a Hungarian domain, for example. There have been some Hungarian clients, but those were part of (or planned to be part of) global organizations, so the project domain language was English (in a few cases German). But even in the UK/US you would see many devs who have their Windows set to their native language/culture.
[...] as it would be for other culture-sensitive tools.
It's hard to find a good comparison and culture-sensitivity is a complex topic. You can think about Excel, for example, where you want your references, formulas and calculations to work the same everywhere. Excel supports culture-sensitivity, but it saves the result into the file in a culture-agnostic way. So I can make an Excel sheet that uses Hungarian float numbers and a cell with the formula =ÖSSZEG(A:A) (this is the Hungarian version of SUM), but if you opened the file, you would see the result with UK decimal separators and the cell would show you =SUM(A:A).
I cannot really name any tool where the created artifact would be culture-dependent and interpreted differently on machines with other cultures. That would lead to chaos.
I think what you mean is that if our VS extension were culture-specific, the project wizard would generate a Reqnroll project with the culture of the current user. (So for me it would generate a Reqnroll project with the default language and the sample feature file in Hungarian.) Interestingly, we tried this with SpecFlow, but we got massive resistance from the users (probably because of the reason I mentioned above), so we had to revert back to generating English projects by default.
Thanks for indulging me in this topic. It's been really interesting to see how the decision was made. I think the parallels with Excel are probably the closest to what we have: parse the input in one language but encode the output in an agnostic fashion (for us, C# tests).
Once we've decided how to properly encode the language configuration, I'll make en-US the default here.
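A hedged sketch of that agreed direction for GherkinParseOptions; the final encoding of the language setting is still open, and the exact fallback shown here is an assumption based on the discussion above.

internal GherkinParseOptions(CultureInfo? culture = null)
{
    // Never fall back to CultureInfo.CurrentCulture: the same feature file must
    // parse identically on every developer machine and on CI.
    Culture = culture ?? CultureInfo.GetCultureInfo("en-US");
}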
// Licensed to the .NET Foundation under one or more agreements. |
I think we should put these marker/polyfill classes into a separate folder (.polyfill?) to keep the project structure clean.
class SourceTokenScanner(SourceText source) : ITokenScanner
{
    private readonly SourceText _source = source;
I have enabled using _ for the ctor parameter names in the Reqnroll project to avoid such field declarations/assignments. So you can simply name the constructor parameter _source, as long as it is private.
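With that rule enabled, the explicit field goes away and the primary-constructor parameter is captured directly (sketch of the suggestion only):

class SourceTokenScanner(SourceText _source) : ITokenScanner
{
    // members reference _source directly; the compiler captures it as private state
}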
Reqnroll.CodeAnalysis.Gherkin/Reqnroll.CodeAnalysis.Gherkin.csproj
{
    _cancellationToken.ThrowIfCancellationRequested();

    Debug.Assert(
In my experience, Debug.Assert calls may fail the tests in a very awkward way, where the error is not shown. So I stopped using it in my projects. Maybe there is a good practice around it, but I could not figure out what it is.
We can create our own version that fails in a more test-friendly way - the objective was to have some kind of check that won't be present in the release version, because these should be invariants guaranteed by the public API. Happy to introduce a GherkinDebug class or similar.
I'm fine with that idea.
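A minimal sketch of what such a helper could look like; the GherkinDebug name comes from the comment above and the exact failure behaviour is still to be decided.

using System;
using System.Diagnostics;

internal static class GherkinDebug
{
    [Conditional("DEBUG")]
    public static void Assert(bool condition, string message = "Assertion failed.")
    {
        // Throwing surfaces the failure as an ordinary test failure with a stack
        // trace, instead of the dialog/abort behaviour of Debug.Assert.
        if (!condition)
            throw new InvalidOperationException(message);
    }
}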
namespace Reqnroll.CodeAnalysis.Gherkin.Syntax.Internal;

internal class FeatureFileSyntax : RawNode
In Gherkin this is called a "Gherkin document", so I would not use the term "feature file" here. (At least I was confused about it.)
The question of what to call things is absolutely an open one. I've leaned more towards the domain language I've encountered in teams writing feature files (heh, we really do call it a feature file.) If there's a domain language for the Gherkin syntax that you feel we should adopt, I'd be quite happy to.
We had a lot of discussions about this in the Cucumber/Gherkin community and we got to a consensus, so we should use that.
namespace Reqnroll.CodeAnalysis.Gherkin.Syntax.Internal;

internal class FeatureDeclarationSyntax : DeclarationSyntax
I'm a bit confused about the term "feature declaration". If this will represent the entire feature (vs "feature header") then I would just call it "feature" because the term "declaration" is not used in Gherkin anyway and can be confused with "header". The related rule handler is called "FeatureRuleHandler" anyway.
This naming is a reflection of how Microsoft describes their languages and how my teams have referred to these sections of Gherkin documents. Feature:, Rule: and Scenario: are all examples of "declarations", mostly because they have a colon in them that linguistically declares something.
Additionally, the syntax structure encapsulates everything: the header of the feature, its description and (eventually) all the rules and scenarios. I took inspiration from ClassDeclarationSyntax when considering the structure of features.
As before, if we'd like to use a specific domain language or syntax structure, I'm open to following whatever we think best aligns with our goals, and my experience in this space is heavily skewed by Roslyn's specific quirks.
Like above, I would just stick to the Gherkin domain terminology. I recommend looking at the Gherkin grammar file, which clarifies all terminology-related questions: https://github.com/cucumber/gherkin/blob/main/gherkin.berp
It does not make much sense to diverge. The users of the parser will primarily be domain experts in Gherkin and only secondarily in Roslyn. We have suffered a lot from every Cucumber implementation naming things differently because the Gherkin language was not standardized. Now it is, so let's be happy with it.
Happy to follow that specification. There are bits that don't follow the patterns of the C# language that will make it feel a bit odd in this space, but I'm going to be much happier following a standardised structure.
Do you think Berp could be used to generate any of the structures we need? My current plan was to roll out a Roslyn generator that churned out the boilerplate required for a structure from me just defining the properties. I know how Roslyn generators work, so I know how I could do this. Berp is a lot less clear to me, but it understands the source specification.
Also, is there a way to embed documentation in that specification? The alternative will be linking in some additional external source to join up. I really want to be able to provide detailed XML comments for all publicly accessible types.
Berp parses the grammar file and generates a parser, but with a custom template file you can generate artifacts like this as well (I believe). The only challenge is that it currently only supports Razor templates, which are not so easy to maintain. We can make some experiments with this.
We use Berp in the Gherkin project in a way that it integrates into the compilation pipeline and generates the parser class that is compiled into the result. So the outer view is similar to Roslyn source gen.
using static InternalSyntaxFactory;

internal partial class ParsedSyntaxTreeBuilder
I generally do not like partial classes, but combined with nested types they are really hard to track. We are already in an "Internal" namespace, so I would just promote these nested types into normal types. You can add them to a subfolder if that helps.
In the current form, as the first part of every partial class file is the same, it is too hard to find anything. For example, here in GitHub I see this...
There are a few things that can make nested types better than separate types, like access to private members of the parent, but I absolutely agree about the obnoxious naming making it hard to find things. Early on I suspected I would be using a lot of code generation going forward and started to move the partials out to separate classes to conceptually separate them from the code I expected to remain hand-written. Ironically, the ParsedSyntaxTreeBuilder ends up being one of the classes that is more likely to need to be hand-written, and I was expecting to refactor to full classes at some future point.
using static SyntaxFactory;

public partial class ParsingTests
There will be many sub-aspects of parsing; I would put them into their own classes and not use a partial class. That way it will also be easier to review the test execution results (vs a big flat list).
Thanks again for the quality feedback @gasparnagy - getting direction is what I need to move this forward. Regarding the terminology you've inquired into, let me see if I can clarify:
I hope this helps clarify and generates further feedback. 😁
That is definitely helpful. 🙏 I have just noted that you keep using the term "syntax" even in normal text for something that only makes sense in the Roslyn bubble, and I continuously need to transcribe. 😵💫 So for example when you say "Anything that's not syntax is trivia.", in "normal" language that does not make sense, because trivia (whitespace, comments, etc.) do form part of the syntax of a language! But this just shows how bad an idea it was from MS to shorten "normal language syntax node" to "syntax". We'll get used to it. (The normal-language syntax term is taken from microsoft/TypeScript#1678, where they finally describe trivia as "Syntax trivia represent the parts of the source text that are largely insignificant for normal understanding of the code".) For naming the Internal/Not Internal "*Syntax" classes, my suggestion would be to use *InternalSyntax for the internal ones, like
🤔 What's changed?
This adds a new project, Reqnroll.CodeAnalysis.Gherkin, which provides a lossless syntax tree and a parser implementation intended for use with code generation, code-analysis and code-fix implementations operating over the Reqnroll flavour of the Gherkin language. The syntax tree follows a fundamental rule that all input text is represented in the output tree:
Syntax tokens are grouped by syntax nodes which form the higher-order syntax structure of a feature file.
This initial implementation intends to fix the public API. The internal implementation will have room for performance improvements.
The most important detail is that all tokens, trivia and nodes are immutable and are represented in two flavours:
This model makes it very efficient to create nodes outside of a syntax tree, then insert them into one or more syntax trees, or copy nodes between syntax trees - the raw nodes do not maintain a reference to their parents, so they can appear in multiple hierarchies. This will also make it considerably easier and more efficient to implement any kind of incremental parsing in the future, where nodes that are unaffected by a source change can be copied into the new tree.
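A hedged sketch of the two flavours described above, with names simplified for illustration; the real types live in the Syntax and Syntax.Internal namespaces of the new project and will differ in detail.

internal abstract class RawNode              // inner layer: immutable, parent-less,
{                                            // freely shared between syntax trees
    public abstract int FullWidth { get; }   // width including leading/trailing trivia
}

public readonly struct PublicSyntaxNode      // outer layer: created on demand, adds the
{                                            // absolute position (and, in the real model,
    private readonly RawNode _raw;           // the parent link) on top of the raw node

    internal PublicSyntaxNode(RawNode raw, int position)
    {
        _raw = raw;
        Position = position;
    }

    public int Position { get; }
    public int FullWidth => _raw.FullWidth;
}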
⚡️ What's your motivation?
This model will provide the specific positional information necessary to link emitted code statements in Roslyn generators to their Gherkin source. It will also make it easier to implement more complex transforms, such as embedding the source document into Gherkin Messages. Finally, it will provide a simple and efficient model for driving incremental generation, which can be improved by adding incremental parsing as a future enhancement without needing to change the generator logic.
🏷️ What kind of change is this?
♻️ Anything particular you want feedback on?
📋 Checklist: