Add initial lossless syntax tree implementation #443
base: main
Conversation
I've created this PR at this stage for feedback on what exists so far. I would suggest the following types should be given the most attention:
The codebase intends to emulate the Microsoft.CodeAnalysis concepts for familiarity, using the same general concepts of tokens, trivia and syntax nodes. The public API should not be considered complete (there are many obvious additions that could be made) but should be considered representative of the core behaviour desired by the code generators. The current implementation is able to parse a feature file with a feature declaration, but no scenarios. I've added some support for handling errors, but there's a lot more work needed in that space to handle every kind of failure. Through creating the structures for representing a feature file and the single feature declaration, I can see a lot of opportunity to generate code. Once I get a sense that we're happy with this structure and would like to proceed, I'll investigate how we might generate all the desired structures following the same patterns.
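To give a flavour of the intended consumption, here is a hypothetical usage sketch modelled on Roslyn's CSharpSyntaxTree.ParseText; the GherkinSyntaxTree entry point and the accessors below are assumptions, not the PR's confirmed API.

using System;
using Reqnroll.CodeAnalysis.Gherkin;

// Hypothetical usage sketch; names are illustrative only.
var tree = GherkinSyntaxTree.ParseText(
    "Feature: Calculator\n" +
    "\n" +
    "  Adds two numbers together.\n");

var root = tree.GetRoot();

// Lossless round-trip: the full text of the root node, trivia included,
// should reproduce the original source exactly.
Console.WriteLine(root.ToFullString());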
Thank you. I like the approach of making small steps towards our generation goals. 🤘 To get a design overview of the change, do I see it correctly that this change basically uses the Gherkin parser, but provides an alternative AST builder and token scanner/matcher that fits the purpose? Also, if I understand correctly, this is now an alternative parser that is not being used yet, but would be used for Roslyn code gen and potentially for the legacy CodeDom-based generation as well. How do you see that? I'm very bad at reviewing things that do not yet have a real usage. 😎 Would it be possible to add some simplistic tool (maybe a basic linter or whatever is easiest) that could demonstrate the use of the new parser? It does not have to be something we release. That would probably give a few good inputs for the design (at least for me).
That's right: there's an implementation of
It's intended to be used for Roslyn-based generators, but could be used in any generation space. It could also be very valuable in all IDE plug-ins, where performing code refactoring would be possible by modifying the syntax tree and writing the result back to the source file: because we preserve whitespace, changes respect the source document format and can be performed with surgical precision.
Naturally, I could create a Roslyn-based generator that consumes the parsed result, but that would only read the tree and project it into another format. If you want something that performs more complex analysis or modification, I could dream something up, or would be happy to be given some concrete tasks.
I see. The goal would be to get an impression of how easy it is to work with and manipulate the AST. But this sample usage is only throw-away code, so we should not put much effort into it. Maybe we could make one that transforms the feature file to a markdown document. It doesn't have to be wrapped in a Roslyn-based generator, but you have mentioned "incremental parsing" and I have no idea how we could verify that (I mean whether the design really supports incremental parsing). But just guessing here. We can even drop this "sample usage" idea if it is too overwhelming.
Also fixes a bug iterating child nodes and tokens
I've dropped a single example into the associated test project which demonstrates a hypothetical analysis: detecting a lack of a blank line between the feature headline and its associated description. It doesn't go into detail about how you could provide a "fix" for this error, but one could be provided if useful.
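A rough sketch of the kind of check described above; the FeatureDeclarationSyntax accessors and SyntaxKind names are assumed to mirror Roslyn conventions and may not match the PR's exact API.

using System.Linq;

static bool HasBlankLineBeforeDescription(FeatureDeclarationSyntax feature)
{
    // Following the usual Roslyn convention, the end-of-line that terminates the
    // "Feature:" headline is trailing trivia of the headline itself, so any
    // end-of-line trivia leading the description represents an extra, blank line.
    return feature.Description.GetLeadingTrivia()
        .Any(trivia => trivia.IsKind(SyntaxKind.EndOfLineTrivia));
}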
Regarding incremental parsing support: there are two things the design of this model does to support this future desire. Firstly, the raw nodes do not have any concept of "ownership"; they can be aggregated by multiple parent nodes and so appear in multiple valid syntax trees simultaneously. Combined with their immutability, this makes them safe to re-use between parses. Secondly, equality checks in the public syntax tree are made fast by doing reference-based checks on the raw nodes. This means that if we were to re-use nodes between incremental parses, a consumer is able to spot just the new nodes very cheaply. This is the primary benefit of incremental parsing for the consumer: an ability to detect changes and only process a subset of the total tree. Additionally, because the tree includes all input text, it is trivial to understand which portion of the tree is directly affected by a change in source text: we just take the span that has been modified and find which nodes that span crosses; all nodes touched have to be parsed again and untouched nodes are safe to re-use.
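A hedged sketch of the consumer-side benefit described above, assuming that public node equality resolves to a reference check on the shared raw nodes; the method names are illustrative, not the final API.

void Regenerate(SyntaxNode node) { /* produce output for this subtree */ }

void Visit(SyntaxNode node, SyntaxNode? previous)
{
    // If the re-parse reused the same raw node, the entire subtree is unchanged
    // and can be skipped without walking it.
    if (previous is not null && node.Equals(previous))
        return;

    Regenerate(node);
}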
Maybe it's not relevant, but how is this related to line positions? What happens if an extra line is inserted at the top of the file (so everything is shifted)? Will everything else below become "invalid"?
Great question! This scenario invalidates all touched positions, but explicit positions exist only in the public model: they are calculated from the widths of raw tokens and trivia. This means we can look at a change like this, process a content insertion, and re-use the raw nodes which have been "pushed around". The model can re-use individual tokens or whole subtrees whilst accepting surrounding content changes.
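An illustrative pseudo-implementation of that idea: raw nodes store widths rather than positions, so an absolute position lives purely in the public layer and is derived by accumulating the full widths of everything to the left (only a single level is shown; the RawNode members used here are assumed shapes).

int StartOf(RawNode parent, RawNode target, int parentStart)
{
    var position = parentStart;
    foreach (var child in parent.Children())
    {
        if (ReferenceEquals(child, target))
            return position;          // widths of earlier siblings give the offset

        position += child.FullWidth;  // width includes the node's own trivia
    }
    throw new System.ArgumentException("target is not a child of parent");
}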
If possible, I'd like to know if there's any feedback on the overall design at this stage. My next step is to implement diagnostics support, and then I want to code-generate the syntax classes (since the code is all boilerplate), so it'd be great to understand if there's anything to correct or alternatives to present.
Thx for the prototype implementation. It is really impressive, and I haven't found any conceptual issue with that.
The analyzer example is useful. It demonstrates the usage well, and in general it is easy to understand. Nevertheless it also shows (which should not be surprising) that you need to know the Gherkin syntax and the parser terminology pretty well: I mean the part where you need to separate the different trivia and try to find a new line between potential comments... Also you need to know the difference between whitespace and end-of-line trivia. This is not a problem, just a fact.
I have made a couple of in-file comments. They are mostly smaller or code-style related issues. Nothing serious.
I have also captured some general notes:
General Impression
It is clear right from the beginning that the code follows the Roslyn parsing school. Microsoft has invested enough effort into that, so I don't think there is much to debate about whether this is good or not - anyway, our goal was to make things easier to use for Roslyn generators/analyzers, so this is fine.
From the example, my first impression was that this AST could be used just as well as the AST we have now, so in principle it could replace our AST to reduce code duplication. This is good news.
With all that being said, for me the codebase was surprisingly huge. 83 files for parsing a feature file header... (OK, we have skeletons for other nodes too, but still.) So for me the message is that to have all the abstractions we stated in the goals, we have to work a lot. All my respect to you. 🙏 Probably we cannot do much about this.
Another thing I noticed is that since it follows the Roslyn terminology, it is pretty hard to understand the concepts without clarifying the terminology first. I have a CS degree, attended parser generator lectures and worked with lex/yacc, antlr, etc. (but never worked with Roslyn) - so I thought it would be easy for me, but it was not. I saw your effort documenting the most important classes; that was a plus. After spending ~3h with the code, this is the list of terms that I noted for myself to understand better (you don't have to answer these at this point, but maybe they could be explained in a concept document for later contributors, or just point to some Roslyn guide that is a must-read first):
- trivia - I understand this term, but I have not used it in this context before
- structured trivia - ???
- syntax trivia - ???
- structured trivia syntax - ??? (you have lost me at this point)
- syntax node vs raw node - i think I understand this, but still not confident
- *Syntax vs Internal.*Syntax - I have some rough feelings, but this is still unclear. I think we should find a better name for one of these, because in the current form, where they only differ in namespace, it is very hard to track
- *Syntax - these are the AST nodes and derive from SyntaxNode, I got that. But why they are named like this and not *SyntaxNode is a mystery for me. For example "scenario declaration syntax" in English means a set of grammar rules that describe how a scenario should look. But in Roslyn it is actually a node in the AST. Mind blowing.
- syntax token - it's a token in AST (although not deriving from Syntax Node)
- raw syntax token - ... 10 seconds ... ah. yeah.
- slot - ??? (sometimes I saw "index" being used)
- leading / trailing trivia - it is still unclear to me whether the leading trivia of one node is the same as the trailing trivia of the previous one and, if not, how do we split? In general I was not sure which node a trivia is attached to (e.g. where a comment between two steps belongs)
Testing
It is clear that for the big amount of code, we also need a big amount of tests. The conceptual question here is how you can really test this without testing the Gherkin parser itself.
In the Gherkin project, our approach was approval testing: we have a standard set of "test files" and expected AST and tokens. As that set of test files has proven to protect the Gherkin parser implementations in the past, I would consider reusing it for this one too.
Basically you would need to parse all sample files and write a "printer" that prints the result in the expected format so that you can compare it with the approved results. (See an example of that here.) With that you could really concentrate your coded tests on the trivia handling.
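A sketch of how that approval-testing approach could look with xUnit; the GherkinSyntaxTree entry point, the SyntaxTreePrinter helper and the file layout are all hypothetical.

[Theory]
[MemberData(nameof(GoodGherkinTestDataFiles))]
public void Parsing_good_testdata_matches_approved_output(string featurePath)
{
    var source = File.ReadAllText(featurePath);
    var tree = GherkinSyntaxTree.ParseText(source);

    // Print the tree in the agreed textual shape so it can be diffed against the
    // approved output checked into the repository.
    var printed = SyntaxTreePrinter.Print(tree.GetRoot());

    Assert.Equal(File.ReadAllText(featurePath + ".approved.txt"), printed);
}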
/// Gets the culture to use when parsing text. This determines the language that will be used for the Gherkin keywords.
/// If the document contains a language directive, the directive takes precedence.
/// </summary>
public CultureInfo Culture { get; private set; }
We need to double check if CultureInfo is a valid representation of the default Gherkin language. Most Gherkin language dialects match CultureInfo names, but not all.
I think this has been an ongoing question for this project.
My gut feeling is that you won't be able to easily represent cultures like en-pirate this way. There are a few things here that are a little bit challenging, since we probably do want some kind of culture support: it's not just the language choice for keywords, but also how we interpret numerical values, etc. (especially if we don't want to be using the process's culture information as a fallback).
I'm open to replacing this with something else, but I'd like us to decide whether we mean "language", "culture" or "language and an implied culture", because there's already enough ambiguity going around.
I agree. We do need the culture for running the step definitions. So far the concept was (not sure we always followed it) that the "language" is the Gherkin language notation (en-pirate & co), which is used for parsing the file, and the "binding culture" is a .NET culture info used for running the step definitions. While you can set the binding culture explicitly, we have a mechanism to automatically derive the binding culture from the feature language (for most cases).
Like everything, this can also be questioned, because it clearly affects only a small portion of the feature file languages, but we cannot do that here in this PR because it would impact the other parser versions as well. So my suggestion is that in this PR we should make a "compatible" parser with the new Roslyn style and apply conceptual changes in a separate step, otherwise it will be very hard to keep things in sync.
So if we were to reflect the current situation, the value is an identifier of the grammar language to be used in parsing and sets a default for the execution culture (with the usual fallbacks). It's not required to be an ISO culture code supported by the host system, so using CultureInfo here is inappropriate. I need to choose an alternative data type, and I'm always hesitant to just use a string when there's an expected subset of values. In this case, could any string be a language identifier (with spaces and emojis), or does it have to follow the ISO culture-code structure?
I don't think you can find anything better than string. As you said, it is an identifier for the Gherkin language setting listed here: https://github.com/cucumber/gherkin/blob/main/gherkin-languages.json . I don't think there is any restriction on the value itself, but so far it has been something built up from Latin letters and -, though I don't think this was ever written down (see https://github.com/cucumber/gherkin/blob/main/CONTRIBUTING.md#adding-or-updating-an-i18n-language).
But I would like to note that the names you see in CultureInfo are not an ISO standard either. They are just a fancy combination of different standards, but the way they are combined was invented by Microsoft. https://learn.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo.name?view=net-9.0 (highlight from me)
The culture name in the format languagecode2-country/regioncode2 [...]. languagecode2 is a lowercase two-letter code as defined in ISO 639-1, or, if no two-letter code is available, a three-letter code as defined in ISO 639-3. country/regioncode2 contains a value defined in ISO 3166 and usually consists of two uppercase letters, or a BCP-47 language tag.
internal GherkinParseOptions(CultureInfo? culture = null)
{
    Culture = culture ?? CultureInfo.CurrentCulture;
The compilation should never depend on the user's culture setting, because that causes strange behaviour for teams with people from different countries and is also problematic on CI. This is why in SpecFlow/Reqnroll the overall default is en-US.
I find this a little odd and wasn't aware of this behaviour in SpecFlow and Reqnroll - the default fallback to the current culture is pretty much standard across .NET libraries. If the culture needs to be set for the purposes of running in a CI build, I think there's an expectation to do it through appropriate configuration, whether that's at the CI level or via a tool-specific configuration file like ours.
I don't mind changing this to match the existing behaviour, but it surprises me that it is the existing behaviour and being surprised makes me want to question it.
Your analogy does not work fully. The analogy would be if you needed to write your floating point numbers in C# depending on the culture setting of the machine where you compile. So the C# code var f = 1.23 that compiles on yours would not compile on my machine (Hungarian culture), as you would need to write var f = 1,23. This is obviously not the case, and this is why the C# language spec deterministically defines how floats are written. The C# compiler still has some options, e.g. code that compiles successfully with nullable disabled might cause warnings with nullable enabled, but it does not depend on environmental settings.
The feature file handling is similar to that. The parser needs to deterministically decide if a Feature keyword is valid or not at compilation time. If we used the default culture then I could not compile your Reqnroll project, as I have the Hungarian culture set, where the feature keyword is Jellemző. Also, in our case the "binding culture" has to be deterministic, because you might write Then the result is 1.23 into the feature file and bind it to a step definition with a double parameter. Since the feature files represent the specification, they have to be unambiguous, so it cannot happen that the test fails on my machine (because Hungarian treats . as a thousands separator) but passes on yours (because in English it is the decimal separator).
We started SpecFlow at a company who had offices in Hungary and Austria so we faced this issue right from the beginning. (Our first implementation used the current culture, but it didn't work in cross-country projects.)
So all in all, this does not behave like a library; it's much more like a programming language.
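Just to make the determinism argument concrete with standard System.Globalization behaviour (this is plain .NET, not project code):

using System;
using System.Globalization;

var value = 1.23;
Console.WriteLine(value.ToString(CultureInfo.GetCultureInfo("en-US"))); // "1.23"
Console.WriteLine(value.ToString(CultureInfo.GetCultureInfo("hu-HU"))); // "1,23"

// Only a fixed binding culture gives "Then the result is 1.23" a single meaning,
// regardless of which machine happens to run the tests.
var parsed = double.Parse("1.23", CultureInfo.GetCultureInfo("en-US")); // parsed == 1.23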
I think Gherkin is special in this regard, because most (if not all) other programming languages are specified using a culture-agnostic system; you don't get keywords in your language of choice for C#, or parsing floating point values with your local number-system.
For Gherkin that is the expectation: use the culture that works best to write a specification and collaborate with your team.
So the surprise for me is just that the default isn't "your local culture", as it would be for other culture-sensitive tools. I appreciate that didn't work for the original authoring project, which obviously steered the choice to an explicit default.
use the culture that works best to write a specification and collaborate with your team [...] So the surprise for me is just that the default isn't "your local culture"
Welcome to my world. It's hard to make statistics, but in most (non-English speaking) countries, many (the majority of?) devs work in a domain culture that is not the same as their local culture. In the last 20 years I have never worked on a Hungarian domain, for example. There have been some Hungarian clients, but those were part of (or planned to be part of) global organizations, so the project domain language was English (in a few cases German). But even in the UK/US you would see many devs who have their Windows set to their native language/culture.
[...] as it would be for other culture-sensitive tools.
It's hard to find a good comparison and culture-sensitivity is a complex topic. You can think about Excel, for example, where you want your references, formulas and calculations to work the same everywhere. Excel supports culture-sensitivity, but it saves the result into the file in a culture-agnostic way. So I can make an Excel sheet that uses Hungarian float numbers and a cell with the formula =ÖSSZEG(A:A) (this is the Hungarian version of SUM), but if you opened the file, you would see the result with UK decimal separators and the cell would show you =SUM(A:A).
I cannot really name any tool where the created artifact would be culture-dependent and interpreted differently on machines with other cultures. That would lead to chaos.
I think what you mean is that if our VS extension were culture-specific, the project wizard would generate a Reqnroll project with the culture of the current user. (So for me it would generate a Reqnroll project with the default language and the sample feature file in Hungarian.) Interestingly, we tried this with SpecFlow, but we got massive resistance from the users (probably because of the reason I mentioned above), so we had to revert back to generating English projects by default.
Thanks for indulging me in this topic. It's been really interesting to see how the decision was made. I think the parallels with Excel are probably the closest to what we have: parse the input in one language but encode the output in an agnostic fashion (for us, C# tests).
Once we've decided how to properly encode the language configuration, I'll make en-US the default here.
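A hedged sketch of that agreed direction for GherkinParseOptions; the final encoding of the language setting is still open, and the exact fallback shown here is an assumption based on the discussion above.

internal GherkinParseOptions(CultureInfo? culture = null)
{
    // Never fall back to CultureInfo.CurrentCulture: the same feature file must
    // parse identically on every developer machine and on CI.
    Culture = culture ?? CultureInfo.GetCultureInfo("en-US");
}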
// Licensed to the .NET Foundation under one or more agreements. |
I think we should put these marker/polyfill classes into a separate folder (.polyfill?) to keep the project structure clean.
class SourceTokenScanner(SourceText source) : ITokenScanner
{
    private readonly SourceText _source = source;
I have enabled using _ for the ctor parameter names in the Reqnroll project to avoid such field declarations/assignments. So you can simply name the constructor parameter _source, as long as it is private.
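With that rule enabled, the explicit field goes away and the primary-constructor parameter is captured directly (sketch of the suggestion only):

class SourceTokenScanner(SourceText _source) : ITokenScanner
{
    // members reference _source directly; the compiler captures it as private state
}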
Reqnroll.CodeAnalysis.Gherkin/Reqnroll.CodeAnalysis.Gherkin.csproj
{
    _cancellationToken.ThrowIfCancellationRequested();

    Debug.Assert(
In my experience, Debug.Assert calls may fail the tests in a very awkward way, where the error is not shown. So I stopped using it in my projects. Maybe there is a good practice around it, but I could not figure out what it is.
We can create our own version that fails in a more test-friendly way - the objective was to have some kind of check that won't be present in the release version, because these should be invariants guaranteed by the public API. Happy to introduce a GherkinDebug class or similar.
I'm fine with that idea.
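A minimal sketch of what such a helper could look like; the GherkinDebug name comes from the comment above and the exact failure behaviour is still to be decided.

using System;
using System.Diagnostics;

internal static class GherkinDebug
{
    [Conditional("DEBUG")]
    public static void Assert(bool condition, string message = "Assertion failed.")
    {
        // Throwing surfaces the failure as an ordinary test failure with a stack
        // trace, instead of the dialog/abort behaviour of Debug.Assert.
        if (!condition)
            throw new InvalidOperationException(message);
    }
}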
namespace Reqnroll.CodeAnalysis.Gherkin.Syntax.Internal;

internal class FeatureFileSyntax : RawNode
In Gherkin this is called a "Gherkin document", so I would not use the term "feature file" here. (At least I was confused about it.)
The question of what to call things is absolutely an open one. I've leaned more towards the domain language I've encountered in teams writing feature files (heh, we really do call it a feature file.) If there's a domain language for the Gherkin syntax that you feel we should adopt, I'd be quite happy to.
We had a lot of discussions about this in the Cucumber/Gherkin community and we got to a consensus, so we should use that.
namespace Reqnroll.CodeAnalysis.Gherkin.Syntax.Internal;

internal class FeatureDeclarationSyntax : DeclarationSyntax
I'm a bit confused about the term "feature declaration". If this will represent the entire feature (vs "feature header") then I would just call it "feature" because the term "declaration" is not used in Gherkin anyway and can be confused with "header". The related rule handler is called "FeatureRuleHandler" anyway.
This naming is a reflection of how Microsoft describes their languages and how my teams have referred to these sections of Gherkin documents. Feature:, Rule: and Scenario: are all examples of "declarations", mostly because they have a colon in them that linguistically declares something.
Additionally, the syntax structure encapsulates everything: the header of the feature, its description and (eventually) all the rules and scenarios. I took inspiration from ClassDeclarationSyntax when considering the structure of features.
As before, if we'd like to use a specific domain language or syntax structure, I'm open to following whatever we think best aligns with our goals, and my experience in this space is heavily skewed by Roslyn's specific quirks.
Like above, I would just stick to the Gherkin domain terminology. I recommend looking at the Gherkin grammar file, which clarifies all terminology-related questions: https://github.com/cucumber/gherkin/blob/main/gherkin.berp
It does not make much sense to diverge. The users of the parser will primarily be domain experts in Gherkin and only secondarily in Roslyn. We have suffered a lot from every Cucumber implementation naming things differently because the Gherkin language was not standardized. Now it is, so let's be happy with it.
Happy to follow that specification. There are bits that don't follow the patterns of the C# language that will make it feel a bit odd in this space, but I'm going to be much happier following a standardised structure.
Do you think Berp could be used to generate any of the structures we need? My current plan was to roll out a Roslyn generator that churned out the boilerplate required for a structure from me just defining the properties. I know how Roslyn generators work, so I know how I could do this. Berp is a lot less clear to me, but it understands the source specification.
Also, is there a way to embed documentation in that specification? The alternative will be linking in some additional external source to join up. I really want to be able to provide detailed XML comments for all publicly accessible types.
Berp parses the grammar file and generates a parser, but with a custom template file you can generate artifacts like this as well (I believe). The only challenge is that it currently only supports Razor templates, which are not so easy to maintain. We can make some experiments with this.
We use Berp in the Gherkin project in a way that it integrates into the compilation pipeline and generates the parser class that is compiled into the result. So the outer view is similar to Roslyn source gen.
using static InternalSyntaxFactory;

internal partial class ParsedSyntaxTreeBuilder
I generally do not like partial classes, but combined with nested types they are really hard to track. We are already in an "Internal" namespace, so I would just promote these nested types into normal types. You can add them to a subfolder if that helps.
In the current form, as the first part of every partial class file is the same, it is too hard to find anything. For example, here in GitHub I see this...
There are a few things that can make nested types better than separate types, like access to private members of the parent, but I absolutely agree about the obnoxious naming making it hard to find things. Early on I suspected I would be using a lot of code generation going forward and started to move the partials out to separate classes to conceptually separate them from the code I expected to remain hand-written. Ironically, the ParsedSyntaxTreeBuilder ends up being one of the classes that is more likely to need to be hand-written, and I was expecting to refactor to full classes at some future point.
using static SyntaxFactory;

public partial class ParsingTests
There will be many sub-aspects of parsing; I would put them into their own classes and not use a partial class. That way it will also be easier to review the test execution results (vs a big flat list).
Thanks again for the quality feedback @gasparnagy - getting direction is what I need to move this forward. Regarding the terminology you've inquired into, let me see if I can clarify:
I hope this helps clarify and generates further feedback. 😁
That is definitely helpful. 🙏 I have just noted that you keep using the term "syntax" even in normal text for something that only makes sense in the Roslyn bubble, and I continuously need to transcribe. 😵💫 So for example when you say "Anything that's not syntax is trivia.", in "normal" language that does not make sense, because trivia (whitespace, comments, etc.) do form part of the syntax of a language! But this just shows how bad an idea it was from MS to shorten "normal language syntax node" to "syntax". We'll get used to it. (The normal-language syntax term is taken from microsoft/TypeScript#1678, where they finally describe trivia as "Syntax trivia represent the parts of the source text that are largely insignificant for normal understanding of the code".) For naming the Internal/Not Internal "*Syntax" classes, my suggestion would be to use *InternalSyntax for the internal ones, like
🤔 What's changed?
This adds a new project, Reqnroll.CodeAnalysis.Gherkin, which provides a lossless syntax tree and a parser implementation intended for use with code generation, code-analysis and code-fix implementations operating over the Reqnroll flavour of the Gherkin language. The syntax tree follows a fundamental rule that all input text is represented in the output tree:
Syntax tokens are grouped by syntax nodes which form the higher-order syntax structure of a feature file.
This initial implementation intends to fix the public API. The internal implementation will have room for performance improvements.
The most important detail is that all tokens, trivia and nodes are immutable and are represented in two flavours:
This model makes it very efficient to create nodes outside of a syntax tree, then insert them into one or more syntax trees, or copy nodes between syntax trees - the raw nodes do not maintain a reference to their parents, so they can appear in multiple hierarchies. This will also make it considerably easier and more efficient to implement any kind of incremental parsing in the future, where nodes that are unaffected by a source change can be copied into the new tree.
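A hedged sketch of the two flavours described above, with names simplified for illustration; the real types live in the Syntax and Syntax.Internal namespaces of the new project and will differ in detail.

internal abstract class RawNode              // inner layer: immutable, parent-less,
{                                            // freely shared between syntax trees
    public abstract int FullWidth { get; }   // width including leading/trailing trivia
}

public readonly struct PublicSyntaxNode      // outer layer: created on demand, adds the
{                                            // absolute position (and, in the real model,
    private readonly RawNode _raw;           // the parent link) on top of the raw node

    internal PublicSyntaxNode(RawNode raw, int position)
    {
        _raw = raw;
        Position = position;
    }

    public int Position { get; }
    public int FullWidth => _raw.FullWidth;
}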
⚡️ What's your motivation?
This model will provide the specific positional information necessary to link emitted code statements in Roslyn generators to their Gherkin source. It will also make it easier to implement more complex transforms, such as embedding the source document into Gherkin Messages. Finally, it will provide a simple and efficient model for driving incremental generation, which can be improved by adding incremental parsing as a future enhancement without needing to change the generator logic.
🏷️ What kind of change is this?
♻️ Anything particular you want feedback on?
📋 Checklist: