-
Notifications
You must be signed in to change notification settings - Fork 66
Refactor KQL parser into common base and language extensions #1734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1734 +/- ##
==========================================
- Coverage 84.08% 84.07% -0.02%
==========================================
Files 469 474 +5
Lines 137651 136635 -1016
==========================================
- Hits 115746 114876 -870
+ Misses 21371 21225 -146
Partials 534 534
🚀 New features to boost your workflow:
|
lquerel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
But before merging, I would like to get feedback from @CodeBlanch and @drewrelmas given the scale of the impact on KQL.
drewrelmas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not had sufficient bandwidth to do an in-depth review of this change yet, but my initial impression is that this is the right direction! I would definitely prefer moving forward with this first over #1722.
I think this is some confirmation that the expression tree / IL approach we took initially will pay off and allow any future extension to other languages. I envision the eventual processor to allow selection of parser in input configuration (or multiple 'contrib' processors built around specific parsers that leverage the same engine internally) to meet different end-user requirements.
|
Converting to draft after feedback from Jan 7th SIG meeting - this needs some refactoring |
|
On the Jan 7th SIG call, we came to the conclusion that sharing the Pest grammar was untenable and it would be best if an OPL parser had its own Pest grammar. There was thinking at the time that maybe we could still share the parser code, which translates the Pest rules into our expression AST. On the surface, this seems desirable because there's a lot of code in the kql-parser crate that would need to be duplicated otherwise. However as I dig into this more, I'm realizing that without a shared grammar, sharing this parser code is probably not the best approach, and OPL should probably just implement its own parser. In the paragraphs below I'll explain why. In the shared parser code model, I imagine we'll make our parser utilities accept derived What's not captured anywhere in this scheme is the hierarchical relationship between the rules and how the parser handles them, and this could result in subtle bugs. For example, consider how kql-parser parses null literals. The rule hierarchy looks like otel-arrow/rust/experimental/query_engine/kql-parser/src/scalar_expression.rs Lines 69 to 75 in 9b0aa84
otel-arrow/rust/experimental/query_engine/kql-parser/src/scalar_primitive_expressions.rs Lines 13 to 31 in 9b0aa84
Let's say for some reason that KQL needs needs to reorganize its grammar, and diff --git a/rust/experimental/query_engine/kql-parser/src/kql.pest b/rust/experimental/query_engine/kql-parser/src/kql.pest
index c2bc4ced..9d65cd1f 100644
--- a/rust/experimental/query_engine/kql-parser/src/kql.pest
+++ b/rust/experimental/query_engine/kql-parser/src/kql.pest
@@ -134,8 +134,7 @@ dynamic_map_expression = { "{" ~ (dynamic_map_item_expression ~ ("," ~ dynamic_m
dynamic_inner_expression = _{ dynamic_array_expression|dynamic_map_expression|type_unary_expressions }
dynamic_expression = { "dynamic" ~ "(" ~ dynamic_inner_expression ~ ")" }
type_unary_expressions = {
- null_literal
- | real_expression
+ real_expression
| datetime_expression
| time_expression
| regex_expression
@@ -237,7 +236,8 @@ backwards. For example if integer_literal is defined before time_expression "1h"
would be parsed as integer_literal(1) and the remaining "h" would be fed into
the next rule. */
scalar_unary_expression = {
- type_unary_expressions
+ null_literal
+ | type_unary_expressions
| get_type_expression
| conditional_unary_expressions
| conversion_unary_expressions
diff --git a/rust/experimental/query_engine/kql-parser/src/scalar_expression.rs b/rust/experimental/query_engine/kql-parser/src/scalar_expression.rs
index 0f45a327..36d56325 100644
--- a/rust/experimental/query_engine/kql-parser/src/scalar_expression.rs
+++ b/rust/experimental/query_engine/kql-parser/src/scalar_expression.rs
@@ -296,6 +296,9 @@ pub(crate) fn parse_scalar_unary_expression(
let rule = scalar_unary_expression_rule.into_inner().next().unwrap();
Ok(match rule.as_rule() {
+ Rule::null_literal => {
+ ScalarExpression::Static(parse_standard_null_literal(rule))
+ }
Rule::type_unary_expressions => {
ScalarExpression::Static(parse_type_unary_expressions(rule)?)
}
diff --git a/rust/experimental/query_engine/kql-parser/src/scalar_primitive_expressions.rs b/rust/experimental/query_engine/kql-parser/src/scalar_primitive_expressions.rs
index 58db3601..6a33990a 100644
--- a/rust/experimental/query_engine/kql-parser/src/scalar_primitive_expressions.rs
+++ b/rust/experimental/query_engine/kql-parser/src/scalar_primitive_expressions.rs
@@ -16,7 +16,6 @@ pub(crate) fn parse_type_unary_expressions(
let rule = type_unary_expressions_rule.into_inner().next().unwrap();
Ok(match rule.as_rule() {
- Rule::null_literal => parse_standard_null_literal(rule),
Rule::real_expression => parse_real_expression(rule)?,
Rule::datetime_expression => parse_datetime_expression(rule)?,
Rule::time_expression => parse_timespan_expression(rule)?,This works, and all the kql-parser tests will pass. However, if OPL had been using the same organization of its grammar rules, the parsing using the shared code would fail for OPL unless it also made the same adjustment to its grammar (not only would it fail, it would panic at scalar_primitive_expressions.rs::28). This brings me to my first point, which is that without a shared grammar either:
The second difficulty I see in sharing this parser code is that OPL may wish to make modifications to expressions relatively deep within the expression tree. For example, OPL might wish to support untyped null literal (whereas KQL requires parsing null as When parsing strings, for example, we wind up in a call stack like: and at the bottom of this call stack, we need to call some custom OPL specific string parsing code. To accommodate this TL;DR - without a shared grammar, sharing the parser code would lead to a brittle parser implementation OPL unless it implements its own test suite (which means half the code from kql-parser kind of gets duplicated anyway) and also introduces extra complexity into the kql-parser to support custom behaviour for certain types. Given these drawbacks, and the benefits/desire to be masters of our own destiny, I propose OPL just implement its own parser. |
closes #1728
In #1722 we added an
if/elseexpression to the KQL parser that may not be supported by all engines that execute some KQL query.Seeing as this is a departure from standard KQL, and there may be other new expressions added in the future (such as
route_ro), a need was identified to have a common base grammar with the ability to add/parse language extensions. This PR implements this capability.The pest grammar is split into
base.pestandkql.pest, and the KQL parser now uses both these grammar files using pest's "load multiple grammars" feature.A second parser is added for a hypothetical, KQL-inpisred query-language that will support the
if/elseexpression and other future extensions. This is added as a new crate underrust/otal-dataflow/crates/opl. It also usesbase.pest, and another grammar file calledopl.pestwhere future language extensions will be added.Making the parser functions generic:
There are many parser utility functions in the
kql-parsercrate that ideally all KQL derived parses could use. Inside these functions, there are many checks for which variant of theRuleenum (derived bypest_derive::Parser) is being handled. For example:otel-arrow/rust/experimental/query_engine/kql-parser/src/scalar_expression.rs
Lines 353 to 361 in 5d48012
One challenge in making these functions generic is that they need to operate on a concrete enum type, and
pest_derive::Parserproc macro generates a differentRuleenum for every parser.The solution in this PR is to derive a base
pest_derive::Parser(and hence, the associatedRuleenum), for thebase.pestgrammar. This is done inkql-parser/src/base_parser.rs. The parser utility functions for each expression are then made generic over a type of pestRulethat can be converted intobase_parser::Rule. A trait is provided calledTryAsBaseRulethat encapsulates the conversion logic, so in this PR we see many functions changed to take a genericRwherepest::iterators::Pair<'_, R>implementsTryAsBaseRule.Implementing the
TryIntofor some derivedRuleintobase_parser::Rulewould be somewhat tedious to do by hand, because we'd need to write a match with a branch for every variant of the enum, and update these conversion functions each time we add a new rule to base.pest. To avoid this, a procedural macro is created to generate this conversion code. The macro lives inkql-parser/src/macros, and parsers simply need to use this proc macro to make their rules compatible with the generic parser functions:One additional challenge related to parser functions generic is that the
scalar_expression::parse_scalar_expressionuses aPrattParser, which takes thePairas its argument. Pest doesn't provide any simple way convert aPair<'_, R>toPair<'_, base_parser::Rule>, which means that the genericparse_scalar_expressionfunction also needs a way to generically create aPrattParserthat acceptsPair<'_, R>.To handle that challenge, the
BaseRuleCompatibleprocedural macro also creates aPrattParserfor the derivedRule, and implements a trait for theRuleenum that can be used to access thePrattParser. The trait is calledScalarExprRules, and so this trait bound is also added to the generic parser function signatures.