@albertlockett albertlockett commented Jan 7, 2026

closes #1728

In #1722 we added an if/else expression to the KQL parser that may not be supported by every engine that executes KQL queries.

Seeing as this is a departure from standard KQL, and there may be other new expressions added in the future (such as route_ro), a need was identified to have a common base grammar with the ability to add/parse language extensions. This PR implements this capability.

The pest grammar is split into base.pest and kql.pest, and the KQL parser now uses both these grammar files using pest's "load multiple grammars" feature.

A second parser is added for a hypothetical, KQL-inspired query language that will support the if/else expression and other future extensions. This is added as a new crate under rust/otap-dataflow/crates/opl. It also uses base.pest, along with another grammar file called opl.pest where future language extensions will be added.

Making the parser functions generic:

There are many parser utility functions in the kql-parser crate that ideally all KQL-derived parsers could use. Inside these functions, there are many checks for which variant of the Rule enum (derived by pest_derive::Parser) is being handled. For example:

Rule::conditional_unary_expressions => parse_conditional_unary_expressions(rule, scope)?,
Rule::conversion_unary_expressions => parse_conversion_unary_expressions(rule, scope)?,
Rule::string_unary_expressions => parse_string_unary_expressions(rule, scope)?,
Rule::parse_unary_expressions => parse_parse_unary_expressions(rule, scope)?,
Rule::array_unary_expressions => parse_array_unary_expressions(rule, scope)?,
Rule::math_unary_expressions => parse_math_unary_expressions(rule, scope)?,
Rule::temporal_unary_expressions => parse_temporal_unary_expressions(rule, scope)?,
Rule::logical_unary_expressions => parse_logical_unary_expressions(rule, scope)?,
Rule::extract_json_expression => {

One challenge in making these functions generic is that they need to operate on a concrete enum type, and the pest_derive::Parser proc macro generates a different Rule enum for every parser.

The solution in this PR is to derive a base pest_derive::Parser (and hence, the associated Rule enum), for the base.pest grammar. This is done in kql-parser/src/base_parser.rs. The parser utility functions for each expression are then made generic over a type of pest Rule that can be converted into base_parser::Rule. A trait is provided called TryAsBaseRule that encapsulates the conversion logic, so in this PR we see many functions changed to take a generic R where pest::iterators::Pair<'_, R> implements TryAsBaseRule.
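As a rough illustration of the idea (not the code from this PR), the sketch below uses small hand-written enums in place of the pest-generated Rule enums and implements the trait on the enums directly rather than on pest::iterators::Pair<'_, R>. The variant names, the `try_as_base_rule` method signature, and the `describe_literal` helper are all assumptions made for the example.

```rust
// Stand-ins for the Rule enums that pest_derive would generate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BaseRule {
    NullLiteral,
    IntegerLiteral,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OplRule {
    NullLiteral,
    IntegerLiteral,
    // Extension rule that has no equivalent in the base grammar.
    IfElseExpression,
}

/// Hypothetical shape of the conversion trait (the real signature may differ).
pub trait TryAsBaseRule {
    fn try_as_base_rule(&self) -> Option<BaseRule>;
}

impl TryAsBaseRule for BaseRule {
    fn try_as_base_rule(&self) -> Option<BaseRule> {
        Some(*self)
    }
}

impl TryAsBaseRule for OplRule {
    fn try_as_base_rule(&self) -> Option<BaseRule> {
        match self {
            OplRule::NullLiteral => Some(BaseRule::NullLiteral),
            OplRule::IntegerLiteral => Some(BaseRule::IntegerLiteral),
            OplRule::IfElseExpression => None,
        }
    }
}

/// A shared utility written once against BaseRule and callable with any
/// rule type that can be converted to it.
pub fn describe_literal<R: TryAsBaseRule>(rule: &R) -> Result<&'static str, String> {
    match rule.try_as_base_rule() {
        Some(BaseRule::NullLiteral) => Ok("null"),
        Some(BaseRule::IntegerLiteral) => Ok("integer"),
        None => Err("rule has no base equivalent".to_string()),
    }
}
```

With this shape, `describe_literal(&OplRule::IntegerLiteral)` goes through the same match as the base rule, while an extension-only rule such as `OplRule::IfElseExpression` is rejected and has to be handled by the derived parser itself.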

Implementing the TryInto for some derived Rule into base_parser::Rule would be somewhat tedious to do by hand, because we'd need to write a match with a branch for every variant of the enum, and update these conversion functions each time we add a new rule to base.pest. To avoid this, a procedural macro is created to generate this conversion code. The macro lives in kql-parser/src/macros, and parsers simply need to use this proc macro to make their rules compatible with the generic parser functions:

#[derive(Parser, BaseRuleCompatible)]
#[grammar = "base.pest"]
#[grammar = "kql.pest"]
pub(crate) struct KqlPestParser;
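To give a feel for the kind of code such a derive saves us from writing, here is a minimal sketch that uses a declarative macro_rules! macro as a stand-in for the procedural macro. It is not the macro from this PR: the `impl_base_rule_compatible!` name, the `to_base` helper, and the tiny enums are hypothetical, and unlike the derive the shared variants have to be listed by hand.

```rust
use std::convert::TryFrom;

// Stand-ins for pest-generated Rule enums (pest uses snake_case variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[allow(non_camel_case_types)]
pub enum BaseRule {
    null_literal,
    string_literal,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[allow(non_camel_case_types)]
pub enum KqlRule {
    null_literal,
    string_literal,
    kql_only_rule,
}

/// Hypothetical stand-in for the derive: generates the name-for-name match
/// from a derived Rule enum into the base Rule enum.
macro_rules! impl_base_rule_compatible {
    ($derived:ident => $base:ident { $($variant:ident),* $(,)? }) => {
        impl TryFrom<$derived> for $base {
            type Error = String;

            fn try_from(rule: $derived) -> Result<Self, Self::Error> {
                match rule {
                    $( $derived::$variant => Ok($base::$variant), )*
                    other => Err(format!("{other:?} is not a base rule")),
                }
            }
        }
    };
}

impl_base_rule_compatible!(KqlRule => BaseRule { null_literal, string_literal });

/// Small helper so callers do not need TryFrom in scope.
pub fn to_base(rule: KqlRule) -> Result<BaseRule, String> {
    BaseRule::try_from(rule)
}
```

Here `to_base(KqlRule::null_literal)` yields `Ok(BaseRule::null_literal)`, and a rule that exists only in the derived grammar yields an `Err`.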

One additional challenge related to making the parser functions generic is that scalar_expression::parse_scalar_expression uses a PrattParser, which takes the Pair as its argument. Pest doesn't provide any simple way to convert a Pair<'_, R> to Pair<'_, base_parser::Rule>, which means that the generic parse_scalar_expression function also needs a way to generically create a PrattParser that accepts Pair<'_, R>.

To handle that challenge, the BaseRuleCompatible procedural macro also creates a PrattParser for the derived Rule, and implements a trait for the Rule enum that can be used to access the PrattParser. The trait is called ScalarExprRules, and so this trait bound is also added to the generic parser function signatures.
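The pattern can be sketched roughly as follows. This is a simplified illustration and not the generated code: `OperatorTable` is a hypothetical stand-in for pest's PrattParser (it only records infix precedence), and the `operator_table` method name, the `binds_tighter` helper, and the enum variants are assumptions.

```rust
/// Minimal stand-in for a PrattParser: maps infix operator rules to a precedence.
pub struct OperatorTable<R: 'static> {
    infix: &'static [(R, u8)],
}

impl<R: Copy + PartialEq + 'static> OperatorTable<R> {
    /// Returns the precedence of `rule`, or None if it is not an infix operator.
    pub fn precedence(&self, rule: R) -> Option<u8> {
        self.infix.iter().find(|(r, _)| *r == rule).map(|(_, p)| *p)
    }
}

/// Hypothetical shape of the trait: every Rule enum exposes its own table.
pub trait ScalarExprRules: Copy + PartialEq + 'static {
    fn operator_table() -> &'static OperatorTable<Self>;
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KqlRule {
    Plus,
    Multiply,
    Operand,
}

// In the PR this part would be produced by the derive for each Rule enum.
static KQL_TABLE: OperatorTable<KqlRule> = OperatorTable {
    infix: &[(KqlRule::Plus, 1), (KqlRule::Multiply, 2)],
};

impl ScalarExprRules for KqlRule {
    fn operator_table() -> &'static OperatorTable<Self> {
        &KQL_TABLE
    }
}

/// Generic code reaches the per-Rule table through the trait bound instead of
/// a single global parser tied to one concrete Rule type.
pub fn binds_tighter<R: ScalarExprRules>(a: R, b: R) -> Option<bool> {
    let table = R::operator_table();
    Some(table.precedence(a)? > table.precedence(b)?)
}
```

The point of the design is that a generic function only needs `R: ScalarExprRules` in its signature to obtain an operator table typed for that same `R`, so no Pair conversion is needed.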

@albertlockett albertlockett requested a review from a team as a code owner January 7, 2026 17:12
@github-actions github-actions bot added rust Pull requests that update Rust code query-engine Query Engine / Transform related tasks query-engine-kql KQL usage of Query Engine labels Jan 7, 2026
codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 95.58600% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.07%. Comparing base (32a6fbb) to head (f608b86).
⚠️ Report is 18 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1734      +/-   ##
==========================================
- Coverage   84.08%   84.07%   -0.02%     
==========================================
  Files         469      474       +5     
  Lines      137651   136635    -1016     
==========================================
- Hits       115746   114876     -870     
+ Misses      21371    21225     -146     
  Partials      534      534              
Components Coverage Δ
otap-dataflow 85.32% <81.05%> (-0.03%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.42% <98.04%> (+0.03%) ⬆️
syslog_cef_receivers ∅ <ø> (∅)
otel-arrow-go 53.50% <ø> (ø)

@lquerel lquerel left a comment

LGTM.

But before merging, I would like to get feedback from @CodeBlanch and @drewrelmas given the scale of the impact on KQL.

@drewrelmas drewrelmas left a comment

I've not had sufficient bandwidth to do an in-depth review of this change yet, but my initial impression is that this is the right direction! I would definitely prefer moving forward with this first over #1722.

I think this is some confirmation that the expression tree / IL approach we took initially will pay off and allow any future extension to other languages. I envision the eventual processor to allow selection of parser in input configuration (or multiple 'contrib' processors built around specific parsers that leverage the same engine internally) to meet different end-user requirements.

@albertlockett albertlockett marked this pull request as draft January 7, 2026 22:26
@albertlockett

Converting to draft after feedback from Jan 7th SIG meeting - this needs some refactoring

albertlockett commented Jan 8, 2026

On the Jan 7th SIG call, we came to the conclusion that sharing the Pest grammar was untenable and it would be best if an OPL parser had its own Pest grammar.

There was thinking at the time that maybe we could still share the parser code, which translates the Pest rules into our expression AST. On the surface, this seems desirable because there's a lot of code in the kql-parser crate that would need to be duplicated otherwise.

However, as I dig into this more, I'm realizing that without a shared grammar, sharing this parser code is probably not the best approach, and OPL should probably just implement its own parser. In the paragraphs below I'll explain why.

In the shared parser code model, I imagine we'll make our parser utilities accept derived Rules from either the KQL or OPL parser (as was implemented in this PR). Effectively, this would mean that we implement TryInto<kql_parser::Rule> for opl_parser::Rule, which simply converts one Rule to the other for the enum variant with the same name. Our parser functions are generic over R: TryInto<kql_parser::Rule>.

What's not captured anywhere in this scheme is the hierarchical relationship between the rules and how the parser handles them, and this could result in subtle bugs.

For example, consider how kql-parser parses null literals. The rule hierarchy looks like scalar_expression -> scalar_unary_expression -> type_unary_expressions -> null_literal. Accordingly, when parsing, we call scalar_expression::parse_scalar_unary_expression, which matches the rule to type_unary_expressions and calls scalar_primitive_expressions::parse_type_unary_expressions, which in turn handles each typed scalar variant:

pub(crate) fn parse_scalar_expression(
    scalar_expression_rule: Pair<Rule>,
    scope: &dyn ParserScope,
) -> Result<ScalarExpression, ParserError> {
    PRATT_PARSER
        .map_primary(|primary| match primary.as_rule() {
            Rule::scalar_unary_expression => parse_scalar_unary_expression(primary, scope),

pub(crate) fn parse_type_unary_expressions(
    type_unary_expressions_rule: Pair<Rule>,
) -> Result<StaticScalarExpression, ParserError> {
    let rule = type_unary_expressions_rule.into_inner().next().unwrap();
    Ok(match rule.as_rule() {
        Rule::null_literal => parse_standard_null_literal(rule),
        Rule::real_expression => parse_real_expression(rule)?,
        Rule::datetime_expression => parse_datetime_expression(rule)?,
        Rule::time_expression => parse_timespan_expression(rule)?,
        Rule::regex_expression => parse_regex_expression(rule)?,
        Rule::dynamic_expression => parse_dynamic_expression(rule)?,
        Rule::true_literal | Rule::false_literal => parse_standard_bool_literal(rule),
        Rule::double_literal => parse_standard_double_literal(rule, None)?,
        Rule::integer_literal => parse_standard_integer_literal(rule)?,
        Rule::string_literal => parse_string_literal(rule),
        _ => panic!("Unexpected rule in type_unary_expressions: {rule}"),
    })
}

Let's say for some reason that KQL needs to reorganize its grammar, and null_literal becomes a child of scalar_unary_expression:

diff --git a/rust/experimental/query_engine/kql-parser/src/kql.pest b/rust/experimental/query_engine/kql-parser/src/kql.pest
index c2bc4ced..9d65cd1f 100644
--- a/rust/experimental/query_engine/kql-parser/src/kql.pest
+++ b/rust/experimental/query_engine/kql-parser/src/kql.pest
@@ -134,8 +134,7 @@ dynamic_map_expression = { "{" ~ (dynamic_map_item_expression ~ ("," ~ dynamic_m
 dynamic_inner_expression = _{ dynamic_array_expression|dynamic_map_expression|type_unary_expressions }
 dynamic_expression = { "dynamic" ~ "(" ~ dynamic_inner_expression ~ ")" }
 type_unary_expressions = {
-    null_literal
-    | real_expression
+    real_expression
     | datetime_expression
     | time_expression
     | regex_expression
@@ -237,7 +236,8 @@ backwards. For example if integer_literal is defined before time_expression "1h"
 would be parsed as integer_literal(1) and the remaining "h" would be fed into
 the next rule. */
 scalar_unary_expression = {
-    type_unary_expressions
+    null_literal
+    | type_unary_expressions
     | get_type_expression
     | conditional_unary_expressions
     | conversion_unary_expressions
diff --git a/rust/experimental/query_engine/kql-parser/src/scalar_expression.rs b/rust/experimental/query_engine/kql-parser/src/scalar_expression.rs
index 0f45a327..36d56325 100644
--- a/rust/experimental/query_engine/kql-parser/src/scalar_expression.rs
+++ b/rust/experimental/query_engine/kql-parser/src/scalar_expression.rs
@@ -296,6 +296,9 @@ pub(crate) fn parse_scalar_unary_expression(
     let rule = scalar_unary_expression_rule.into_inner().next().unwrap();
 
     Ok(match rule.as_rule() {
+        Rule::null_literal => {
+            ScalarExpression::Static(parse_standard_null_literal(rule))
+        }
         Rule::type_unary_expressions => {
             ScalarExpression::Static(parse_type_unary_expressions(rule)?)
         }
diff --git a/rust/experimental/query_engine/kql-parser/src/scalar_primitive_expressions.rs b/rust/experimental/query_engine/kql-parser/src/scalar_primitive_expressions.rs
index 58db3601..6a33990a 100644
--- a/rust/experimental/query_engine/kql-parser/src/scalar_primitive_expressions.rs
+++ b/rust/experimental/query_engine/kql-parser/src/scalar_primitive_expressions.rs
@@ -16,7 +16,6 @@ pub(crate) fn parse_type_unary_expressions(
     let rule = type_unary_expressions_rule.into_inner().next().unwrap();
 
     Ok(match rule.as_rule() {
-        Rule::null_literal => parse_standard_null_literal(rule),
         Rule::real_expression => parse_real_expression(rule)?,
         Rule::datetime_expression => parse_datetime_expression(rule)?,
         Rule::time_expression => parse_timespan_expression(rule)?,

This works, and all the kql-parser tests will pass. However, if OPL had been using the same organization of its grammar rules, the parsing using the shared code would fail for OPL unless it also made the same adjustment to its grammar (not only would it fail, it would panic at scalar_primitive_expressions.rs::28).
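The failure mode can be reproduced in miniature. The sketch below is an illustration under simplifying assumptions, not code from either crate: `Node` is a hypothetical stand-in for a pest Pair, a single `Rule` enum plays the role of both grammars' rules after name-for-name conversion, and the unexpected-rule branch returns an Err where the real function panics, so the outcome is easy to check.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rule {
    ScalarUnaryExpression,
    TypeUnaryExpressions,
    NullLiteral,
    IntegerLiteral,
}

/// Minimal stand-in for a pest Pair: a rule plus its inner pairs.
pub struct Node {
    pub rule: Rule,
    pub inner: Vec<Node>,
}

fn node(rule: Rule, inner: Vec<Node>) -> Node {
    Node { rule, inner }
}

/// Shared parser code after the KQL reorganization: null_literal is now
/// handled directly under scalar_unary_expression.
pub fn parse_scalar_unary_expression(pair: &Node) -> Result<&'static str, String> {
    let child = pair.inner.first().ok_or("empty scalar_unary_expression")?;
    match child.rule {
        Rule::NullLiteral => Ok("null"),
        Rule::TypeUnaryExpressions => parse_type_unary_expressions(child),
        other => Err(format!("unexpected rule: {other:?}")),
    }
}

/// The null_literal branch was removed from this function by the same change.
pub fn parse_type_unary_expressions(pair: &Node) -> Result<&'static str, String> {
    let child = pair.inner.first().ok_or("empty type_unary_expressions")?;
    match child.rule {
        Rule::IntegerLiteral => Ok("integer"),
        // The real code panics here; an Err keeps the sketch testable.
        other => Err(format!("unexpected rule in type_unary_expressions: {other:?}")),
    }
}

/// Parse tree produced by the reorganized KQL grammar for a null literal.
pub fn kql_null_tree() -> Node {
    node(Rule::ScalarUnaryExpression, vec![node(Rule::NullLiteral, vec![])])
}

/// Parse tree produced by an OPL grammar that kept the old organization.
pub fn opl_null_tree() -> Node {
    node(
        Rule::ScalarUnaryExpression,
        vec![node(
            Rule::TypeUnaryExpressions,
            vec![node(Rule::NullLiteral, vec![])],
        )],
    )
}
```

`parse_scalar_unary_expression(&kql_null_tree())` succeeds, while the same shared function fails on `opl_null_tree()` even though every rule name still converts cleanly. Nothing in the type system flags the mismatch; only a test that exercises the OPL grammar would catch it.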

This brings me to my first point, which is that without a shared grammar either:

  • a) the organization of the rules becomes an immutable contract between kql-parser and crates that share its parser code. We'd probably consider this untenable, as it places undue restrictions on kql-parser's ability to adapt its own grammar
  • b) OPL parser needs to have its own set of parser tests to catch these issues, which means that all the test cases from kql-parser get duplicated anyway even if we share the parser code.

The second difficulty I see in sharing this parser code is that OPL may wish to make modifications to expressions relatively deep within the expression tree. For example, OPL might wish to support an untyped null literal (whereas KQL requires parsing null as string(null)), or string interpolation.

When parsing strings, for example, we wind up in a call stack like:

...
scalar_expression::parse_scalar_expression
scalar_expression::parse_scalar_unary_expression
scalar_primitive_expressions::parse_type_unary_expressions
scalar_primitive_expressions::parse_string_literal

and at the bottom of this call stack, we need to call some custom OPL-specific string parsing code. To accommodate this, parse_string_literal could become generic over a Rule type that implements some trait for string parsing. From the perspective of the kql-parser crate this adds complexity, especially as more trait methods are added for custom parsing behaviour. Note that the more custom parsing behaviour is introduced, the more complex it becomes to support and the more dubious the benefits of sharing the parser code in the first place. By the point it stops being worth it, the complexity actually makes it harder to back out.
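To make the cost concrete, a hook of this kind might look something like the sketch below. Everything here is hypothetical: the `StringLiteralRules` trait, its `decode_string` method, the marker types, and the placeholder "interpolation" behaviour are invented for illustration and do not exist in kql-parser.

```rust
/// Hypothetical hook trait: each Rule type decides how the text between the
/// quotes of a string literal is decoded.
pub trait StringLiteralRules {
    fn decode_string(raw: &str) -> String;
}

// Marker types standing in for the two parsers' Rule enums.
pub struct KqlRule;
pub struct OplRule;

impl StringLiteralRules for KqlRule {
    // KQL keeps the text as written.
    fn decode_string(raw: &str) -> String {
        raw.to_string()
    }
}

impl StringLiteralRules for OplRule {
    // Invented placeholder for OPL-specific handling such as interpolation.
    fn decode_string(raw: &str) -> String {
        raw.replace("${name}", "<name>")
    }
}

/// The shared function now carries a trait bound only so that the bottom of
/// the call stack can dispatch to language-specific behaviour.
pub fn parse_string_literal<R: StringLiteralRules>(quoted: &str) -> Result<String, String> {
    let raw = quoted
        .strip_prefix('"')
        .and_then(|s| s.strip_suffix('"'))
        .ok_or_else(|| format!("not a quoted string: {quoted}"))?;
    Ok(R::decode_string(raw))
}
```

Each further customization point would need another method or trait like this, and every function on the path from parse_scalar_expression down to the hook would need the extra bound in its signature.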

TL;DR - without a shared grammar, sharing the parser code would lead to a brittle parser implementation for OPL unless it implements its own test suite (which means half the code from kql-parser kind of gets duplicated anyway), and it would also introduce extra complexity into kql-parser to support custom behaviour for certain types. Given these drawbacks, and the benefits/desire to be masters of our own destiny, I propose OPL just implement its own parser.

Development

Successfully merging this pull request may close these issues.

KQL Parsing support different 'flavors' of language when parsing