Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Allow nested interpolated strings in lexer #4055

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

max-sixty
Copy link
Member

An 80% complete attempt to allow having an interpolated string contain expressions — #3279

Supersedes #3280

Currently it successfully lexes s"{hello + 5}" into a nested expression of tokens.

But I'm not sure how to then parse those nested tokens — chumsky possibly doesn't support this sort of thing, or only a very basic version with https://docs.rs/chumsky/latest/chumsky/stream/struct.Stream.html#method.from_nested. I tried a lot of things and couldn't make progress (this was much harder than I thought it would be...)

It also required refactoring the existing lexer into a function that outputs individual tokens. It uses some rewind parsers to know when we've reached the end of the parent.

Any ideas for making this work? Have I overcomplicated it?


FYI most of the existing tests pass, because I tried to have the proposed interpolation go back to its string form. There are some corner case failures because of the way that spans work.

@max-sixty max-sixty requested a review from aljazerzen January 6, 2024 19:14
@max-sixty
Copy link
Member Author

max-sixty commented Jan 6, 2024

Possibly another approach is to instead produce InterpolationStart and InterpolationEnd tokens, and then avoid any token nesting from the lexer.

...which would be how we currently produce parentheses (but not how we produce strings — those return a Literal::String, rather than StringStart and StringEnd)

@aljazerzen
Copy link
Member

I tried this implementation strategy when rewriting from Pest to chumsky, but my bottom line was the same: the parser expects a flat stream of tokens and not a tree.

This PR is a good effort, but I don't know how to continue from here. The from_nested might work, but even if it does, having nested token streams is quite uncommon.

I like your idea of InterpolationStart and InterpolationEnd much more!

So the following:

a = f"hello {"Mr. " + name}, I'm {age} years old"

... would be lexed as:

Ident: "a"
Ctrl: =
StringQualifier: "f"
String: "hello "
InterpolationStart  <--- this means {
String: "Mr. "
Ctrl: +
Ident: "name"
InterpolationEnd
String: ", I'm "
InterpolationStart
Ident: age
InterpolationEnd
String: " years old"
StringQualifier: ""

... or maybe like this:

Ident: "a"
Ctrl: =
InterpolationStart: "f"  <--- this means f"
InterpolationString: "hello "
String: "Mr. "
Ctrl: +
Ident: "name"
InterpolationString: ", I'm "
Ident: age
InterpolationString: " years old"
InterpolationEnd

// TODO: decide how we want to handle colons in interpolated expressions
// We use rewinds to look ahead and ensure we don't have a closing
// bracket (or colon), before forwarding that to the lexer.
filter(|c| *c != '}' && *c != ':')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you aware that this prevents interpolations from containing }?

Following is not supported:

f"result: { my_func_that_takes_a_tuple {a=1} }"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I was aware of the issue with colons at least — python also has this problem but doesn't use colons so much.

With closing brackets, that's indeed a more restrictive case, and I agree with your broader point about it being confusing if we don't allow that, given how prominently brackets are used.

One difficulty is that at the lexing stage we don't know about matching opening & closing brackets, so we can't do the thing of "it's the end of the interpolated string if it's the final closing bracket".

We could require escape characters, which seems reasonable if a bit ugly.

prqlc/prqlc-parser/src/lexer.rs Outdated Show resolved Hide resolved
),
[],
)
"###);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for the following:

s"LOWER_CASE( {"contents"} )"

and:

s"LOWERCASE( { f"Hello, {"Aljaz"} is my name." } )"

@vanillajonathan
Copy link
Collaborator

Meh, are nested string interpolation even something that is useful?
Sounds like something that needlessly complicates the parser and that nobody would ever actually use or care about.

@max-sixty
Copy link
Member Author

Meh, are nested string interpolation even something that is useful?
Sounds like something that needlessly complicates the parser and that nobody would ever actually use or care about.

This is a good question!

It depends what form of nesting you're asking about. From the linked issue, this seems helpful, and arguably confusing that we don't allow it:

from invoices
derive x = s"{foo - bar} + 5"  # currently fails

OTOH, having s & f strings inside other s & f strings I agree isn't that helpful.

@aljazerzen
Copy link
Member

OTOH, having s & f strings inside other s & f strings I agree isn't that helpful.

I do agree, although there are cases where it would be useful:

...
f"my items are { std.array.join this.items "," }"

I worry about language consistency around interpolation:

  • if we support only idents, this is not ergonomic, but a reasonable limitation,
  • if we support all expressions, this is very ergonomic, but hard to pull off,
  • if we support all expressions except string literals and tuples, it gets 90% of ergonomics, but is also quite arbitrary rule that is hard to remember.

I'm not against this incremental improvement, but I would like to have a long-term plan on what we plan on supporting.

@aljazerzen
Copy link
Member

Ok, I'm impressed with Pablo's work - Python is able to parse this:

>>> f"My dict: { {'a': 1} }"
"My dict: {'a': 1}"

@aljazerzen
Copy link
Member

I've thought this was hard, not I think this is harder. :D

@max-sixty
Copy link
Member Author

Python is able to parse this:

But not this!

f"{lambda x: x+2}"

max-sixty added a commit to max-sixty/prql that referenced this pull request Jan 16, 2024
@max-sixty max-sixty force-pushed the interpolate-lexer branch from bce69f9 to b649696 Compare March 6, 2024 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants