Rewrite the parser without Jison #94

Open
RubenVerborgh opened this issue Nov 6, 2019 · 14 comments

Comments

@RubenVerborgh
Owner

It's simply too slow.

@jacoscaz
Contributor

Interesting benchmark: https://chevrotain.io/performance/

@jacoscaz
Contributor

@RubenVerborgh any reason to prefer parser generators that do not require a code generation step (e.g. Chevrotain), or would you be ok with code generation (e.g. ANTLR4)?

@rubensworks
Collaborator

Aha, I wasn't aware of this issue.
As a coincidence, our new colleague @jitsedesmet just started investigating this a couple of days ago.

@RubenVerborgh Jitse will gather your thoughts about this soon.
@jacoscaz If you would like to be involved in this effort, do let us know :-)

@RubenVerborgh
Owner Author

RubenVerborgh commented Oct 24, 2024

@jacoscaz No reason whatsoever, back in the day jison was just the quickest thing that helped me push this out.
I always wanted to hand-roll a parser (like with N3.js), so the jison code was only meant to last one summer anyway.

That summer was 10 years ago last August 😂

@jacoscaz
Contributor

jacoscaz commented Oct 24, 2024

Hello @RubenVerborgh and @rubensworks (and @jitsedesmet)!

If someone's already looking at this from your side @rubensworks it's probably a bad idea for me to get involved as (I think?) you have the advantage of geographical proximity. I'm also much less competent in SPARQL; I should be considered a last resort :)

I came back to this as, coincidentally, I've recently worked on a completely unrelated project that also required a parser; taking care of this one too would not have been that great of a context switch. We went with ANTLR4 after looking at quite a few metrics, mainly performance but also long-term maintainability, and we're really happy with the outcome.

@jitsedesmet

Hey @jacoscaz you already know my work though ;)

I started with Chevrotain, mainly for the performance, but I also like the other benefits such as expandability and fault tolerance, and those visualizations look nice too. That being said, I have not yet looked at ANTLR4 beyond the code example implemented by Chevrotain, so I will look closer tomorrow :D

My first impression of it was "oh but then you have new syntax".
(Being a few hours into writing Chevrotain code, I'm starting to suspect that maybe I do want new syntax xD)

@jacoscaz
Contributor

jacoscaz commented Oct 24, 2024

> you already know my work though ;)

That's precisely why I should be considered a last resort, your domain expertise is vastly greater than mine @jitsedesmet!

As for Chevrotain, we found it to be much faster than everything else on V8 (Node.js, Chrome and Chromium-based browsers), but we found ANTLR4 to perform much more consistently across different runtimes, even though it was roughly 1.5x slower than Chevrotain on V8 in absolute terms. However, we also noticed a high degree of specificity to the grammar we were using, with different combinations of grammar and runtime leading to significant performance differences.

If there's a subset of SPARQL that is small enough that developing the respective grammar doesn't take too long, while still being representative of the entirety of SPARQL from a complexity perspective, I would try to get that going in ANTLR4, Chevrotain and a few others and see how they fare.

EDIT: even now, running the Chevrotain benchmarks linked above in Safari on a MacBook Pro M1 has ANTLR4 almost twice as fast as Chevrotain whereas running in Chrome on the same machine sees Chevrotain hitting almost 4x the amount of ops/sec that ANTLR4 hits.
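
For comparing candidates on a small, representative subset, a throughput harness along these lines might be a starting point. This is only a sketch: `bench`, `Candidate`, and the two stand-in tokenizers are hypothetical names, not part of Chevrotain, ANTLR4, or SPARQL.js. Real candidates would be the generated or hand-written parsers for the chosen subset. Since results differ so much between engines, the same script would need to run on each runtime of interest.

```typescript
// Hypothetical harness: none of these names come from Chevrotain, ANTLR4, or SPARQL.js.
type ParseFn = (input: string) => unknown;

interface Candidate {
  name: string;
  parse: ParseFn;
}

interface BenchResult {
  name: string;
  opsPerSec: number;
}

// Stand-in "parser" 1: tokenize with a regular expression.
function regexSplit(q: string): string[] {
  return q.split(/\s+/).filter((t) => t.length > 0);
}

// Stand-in "parser" 2: tokenize with a manual character scan.
function manualScan(q: string): string[] {
  const tokens: string[] = [];
  let current = "";
  for (let i = 0; i < q.length; i++) {
    const ch = q.charAt(i);
    if (ch === " " || ch === "\n" || ch === "\t") {
      if (current.length > 0) tokens.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  if (current.length > 0) tokens.push(current);
  return tokens;
}

// Run `candidate.parse` on `input` for roughly `durationMs` and report throughput.
function bench(candidate: Candidate, input: string, durationMs: number = 200): BenchResult {
  // Warm up so the JIT has a chance to optimize before measuring.
  for (let i = 0; i < 100; i++) candidate.parse(input);

  let ops = 0;
  let elapsed = 0;
  const start = Date.now();
  do {
    // Check the clock once per batch to keep timer overhead low.
    for (let i = 0; i < 100; i++) candidate.parse(input);
    ops += 100;
    elapsed = Date.now() - start;
  } while (elapsed < durationMs);

  return { name: candidate.name, opsPerSec: (ops * 1000) / Math.max(elapsed, 1) };
}

const sampleQuery = "SELECT ?s ?p ?o WHERE { ?s ?p ?o . FILTER(?o > 5) } LIMIT 10";

const candidates: Candidate[] = [
  { name: "regexSplit", parse: regexSplit },
  { name: "manualScan", parse: manualScan },
];

const results: BenchResult[] = candidates.map((c) => bench(c, sampleQuery, 50));
for (const r of results) {
  console.log(r.name + ": " + Math.round(r.opsPerSec) + " ops/sec");
}
```

The absolute numbers mean little on their own; the interesting part is the ratio between candidates and how it shifts from one runtime to another.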

@RubenVerborgh
Owner Author

@jitsedesmet We do need to make a good cost/benefit analysis for this specific case (and our future plans with SPARQL) of using a parser generator versus hand-rolling a parser.

Both options come with different maintainability characteristics. I.e., some kinds of maintenance/evolution tasks are easier with one codebase versus another. So let's have a close look, because as you can see above, codebases tend to live longer than you'd want 😅

@RubenVerborgh
Owner Author

That said, Chevrotain seems to strike a great balance: there is always the option to conditionally (!) include certain rules, or auto-generate some, which is… powerful.
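
To illustrate what conditionally including and auto-generating rules can look like when the grammar is ordinary code, here is a minimal sketch in plain TypeScript. It deliberately does not use Chevrotain's API; `Rule`, `keywordWithInteger`, `buildModifierRules`, and the `SAMPLESIZE` keyword are all hypothetical and exist only for this example.

```typescript
// Hypothetical mini-parser; NOT Chevrotain's API, only the same idea in plain TypeScript.
// A rule tries to match at `pos` and returns the next position, or null if it does not match.
type Rule = (tokens: string[], pos: number) => number | null;

// Auto-generate a rule for a keyword that takes a single integer argument (e.g. LIMIT 10).
function keywordWithInteger(keyword: string): Rule {
  return (tokens, pos) => {
    const kw = tokens[pos];
    const arg = tokens[pos + 1];
    if (kw === undefined || arg === undefined) return null;
    if (kw.toUpperCase() !== keyword) return null;
    return /^\d+$/.test(arg) ? pos + 2 : null;
  };
}

interface GrammarOptions {
  // Hypothetical extension point: extra keywords to accept, e.g. for an experiment.
  experimentalKeywords?: string[];
}

function buildModifierRules(options: GrammarOptions = {}): Rule[] {
  const keywords = ["LIMIT", "OFFSET"];
  // Conditionally include extra rules only when the caller asks for them.
  if (options.experimentalKeywords) keywords.push(...options.experimentalKeywords);
  return keywords.map(keywordWithInteger);
}

// Returns true if the whole input is consumed by repeatedly applying the rules.
function parseModifiers(rules: Rule[], input: string): boolean {
  const tokens = input.split(/\s+/).filter((t) => t.length > 0);
  let pos = 0;
  while (pos < tokens.length) {
    let next: number | null = null;
    for (const rule of rules) {
      next = rule(tokens, pos);
      if (next !== null) break;
    }
    if (next === null) return false; // no rule matched at this position
    pos = next;
  }
  return true;
}

const standard = buildModifierRules();
const extended = buildModifierRules({ experimentalKeywords: ["SAMPLESIZE"] });

console.log(parseModifiers(standard, "LIMIT 10 OFFSET 5"));
console.log(parseModifiers(standard, "LIMIT 10 SAMPLESIZE 3"));
console.log(parseModifiers(extended, "LIMIT 10 SAMPLESIZE 3"));
```

The standard grammar rejects the made-up keyword while the extended one accepts it, without either grammar being rewritten from scratch.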

@jacoscaz
Contributor

> We do need to make a good cost/benefit analysis for this specific case (and our future plans with SPARQL) of using a parser generator versus hand-rolling a parser

Contrary to a lot of popular wisdom, I personally think that hand-rolling is (almost) always going to prove superior in the long term (assuming competent programmers, which in this case is a more than safe assumption). However, how long that long term might be is very context-specific.

Given the success of n3, which manages to deal with the complexities of supporting multiple similar formats, the fact that SPARQL doesn't change all that quickly and the fact that the current Jison implementation lasted more than most web frameworks do, hand-rolling should be considered first IMHO.

@rubensworks
Collaborator

> the fact that SPARQL doesn't change all that quickly

This assumption is likely to change after SPARQL 1.2 comes out, as the W3C WG is planning to transform into a maintenance group, which could lead to more quickly evolving SPARQL (and RDF) spec versions.
(this is one of the main reasons why @jitsedesmet is investigating this)

@RubenVerborgh
Owner Author

> the fact that SPARQL doesn't change all that quickly

> This assumption is likely to change after SPARQL 1.2 comes out,

And we're also writing this parser for research purposes; we want the liberty to quickly add keywords and constructs to test things and make proposals.

@jitsedesmet

Correct, one of the main requirements of this project is expandability, modularity, and modifiability.
Example: Comunica has some new date implementations but cannot support the new adjust function because the parser does not support it.
It would be nice to be able to swap in different parser implementations (and for those implementations not to have to be written from scratch every time).

At first I also thought "hand writing this cannot be that hard", but the fault tolerance, for example, already looks hard to me. A lib that can ease our lives in that regard will be nice.
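
As a rough illustration of why fault tolerance takes work even in a tiny hand-written parser, here is a sketch of panic-mode recovery: on a malformed statement, record an error, discard tokens up to the next `.`, and continue. The names (`parseTriples`, `ParseResult`, and so on) are hypothetical, and the whitespace tokenizer is a big simplification of real SPARQL.

```typescript
// Hypothetical sketch of panic-mode error recovery; not taken from any existing parser.
interface Triple {
  subject: string;
  predicate: string;
  object: string;
}

interface ParseError {
  tokenIndex: number; // index of the first token of the malformed statement
  message: string;
}

interface ParseResult {
  triples: Triple[];
  errors: ParseError[];
}

// A term is any token that is present and is not the statement terminator.
const isTerm = (t: string | undefined): t is string => t !== undefined && t !== ".";

// Parses whitespace-separated `subject predicate object .` statements.
function parseTriples(input: string): ParseResult {
  const tokens = input.split(/\s+/).filter((t) => t.length > 0);
  const triples: Triple[] = [];
  const errors: ParseError[] = [];
  let pos = 0;

  while (pos < tokens.length) {
    const start = pos;
    const [s, p, o, dot] = tokens.slice(pos, pos + 4);
    if (isTerm(s) && isTerm(p) && isTerm(o) && dot === ".") {
      triples.push({ subject: s, predicate: p, object: o });
      pos += 4;
      continue;
    }
    errors.push({ tokenIndex: start, message: "expected `subject predicate object .`" });
    // Panic mode: discard tokens up to and including the next "." and resume after it.
    while (pos < tokens.length && tokens[pos] !== ".") pos++;
    pos++;
  }
  return { triples, errors };
}

const recovered = parseTriples("?a :p ?b . ?c :q . ?d :r ?e .");
console.log(recovered.triples.length, recovered.errors.length);
```

Here the second statement is missing its object, yet the third one is still parsed. A full SPARQL grammar has nested groups and many more synchronization points, which is where library support for recovery starts to pay off.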

@jacoscaz
Contributor

> This assumption is likely to change after SPARQL 1.2 comes out, as the W3C WG is planning to transform into a maintenance group, which could lead to more quickly evolving SPARQL (and RDF) spec versions.

> And we're also writing this parser for research purposes; we want the liberty to quickly add keywords and constructs to test things and make proposals.

> Correct, one of the main requirements of this project is expandability, modularity, and modifiability.

Nothing like additional context to change my mind! Yeah, given the above, hand-rolling would be much harder to justify. I'm somewhat worried to hear that SPARQL might be subject to a faster pace of change, but that's a different matter entirely.

All right, I'll follow this from the sidelines but happy to exchange notes whenever!


4 participants