Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing Fuzzer Fixes #7672

Merged
merged 43 commits into from
Mar 11, 2025
Merged

Parsing Fuzzer Fixes #7672

merged 43 commits into from
Mar 11, 2025

Conversation

lukewilliamboswell
Copy link
Collaborator

@lukewilliamboswell lukewilliamboswell commented Mar 7, 2025

This PR

  • adds tokenise and parser diagnostics to ModuleEnv as Problems
  • fix out of bound issue in parse.zig from line numbers
  • fix memory leak by free'ing parser errors in deinit
  • adds eleven (11) new Diagnostic errors for the Parser
  • removes panics and handle malformed parsed AST nodes
  • implements a significant number of SExpr nodes for the Parser AST
  • improves reliability for handling newlines in Parser
  • upgrades the snapshot tool with new PROBLEMS, TOKENS, and FORMATTED sections
  • adds eleven (11) new snapshots, discovered from fuzzing the parser

@@ -23,6 +23,7 @@ errors: []const Diagnostic,
pub fn deinit(self: *IR) void {
defer self.tokens.deinit();
defer self.store.deinit();
self.store.gpa.free(self.errors);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixed our memory leak... not sure it's 100% correct

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably correct for now, but might be wrong in the long term. Long term, we may want errors to outlive the rest of parse. So just like tokenize has finishAndDeinit, we may need something similar that returns the errors for later use but frees everything else. Not a now problem, but just context.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, now you mention it... the errors should be pushed into the ModuleEnv as problems, not in the IR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd rather not do this in deinit, errors should be owned by the owner of the parse result

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I leave this for you @gamebox to change in a follow up? I've left it here as it keeps the fuzzer happy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the PR moved this into ModuleEnv?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the more proper short term fix would be to pull this out of deinit and instead add it next to all current callsites of parse_ast.deint(). That gives the caller control of when to free the error list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to leave this for a follow up, where we focus on unifying the approach to handling Problems across the compiler.

@lukewilliamboswell lukewilliamboswell changed the title Fix malformed node in header Parsing Fuzzer Fixes Mar 9, 2025
@@ -23,6 +23,7 @@ errors: []const Diagnostic,
pub fn deinit(self: *IR) void {
defer self.tokens.deinit();
defer self.store.deinit();
self.store.gpa.free(self.errors);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd rather not do this in deinit, errors should be owned by the owner of the parse result

@@ -596,6 +608,11 @@ pub const NodeStore = struct {
node.data.lhs = mod.exposes.span.start;
node.data.rhs = mod.exposes.span.len;
},
.malformed => |a| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I just said this morning that I need to do exactly this. Looks great for the most part.

Having region will help with this quite a bit, will be adding that to node soon

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, region will mean we can provide much nicer formatted error messages

@@ -117,7 +117,9 @@ pub fn pushDiagnostic(self: *Parser, tag: IR.Diagnostic.Tag, region: IR.Region)
/// add a malformed token
pub fn pushMalformed(self: *Parser, comptime t: type, tag: IR.Diagnostic.Tag) t {
const pos = self.pos;
self.advanceOne(); // TODO: find a better point to advance to
if (self.peek() != .EndOfFile) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want pushMalformed to include a advance_to Token.Tag so that it takes care of this (it will also need a start position in order to create a real Region)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. 👍

@lukewilliamboswell
Copy link
Collaborator Author

lukewilliamboswell commented Mar 10, 2025

(removed)

@lukewilliamboswell lukewilliamboswell changed the title Parsing Fuzzer Fixes [WIP] Parsing Fuzzer Fixes Mar 10, 2025
Comment on lines +248 to +252
// TODO -- this is a hack to avoid ambiguity with no arguments,
// if we parse it again without the space it will be parsed as
// a logical OR `||` instead
//
// desired behaviour described here https://roc.zulipchat.com/#narrow/channel/395097-compiler-development/topic/zig.20compiler.20-.20spike/near/504453049
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this for someone to do in a follow up PR, for now we're keeping the fuzzer happy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sounds like the correct fix is minor and in the tokenizer. Just remove || and && parsing. Then it is not ambiguous.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've merged the changes into the tokeniser. My preference is to revert this in a follow up PR and not add more to this.

@lukewilliamboswell lukewilliamboswell changed the title [WIP] Parsing Fuzzer Fixes Parsing Fuzzer Fixes Mar 11, 2025
Copy link
Member

@bhansconnect bhansconnect left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave a quick review. Leaving final review to @gamebox or @joshuawarner32

Comment on lines +122 to +124
"obj"
else
"o";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is in the old compiler and probably wrong?

I would guess that zig 0.13.0 filters this differently than zig 0.14.0

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it was zig 0.14.0 that made this change. But old-workflows have ran this over and been happy so I left it.

@@ -23,6 +23,7 @@ errors: []const Diagnostic,
pub fn deinit(self: *IR) void {
defer self.tokens.deinit();
defer self.store.deinit();
self.store.gpa.free(self.errors);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the PR moved this into ModuleEnv?

@@ -23,6 +23,7 @@ errors: []const Diagnostic,
pub fn deinit(self: *IR) void {
defer self.tokens.deinit();
defer self.store.deinit();
self.store.gpa.free(self.errors);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the more proper short term fix would be to pull this out of deinit and instead add it next to all current callsites of parse_ast.deint(). That gives the caller control of when to free the error list.

Comment on lines +248 to +252
// TODO -- this is a hack to avoid ambiguity with no arguments,
// if we parse it again without the space it will be parsed as
// a logical OR `||` instead
//
// desired behaviour described here https://roc.zulipchat.com/#narrow/channel/395097-compiler-development/topic/zig.20compiler.20-.20spike/near/504453049
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sounds like the correct fix is minor and in the tokenizer. Just remove || and && parsing. Then it is not ambiguous.

(malformed_expr "unexpected_token")
(malformed_expr "unexpected_token")
(malformed_expr "unexpected_token")
(ident "" "le")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect here something like:

(statement
    (expr
        (ident "" "le")))

Copy link
Member

@bhansconnect bhansconnect left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely think there are things to deal with in follow up, but I think it is fine to merge as is.

Comment on lines +25 to 27
for (result.messages) |msg| {
_ = env.problems.append(env.gpa, .{ .tokenize = msg });
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify here, the tokenizer and parser are building a local list of errors and we are later adding them to the global list?

I would expect that the tokenzer and parser just add directly to the global list. Any specific reason for forcing the caller to move everything into the global list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just that the Parser was built independently of the ModuleEnv design, and we haven't yet converged on a final form. I think we should we will just add to the module environment, which helps with reporting and various other things that cut across the whole compiler. But that change can happen in a future PR.

Comment on lines +866 to +868
// disabled because it was hit by a fuzz test
// for a repro see src/snapshots/fuzz_crash_012.txt
// std.debug.assert(a.patterns.span.len > 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but we should think about ways to leave in asserts while still documenting fuzzer failures.

Cause I don't think the correct answer to a fuzzer failure on an assert is to disable the assert. Like I don't want to make this a general practice. I think it is a bad precedence.

Comment on lines 539 to 541
} else {
// If not a decl
const expr = self.parseExpr();
const statement_idx = self.store.addStatement(.{ .expr = .{
.expr = expr,
.region = .{ .start = start, .end = start },
} });
if (self.peek() == .Newline) {
self.advance();
}
return statement_idx;
// continue to parse final expression
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove the empty else? Maybe put a comment before the if for clarity if it is needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, make parsing the final expression a function and return parseFinalExpr(...) in each else here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought including the else and a comment was more explicit that it was falling through to the last part.

Happy to look at this in a follow up PR. I was removing some duplicate code here, but it's might be better to revert my change and leave it as it was.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be fair, might have just looked odd due to github pr review diff. I would lead to pulling it out into a function.

Ah yeah, looking at the raw source, this looks good. I think the diff view just made it look really spread out and unrelated.

That said, I think I would still prefer the final expression be a separate function. So it would be else { return parseFinalExpr(...); } same with the else => {...} branch of the switch. Then the switch will truly handle all cases with no fallthrough. Makes sure that someone doesn't accidentally fall through to the final expression parsing when they didn't mean to.

Comment on lines +559 to 561
} else {
// continue to parse final expression
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@lukewilliamboswell lukewilliamboswell merged commit 13c5152 into main Mar 11, 2025
30 of 32 checks passed
@lukewilliamboswell lukewilliamboswell deleted the fuzz-fixes branch March 11, 2025 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants