-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New GRPO dataset and tasks: formally-verified program correctness #379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
69b44ca
to
41af428
Compare
…nto feature/htgen-dataset
…nto feature/htgen-dataset
Hi @qgallouedec @lewtun the PR is ready. I would love your feedback on a couple things, and if you could start CI to let me fix any outstanding errors. Thank you! |
Hi @lewtun @qgallouedec ! I've added a second task and more extensive information in the prompt, as well as its respective reward and unit tests. Tests and linting are green. Please let me know what else can I do to land this PR :) Thank you |
Closing this as there is no interest from the maintainers. |
@ocramz no need to delete it tho :( |
A new dataset and family of tasks to verify programs and generate correct programs according to a specification [1][2]. Verification is in the "weakest preconditions" deductive sense, and backed by a SMT solver.
The purpose of these tasks is to measure code understanding capabilities in LLMs as a function of program complexity and size.
TOTALITY_CHECK
andFIX_TRIPLE
tasksOutstanding questions
❓ : where do we need to construct the
<think>..</think><answer>..</answer>
template?❓ : what is missing for integrating this into open-r1 training? (apart from the GRPOTrainer, that is)
Tasks:
TOTALITY_CHECK
: Given a program triple, the model should judge whether it is correct (= satisfiable) or not.FIX_TRIPLE
: Given a program triple with a given label, the model should fix the program to satisfy pre- and postcondition.ℹ️ Example prompts:
TOTALITY_CHECK
task:Below you are given a Python program triple, made of a precondition predicate, a sequence of program statements, and a postcondition predicate. The precondition returns True if the variable environment before beginning the program execution satisfies the predicate, and False otherwise. Similarly, the postcondition returns True if the program environment after the last statement satisfies the predicate, and False otherwise. We say that a triple is correct if, whenever the precondition holds for a given variable assignment, executing the program will produce a variable assignment that satisfies the postcondition. Note that there might be unsatisfiable or contradictory predicates such as 'v1 < v1' or 'v3 > 5 + v3' that make the solution False by definition. You should judge whether the program is 'total', i.e. whether the post-condition evaluates to True for all possible variable assigments that satisfy the precondition. Given a program triple made of program ```v3 = v5 v4 = (4 - (4 - (v5 - 4))) v5 = v4 v4 = (v5 - v3) v3 = 4```, precondition ```v3 > (2 + v4)``` and postcondition ```v3 < 1```, is the postcondition always True at the end of the program ? Please only return 'True' or 'False'.
FIX_TRIPLE
task:Below you are given a Python program triple, made of a precondition predicate, a sequence of program statements, and a postcondition predicate. The precondition returns True if the variable environment before beginning the program execution satisfies the predicate, and False otherwise. Similarly, the postcondition returns True if the program environment after the last statement satisfies the predicate, and False otherwise. We say that a triple is correct if, whenever the precondition holds for a given variable assignment, executing the program will produce a variable assignment that satisfies the postcondition. Note that there might be unsatisfiable or contradictory predicates such as 'v1 < v1' or 'v3 > 5 + v3' that make the solution False by definition. Given a program triple made of program ```v5 = 2\nv3 = v5\nv4 = ((5 + (3 + v3)) + (v4 + v5))\nv4 = 9\nv4 = (v3 - 7)```, precondition ```v3 > v4``` and postcondition ```v5 > 6```, you should modify the program such that the resulting triple is total. Currently, the program triple fails 1 verification condition: if the program starts in state v0 = -42, v1 = -98, v2 = -43, v3 = 34, v4 = 31, v5 = -80, the final environment v5 = -80, v4 = 31, v3 = 34 does not satisfy the postcondition. With this information, the correct program that satisfies the given precondition and postcondition is:
🎁 UnfoldML provides two API endpoints :
ℹ️ The API endpoints are parametric, so it's possible to train the model on small ASTs and test on larger ones, or longer programs, etc. to do small-to-large generalization trials.
v0
,v1
etc.[1] https://en.wikipedia.org/wiki/Correctness_(computer_science)
[2] https://en.wikipedia.org/wiki/Hoare_logic
cc @Muhtasham @vumichien and @lewtun @qgallouedec ^^