New GRPO dataset and tasks: formally-verified program correctness #379

ocramz · 2025-02-20T04:26:27Z

A new dataset and family of tasks for verifying programs and generating correct programs according to a specification [1][2]. Verification is in the deductive "weakest preconditions" sense and is backed by an SMT solver.
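For reference, the textbook weakest-precondition rules for straight-line programs (see [2]) are summarized below. This is general background, not a description of how the verifier behind this PR is implemented; the symbols $P$, $Q$, $S$, $x$ and $e$ are the usual precondition, postcondition, statement, variable and expression.

```latex
\begin{align*}
\mathrm{wp}(x := e,\; Q) &= Q[e/x] \\
\mathrm{wp}(S_1; S_2,\; Q) &= \mathrm{wp}\bigl(S_1,\; \mathrm{wp}(S_2,\; Q)\bigr) \\
\{P\}\; S\; \{Q\} \text{ is valid} &\iff P \Rightarrow \mathrm{wp}(S,\; Q)
\end{align*}
```

A solver can typically decide the last line by checking whether $P \land \lnot\,\mathrm{wp}(S, Q)$ is satisfiable: a satisfying assignment is a counterexample, i.e. an initial state that meets the precondition but ends in a state violating the postcondition.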

The purpose of these tasks is to measure code understanding capabilities in LLMs as a function of program complexity and size.

✅ add API bindings for generating prompts and scoring model completions
✅ add reward functions for the TOTALITY_CHECK and FIX_TRIPLE tasks
✅ add unit/integration tests for the new API bindings and reward functions
✅ small change to Makefile to install uv and dev dependencies
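To make the reward item concrete, here is a minimal sketch of what a TOTALITY_CHECK reward could look like. It is not the code in this PR: the names `parse_bool_answer` and `totality_check_reward`, the plain-string completion format, and the 0/1 reward scale are all assumptions chosen for illustration.

```python
import re

# Hypothetical helper names; nothing here is taken from the PR's actual code.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_bool_answer(completion):
    """Return True/False if the completion states a single boolean, else None.

    If an <answer>..</answer> block is present, only its content is parsed;
    otherwise the whole completion is treated as the answer.
    """
    match = ANSWER_RE.search(completion)
    text = match.group(1) if match else completion
    text = text.strip().strip("'\"").lower()
    if text == "true":
        return True
    if text == "false":
        return False
    return None


def totality_check_reward(completions, labels):
    """Assumed reward scale: 1.0 for a parseable, correct answer, 0.0 otherwise."""
    rewards = []
    for completion, label in zip(completions, labels):
        predicted = parse_bool_answer(completion)
        rewards.append(1.0 if predicted is not None and predicted == label else 0.0)
    return rewards
```

Unparseable completions get the same reward as wrong ones in this sketch; a real reward might want to distinguish the two cases.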

Outstanding questions
❓ : where do we need to construct the <think>..</think><answer>..</answer> template?
❓ : what is missing for integrating this into open-r1 training? (apart from the GRPOTrainer, that is)

Tasks:

TOTALITY_CHECK : Given a program triple, the model should judge whether it is correct (= satisfiable) or not.
FIX_TRIPLE : Given a program triple with a given label, the model should fix the program to satisfy pre- and postcondition.

ℹ️ Example prompts:

TOTALITY_CHECK task:
Below you are given a Python program triple, made of a precondition predicate, a sequence of program statements, and a postcondition predicate. The precondition returns True if the variable environment before beginning the program execution satisfies the predicate, and False otherwise. Similarly, the postcondition returns True if the program environment after the last statement satisfies the predicate, and False otherwise. We say that a triple is correct if, whenever the precondition holds for a given variable assignment, executing the program will produce a variable assignment that satisfies the postcondition. Note that there might be unsatisfiable or contradictory predicates such as 'v1 < v1' or 'v3 > 5 + v3' that make the solution False by definition. You should judge whether the program is 'total', i.e. whether the post-condition evaluates to True for all possible variable assigments that satisfy the precondition. Given a program triple made of program ```v3 = v5 v4 = (4 - (4 - (v5 - 4))) v5 = v4 v4 = (v5 - v3) v3 = 4```, precondition ```v3 > (2 + v4)``` and postcondition ```v3 < 1```, is the postcondition always True at the end of the program ? Please only return 'True' or 'False'.
FIX_TRIPLE task:
Below you are given a Python program triple, made of a precondition predicate, a sequence of program statements, and a postcondition predicate. The precondition returns True if the variable environment before beginning the program execution satisfies the predicate, and False otherwise. Similarly, the postcondition returns True if the program environment after the last statement satisfies the predicate, and False otherwise. We say that a triple is correct if, whenever the precondition holds for a given variable assignment, executing the program will produce a variable assignment that satisfies the postcondition. Note that there might be unsatisfiable or contradictory predicates such as 'v1 < v1' or 'v3 > 5 + v3' that make the solution False by definition. Given a program triple made of program ```v5 = 2\nv3 = v5\nv4 = ((5 + (3 + v3)) + (v4 + v5))\nv4 = 9\nv4 = (v3 - 7)```, precondition ```v3 > v4``` and postcondition ```v5 > 6```, you should modify the program such that the resulting triple is total. Currently, the program triple fails 1 verification condition: if the program starts in state v0 = -42, v1 = -98, v2 = -43, v3 = 34, v4 = 31, v5 = -80, the final environment v5 = -80, v4 = 31, v3 = 34 does not satisfy the postcondition. With this information, the correct program that satisfies the given precondition and postcondition is:
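As a sanity check on the TOTALITY_CHECK example above, the triple can be tested by brute force over a small grid of initial states. This is a bounded search, not the SMT-backed verification used by the API: finding a counterexample shows the triple is not total, but finding none proves nothing. The function name `find_counterexample` and the value range are assumptions for illustration, and `exec`/`eval` are only acceptable here because the input is a trusted toy program.

```python
from itertools import product

# Triple copied from the TOTALITY_CHECK example prompt.
PROGRAM = """
v3 = v5
v4 = (4 - (4 - (v5 - 4)))
v5 = v4
v4 = (v5 - v3)
v3 = 4
"""
PRECONDITION = "v3 > (2 + v4)"
POSTCONDITION = "v3 < 1"
VARIABLES = ("v3", "v4", "v5")


def find_counterexample(program, pre, post, variables, values=range(-3, 4)):
    """Hypothetical helper: search a small grid of initial states for one that
    satisfies `pre` but ends in a state violating `post`.

    Returns the offending initial state, or None if the grid contains none.
    """
    for combo in product(values, repeat=len(variables)):
        initial = dict(zip(variables, combo))
        # Skip states that do not satisfy the precondition.
        if not eval(pre, {"__builtins__": {}}, dict(initial)):
            continue
        env = dict(initial)
        exec(program, {"__builtins__": {}}, env)  # run the straight-line program
        if not eval(post, {"__builtins__": {}}, env):
            return initial
    return None


counterexample = find_counterexample(PROGRAM, PRECONDITION, POSTCONDITION, VARIABLES)
print(counterexample)
```

Because the last statement sets `v3 = 4`, the postcondition `v3 < 1` fails for every initial state, so any state that satisfies the precondition is a counterexample and the expected answer to the prompt is 'False'.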

🎁 UnfoldML provides two API endpoints :

program triple data generation : random program generation in a toy subset of Python, together with preconditions and postconditions. Additionally, the API endpoint returns the execution "trace" (how the variable environment is mutated after each statement).
program triple verification : parse model completions and return verification answer.

ℹ️ The API endpoints are parametric, so it's possible to train the model on small ASTs and test on larger ones, or longer programs, etc. to do small-to-large generalization trials.
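The following local toy generator illustrates what "parametric" generation and an execution trace could look like. It is not the UnfoldML API and does not reproduce its parameters: `random_program`, `random_expr`, `trace`, and the knobs `n_statements`, `n_vars` and `max_depth` are hypothetical names, and the operator set is an assumption.

```python
import random

OPS = ("+", "-")  # assumption: the real generator may use other operators


def random_expr(rng, n_vars, depth):
    """Build a random arithmetic expression of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.5:
            return f"v{rng.randrange(n_vars)}"
        return str(rng.randint(0, 9))
    left = random_expr(rng, n_vars, depth - 1)
    right = random_expr(rng, n_vars, depth - 1)
    return f"({left} {rng.choice(OPS)} {right})"


def random_program(seed, n_statements, n_vars=3, max_depth=2):
    """Return a list of assignment statements over v0..v{n_vars-1}."""
    rng = random.Random(seed)
    statements = []
    for _ in range(n_statements):
        target = f"v{rng.randrange(n_vars)}"
        statements.append(f"{target} = {random_expr(rng, n_vars, max_depth)}")
    return statements


def trace(statements, initial_env):
    """Execute statements one by one and record the environment after each."""
    env = dict(initial_env)
    states = []
    for statement in statements:
        exec(statement, {"__builtins__": {}}, env)  # trusted toy input only
        states.append(dict(env))
    return states


# Small programs for training, larger ones for a generalization test.
small = random_program(seed=0, n_statements=3, max_depth=1)
large = random_program(seed=0, n_statements=12, max_depth=4)
states = trace(small, {"v0": 1, "v1": 2, "v2": 3})
```

Varying `n_statements` and `max_depth` between the training and evaluation splits is one way to set up the small-to-large trials mentioned above.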

⚠️ Limitations: The dataset uses a very restricted fragment of Python:

Only straight-line programs and no class declarations, decorators, datatypes etc.
Identifiers only denoted by v0, v1 etc.
Only arithmetic operations on the RHS of program statements
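The restrictions above can be sketched as a syntactic check using Python's standard `ast` module, for instance to reject a FIX_TRIPLE completion that leaves the fragment before it is sent for verification. This is an illustration only: `in_fragment` and `is_arith_expr` are hypothetical names, and the exact set of allowed operators is an assumption, since the PR only says "arithmetic operations".

```python
import ast
import re

IDENT = re.compile(r"v\d+")
ALLOWED_OPS = (ast.Add, ast.Sub, ast.Mult)  # assumption: exact operator set unknown


def is_arith_expr(node):
    """True if the node is built only from v<N> names, integers and arithmetic."""
    if isinstance(node, ast.Name):
        return IDENT.fullmatch(node.id) is not None
    if isinstance(node, ast.Constant):
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(node.value, int) and not isinstance(node.value, bool)
    if isinstance(node, ast.UnaryOp):
        return isinstance(node.op, ast.USub) and is_arith_expr(node.operand)
    if isinstance(node, ast.BinOp):
        return (
            isinstance(node.op, ALLOWED_OPS)
            and is_arith_expr(node.left)
            and is_arith_expr(node.right)
        )
    return False


def in_fragment(source):
    """True if the source is a straight-line sequence of `v<N> = <arith expr>`."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for statement in tree.body:
        if not isinstance(statement, ast.Assign) or len(statement.targets) != 1:
            return False
        target = statement.targets[0]
        if not (isinstance(target, ast.Name) and IDENT.fullmatch(target.id)):
            return False
        if not is_arith_expr(statement.value):
            return False
    return True
```

For example, `in_fragment("v3 = v5\nv4 = (4 - (4 - (v5 - 4)))")` accepts the first two statements of the TOTALITY_CHECK example, while anything containing a call, a branch, or a differently named identifier is rejected.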

[1] https://en.wikipedia.org/wiki/Correctness_(computer_science)
[2] https://en.wikipedia.org/wiki/Hoare_logic

cc @Muhtasham @vumichien and @lewtun @qgallouedec ^^

ocramz · 2025-02-25T05:02:24Z

Hi @qgallouedec @lewtun the PR is ready. I would love your feedback on a couple things, and if you could start CI to let me fix any outstanding errors. Thank you!

ocramz · 2025-03-02T16:43:55Z

Hi @lewtun @qgallouedec ! I've added a second task and more extensive information in the prompt, as well as its respective reward and unit tests. Tests and linting are green.

Please let me know what else I can do to land this PR :) Thank you

ocramz · 2025-05-05T12:45:52Z

Closing this as there is no interest from the maintainers.

Kreijstal · 2025-05-31T17:27:21Z

@ocramz no need to delete it tho :(

Marco Zocca added 2 commits February 20, 2025 04:57

wip adding HTGen dataset and benchmark

591f3c1

add API test

69b44ca

ocramz mentioned this pull request Feb 20, 2025

Datasets for code #28

Open

ocramz changed the title ~~WIP Feature/htgen dataset~~ WIP Feature/htgen dataset and task Feb 20, 2025

ocramz changed the title ~~WIP Feature/htgen dataset and task~~ WIP new GRPO dataset and task: formally-verified program correctness Feb 20, 2025

Marco Zocca added 2 commits February 21, 2025 08:12

wip adding HTGen dataset and benchmark

b3a9587

add API test

41af428

ocramz force-pushed the feature/htgen-dataset branch from 69b44ca to 41af428 Compare February 21, 2025 07:12

Marco Zocca and others added 6 commits February 22, 2025 07:56

construct prompt in the dataset generator

58cbde4

Merge branch 'main' into feature/htgen-dataset

aa3523d

merge from upstream

cde339c

prompt construction

2d225e9

fix some typos and add more docstrings

d28e8f7

add reward

40fdef8

Muhtasham approved these changes Feb 22, 2025

ocramz and others added 6 commits February 23, 2025 04:52

Merge branch 'main' into feature/htgen-dataset

4341395

fix typos

3eae18c

Merge branch 'feature/htgen-dataset' of github.com:unfoldml/open-r1 i…

ce8d1b1

…nto feature/htgen-dataset

Merge branch 'main' into feature/htgen-dataset

5e2ea33

Merge branch 'feature/htgen-dataset' of github.com:unfoldml/open-r1 i…

8535700

…nto feature/htgen-dataset

add unit test for code rewards

410b4f9

ocramz changed the title ~~WIP new GRPO dataset and task: formally-verified program correctness~~ New GRPO dataset and task: formally-verified program correctness Feb 25, 2025

ocramz and others added 7 commits February 25, 2025 17:23

Merge branch 'main' into feature/htgen-dataset

465ae8c

Merge branch 'main' into feature/htgen-dataset

fbd20c7

Merge branch 'main' into feature/htgen-dataset

c567d55

docstring

d63b4a2

Merge branch 'main' into feature/htgen-dataset

0c9732c

fix makefile and tests

c84f645

fix code rewards test

ff0db1e

ocramz added 5 commits March 2, 2025 14:56

add prompt and fix_triple reward

1793479

fix makefile to activate venv correctly

3303083

fix_triple task: add reward tests and docstrings

532a012

readme

66969e8

fix style and quality

3f88b06

ocramz changed the title ~~New GRPO dataset and task: formally-verified program correctness~~ New GRPO dataset and tasks: formally-verified program correctness Mar 2, 2025

ocramz and others added 11 commits March 2, 2025 19:24

cannot reliably activate venv within makefile

604f66f

ignore API json parsing errors

ed0c484

cleanup and docstrings

6e4298c

add test for verify v2 endpoint

aa97e62

Merge branch 'main' into feature/htgen-dataset

334b4b0

Merge branch 'main' into feature/htgen-dataset

3bd8689

Merge branch 'main' into feature/htgen-dataset

40736bd

Merge branch 'main' into feature/htgen-dataset

456543a

Merge branch 'main' into feature/htgen-dataset

bbf700a

Merge branch 'main' into feature/htgen-dataset

f3f2166

Merge branch 'main' into feature/htgen-dataset

cde9a89

ocramz closed this May 5, 2025

ocramz deleted the feature/htgen-dataset branch May 5, 2025 12:46
