Add plan diversity and update rating prompt in CePO #175

Merged (1 commit) on Mar 19, 2025

25 changes: 15 additions & 10 deletions README.md
@@ -270,7 +270,7 @@ optillm supports various command-line arguments for configuration. When using Do
| `--return-full-response` | Return the full response including the CoT with <thinking> tags | `False` |
| `--port` | Specify the port to run the proxy | 8000 |
| `--optillm-api-key` | Optional API key for client authentication to optillm | `""` |
-| `--cepo_*` | See CePO Parameters section below for detailed configuration options | Various |
+| `--cepo_*` | See CePO Parameters section below for detailed config options | Various |

<details>
<summary><strong>CePO Parameters</strong></summary>
@@ -292,7 +292,9 @@ optillm supports various command-line arguments for configuration. When using Do
| `--cepo_planning_max_tokens_step3` | Maximum number of tokens in step 3 of planning stage | 4096 |
| `--cepo_planning_max_tokens_step4` | Maximum number of tokens in step 4 of planning stage | 4096 |
| `--cepo_print_output` | Whether to print the output of each stage | `False` |
-| `--cepo_config_file` | Path to CePO configuration file | None |
+| `--cepo_config_file` | Path to CePO configuration file | `None` |
+| `--cepo_use_plan_diversity` | Use additional plan diversity step | `False` |
+| `--cepo_rating_model` | Specify a model for rating step if different than for completion | `None` |

</details>
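
The two new rows could be wired into a command-line parser roughly as follows. This is a minimal sketch using `argparse`, not optillm's actual parser: the `CepoOptions` dataclass and the `parse_cepo_args` and `str_to_bool` helpers are hypothetical names, and only the flag names and defaults are taken from the table above.

```python
import argparse
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CepoOptions:
    """Hypothetical container for the two options added in this change."""
    use_plan_diversity: bool = False
    rating_model: Optional[str] = None


def str_to_bool(value: str) -> bool:
    """Parse common spellings of true/false from the command line."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_cepo_args(argv: List[str]) -> CepoOptions:
    """Read the two new flags; defaults match the table (False and None)."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--cepo_use_plan_diversity", type=str_to_bool, default=False)
    parser.add_argument("--cepo_rating_model", type=str, default=None)
    # Flags this sketch does not model (for example --port) are ignored.
    known, _unknown = parser.parse_known_args(argv)
    return CepoOptions(
        use_plan_diversity=known.cepo_use_plan_diversity,
        rating_model=known.cepo_rating_model,
    )
```

Leaving `--cepo_rating_model` at `None` is read here as "rate with the same model used for completion", which matches the wording of the table row.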

@@ -341,14 +343,17 @@ Authorization: Bearer your_secret_api_key

## SOTA results on benchmarks with optillm

-### CePO on math and code benchmarks (Jan 2025)
-
-| Method | Math-L5 | MMLU-Pro (Math) | GPQA | CRUX | LiveCodeBench (pass@1) | Simple QA |
-| -------------------------: | :-----: | :-------------: | :--: | :--: | :--------------------: | :-------: |
-| Llama 3.1 70B | 41.6 | 72.9 | 41.7 | 64.2 | 24.5 | 14.7 |
-| Llama 3.3 70B | 51.0 | 78.6 | 49.1 | 72.6 | 27.1 | 20.9 |
-| Llama 3.1 405B | 49.8 | 79.2 | 50.7 | 73.0 | 31.8 | 13.5 |
-| CePO (using Llama 3.3 70B) | 69.6 | 84.8 | 55.5 | 80.1 | 31.9 | 22.6 |
+### CePO on math and code benchmarks (Mar 2025)
+
+| Method | Math-L5 | MMLU-Pro (Math) | CRUX | LiveCodeBench (pass@1) | Simple QA |
+| -----------------------------: | :-----: | :-------------: | :----: | :--------------------: | :-------: |
+| Llama 3.3 70B | 51.0 | 78.6 | 72.6 | 27.1 | 20.9 |
+| Llama 3.1 405B | 49.8 | 79.2 | 73.0 | 31.8 | 13.5 |
+| CePO (using Llama 3.3 70B) | 69.6 | 84.8 | 80.1 | 31.9 | **22.6** |
+| QwQ 32B | 61.4 | 90.8 | 82.5 | 44.3 | 7.8 |
+| CePO (using QwQ 32B) | 88.1 | **92.0** | 86.3 | **51.5** | 8.2 |
+| DeepSeek R1 Llama | 83.1 | 82.0 | 84.0 | 47.3 | 14.6 |
+| CePO (using DeepSeek R1 Llama) |**90.2** | 84.0 |**89.4**| 47.2 | 15.5 |
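
To read the Mar 2025 table as gains over each base model, the per-benchmark differences can be computed directly from its rows. The snippet below is a small illustration; `BENCHMARKS`, `scores`, and `gain` are names introduced here, and the numbers are copied from the table.

```python
# Column order follows the Mar 2025 table.
BENCHMARKS = ["Math-L5", "MMLU-Pro (Math)", "CRUX", "LiveCodeBench", "Simple QA"]

# For each base model: (base scores, scores with CePO on top of that model).
scores = {
    "Llama 3.3 70B": ([51.0, 78.6, 72.6, 27.1, 20.9], [69.6, 84.8, 80.1, 31.9, 22.6]),
    "QwQ 32B": ([61.4, 90.8, 82.5, 44.3, 7.8], [88.1, 92.0, 86.3, 51.5, 8.2]),
    "DeepSeek R1 Llama": ([83.1, 82.0, 84.0, 47.3, 14.6], [90.2, 84.0, 89.4, 47.2, 15.5]),
}


def gain(base, cepo):
    """Absolute point difference (CePO minus base), rounded to one decimal."""
    return [round(c - b, 1) for b, c in zip(base, cepo)]


for model, (base, cepo) in scores.items():
    deltas = dict(zip(BENCHMARKS, gain(base, cepo)))
    print(model, deltas)
```

On these numbers the largest gain is Math-L5 with QwQ 32B (61.4 to 88.1), and the only decrease is LiveCodeBench with DeepSeek R1 Llama (47.3 to 47.2).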

### coc-claude-3-5-sonnet-20241022 on AIME 2024 pass@1 (Nov 2024)

20 changes: 2 additions & 18 deletions optillm/cepo/README.md
@@ -6,7 +6,7 @@ If you have any questions or want to contribute, please reach out to us on [cere

## CePO Methodology

-In CePO, the Best of N technique is applied to `bestofn_n` solution candidates. Each solution is generated through the following four steps:
+In CePO, the Best of N technique is applied to `bestofn_n` solution candidates. Optionally (when `cepo_use_plan_diversity` is set to `True`), the model will attempt to come up with diverse approaches for each of best of n completions. Each completion is generated through the following four steps:

**Step 1**: Plan Generation
The model generates a detailed, step-by-step plan to solve the problem, along with its confidence level for each step.
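
The flow described in the changed paragraph can be sketched as follows. This is an illustrative outline under stated assumptions, not the CePO implementation: `best_of_n`, `generate`, `rate`, and `propose_approaches` are hypothetical names standing in for LLM calls, the four planning steps are collapsed into a single `generate` call, and picking the highest independent score is only one possible rating scheme. The `rating_model` argument mirrors `--cepo_rating_model` and falls back to the completion model when unset.

```python
from typing import Callable, List, Optional


def best_of_n(
    question: str,
    n: int,
    model: str,
    generate: Callable[[str, str, Optional[str]], str],
    rate: Callable[[str, str, str], float],
    propose_approaches: Optional[Callable[[str, str, int], List[str]]] = None,
    rating_model: Optional[str] = None,
) -> str:
    """Return the highest-rated of n candidate completions (hypothetical sketch)."""
    if n < 1:
        raise ValueError("n must be at least 1")

    # Optional plan-diversity step: ask for one distinct approach per candidate.
    approaches: List[Optional[str]] = [None] * n
    if propose_approaches is not None:
        proposed = propose_approaches(model, question, n)
        for i, hint in enumerate(proposed[:n]):
            approaches[i] = hint

    # Each candidate would go through the four planning steps; here that
    # whole pipeline is collapsed into one `generate` call.
    candidates = [generate(model, question, hint) for hint in approaches]

    # Rate with a separate model if one is given, otherwise reuse `model`.
    rater = rating_model or model
    scores = [rate(rater, question, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```

Passing the callables in as arguments keeps the sketch independent of any particular client library; in practice they would wrap chat-completion requests.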
@@ -25,20 +25,4 @@ The model uses the refined plan from Step 3 to produce the final answer.

## CePO Current Status

-This project is a work in progress, and the provided code is in an early experimental stage. While the proposed approach works well across the benchmarks we tested, further improvements can be achieved by task-specific customizations to prompts.
-
-## CePO Ablation studies
-
-We conducted ablation studies to evaluate the impact of various hyperparameters in the CePO framework. Our results indicate that the chosen hyperparameter settings strike a good balance between computational cost and accuracy.
-
-Interestingly, the self-critique and quality improvement capabilities of existing off-the-shelf models do not always scale proportionally with increased inference compute. Addressing this limitation remains a key focus, and we plan to explore custom model fine-tuning as a potential solution in the future.
-
-| bestofn_n | planning_n | planning_m | bestofn_rating_type | Math-L5 | MMLU-Pro (Math) | GPQA | CRUX | Comments |
-| :-------: | :--------: | :--------: | :-----------------: | :-----: | :-------------: | :---: | :---: | :------------- |
-| 3 | 3 | 6 | absolute | 69.6 | 84.8 | 55.5 | 80.1 | Default config |
-| 3 | 3 | 6 | pairwise | 67.7 | 83.5 | 55.6 | 79.8 | |
-| 3 | 2 | 5 | absolute | 67.1 | 85.1 | 55.1 | 79.0 | |
-| 3 | 5 | 8 | absolute | 69.4 | 84.3 | 55.6 | 81.1 | |
-| 5 | 3 | 6 | absolute | 68.7 | 85.4 | 54.8 | 79.9 | |
-| 7 | 3 | 6 | absolute | 69.6 | 82.8 | 54.7 | 78.4 | |
-| 9 | 3 | 6 | absolute | 68.9 | 83.4 | 55.7 | 80.6 | |
+This project is a work in progress, and the provided code is in an early experimental stage. While the proposed approach works well across the benchmarks we tested, further improvements can be achieved by task-specific customizations to prompts.