🚀 DCPO

Decoupling Reasoning and Confidence in RL from Verifiable Rewards

Official implementation of:

Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

🔎 Motivation

Reinforcement Learning from Verifiable Rewards (RLVR), such as Group Relative Policy Optimization (GRPO), significantly improves LLM reasoning.

However, we show that:

⚠️ RLVR makes models severely over-confident.

During training:

Confidence steadily increases
Over-confidence (PCE) worsens
Calibration error remains large

Existing calibration-aware RL methods reduce miscalibration, but usually hurt reasoning accuracy due to gradient interference.

💡 Our Solution: DCPO

DCPO (Decoupled Calibration Policy Optimization) separates:

🧠 Reasoning optimization
📊 Confidence calibration

at the token level.

Key idea

Seperate reasoning tokens and confidence tokens
Apply accuracy reward to reasoning tokens
Apply calibration reward to confidence tokens
Use masked gradients to avoid interference

This simple decoupling eliminates the accuracy–calibration conflict.

📊 Results

Evaluated on:

MATH-500
AIME 2024 / 2025
AMC 2023 / 2024

Base model: Qwen3-8B

Environment Setup

The code has been successfully tested on 8 × 80GB A100 GPUs with CUDA 12.8.
To create a Conda environment, run the following commands:

git clone https://github.com/mazhengzhao/DCPO.git
cd DCPO
conda env create -f environment.yml

Running the Code

After setting up the environment, run the following command to start training:

bash examples/Qwen3-8B.sh

Evaluating Metrics

To compute evaluation metrics such as Accuracy, Expected Calibration Error (ECE), Brier Score (BS) and Positive Calibration Error (PCE)

Deploy a vllm service of your model

python -m vllm.entrypoints.openai.api_server \
    --model models/Qwen3-8B \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max-model-len 8192 \
    --port 8000

identify your model name and service url in examples/eval.sh and run:

bash examples/eval.sh

The script will log output in folder logs/model_name/ and plot calibration curves in Figs/model_name.

🙏 Acknowledgements

This repository builds upon the following open-source projects, to which we are deeply grateful: verl, AR-Lopti, LogicRL, DeepScaleR, AdaRFT, CCGSPG

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
Figs		Figs
data		data
examples		examples
recipe		recipe
scripts		scripts
tests		tests
verl		verl
.gitignore		.gitignore
.readthedocs.yaml		.readthedocs.yaml
.style.yapf		.style.yapf
Readme.md		Readme.md
calibration_main.py		calibration_main.py
environment.yml		environment.yml
requirements.txt		requirements.txt
requirements_sglang.txt		requirements_sglang.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🚀 DCPO

Decoupling Reasoning and Confidence in RL from Verifiable Rewards

🔎 Motivation

💡 Our Solution: DCPO

Key idea

📊 Results

Environment Setup

Running the Code

Evaluating Metrics

🙏 Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🚀 DCPO

Decoupling Reasoning and Confidence in RL from Verifiable Rewards

🔎 Motivation

💡 Our Solution: DCPO

Key idea

📊 Results

Environment Setup

Running the Code

Evaluating Metrics

🙏 Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages