Official implementation of:
Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
Reinforcement Learning from Verifiable Rewards (RLVR), with algorithms such as Group Relative Policy Optimization (GRPO), significantly improves LLM reasoning.
However, we show that:
⚠️ RLVR makes models severely over-confident.
During training:
- Confidence steadily increases
- Over-confidence (PCE) worsens
- Calibration error remains large
Existing calibration-aware RL methods reduce miscalibration, but usually hurt reasoning accuracy due to gradient interference.
DCPO (Decoupled Calibration Policy Optimization) separates:
- 🧠 Reasoning optimization
- 📊 Confidence calibration
at the token level.
- Separate reasoning tokens and confidence tokens
- Apply accuracy reward to reasoning tokens
- Apply calibration reward to confidence tokens
- Use masked gradients to avoid interference
This simple decoupling eliminates the accuracy–calibration conflict.
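The token-level split described above can be sketched in a few lines of plain Python. This is an illustrative reading of the steps, not the repository's training code: the function names, the GRPO-style group normalization of both rewards, and the Brier-style calibration reward `-(confidence - correct)^2` are all assumptions made for the example. Each token receives only the advantage that matches its role, so when these values weight the per-token policy-gradient term, the calibration signal never reaches reasoning tokens and vice versa.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """GRPO-style advantage: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_token_advantages(correct, confidence, conf_masks):
    """Assign a per-token advantage for each response in a group.

    correct:    0/1 verifier outcome per response
    confidence: stated confidence in [0, 1] per response
    conf_masks: per response, a list of 0/1 flags (1 marks a confidence token)
    """
    # Accuracy advantage, applied to reasoning tokens only.
    acc_adv = group_normalize([float(y) for y in correct])
    # ASSUMPTION: calibration reward is the negative squared error between
    # stated confidence and correctness; the actual reward may differ.
    cal_adv = group_normalize(
        [-(c - y) ** 2 for c, y in zip(confidence, correct)]
    )
    per_token = []
    for a, c, mask in zip(acc_adv, cal_adv, conf_masks):
        # The mask routes each token to exactly one of the two signals.
        per_token.append([c if is_conf else a for is_conf in mask])
    return per_token


if __name__ == "__main__":
    # Two correct answers: one stated with high confidence, one with low.
    advantages = decoupled_token_advantages(
        correct=[1, 1],
        confidence=[0.9, 0.2],
        conf_masks=[[0, 0, 1], [0, 0, 1]],
    )
    for row in advantages:
        print(row)
```

In this toy group both answers are correct, so the reasoning tokens get zero accuracy advantage, while the confidence tokens are still pushed toward the better-calibrated statement.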
Evaluated on:
- MATH-500
- AIME 2024 / 2025
- AMC 2023 / 2024
Base model: Qwen3-8B
The code has been successfully tested on 8 × 80GB A100 GPUs with CUDA 12.8.
To create a Conda environment, run the following commands:
git clone https://github.com/mazhengzhao/DCPO.git
cd DCPO
conda env create -f environment.yml
After setting up the environment, run the following command to start training:
bash examples/Qwen3-8B.sh
To compute evaluation metrics such as Accuracy, Expected Calibration Error (ECE), Brier Score (BS), and Positive Calibration Error (PCE), first deploy a vLLM service for your model:
python -m vllm.entrypoints.openai.api_server \
--model models/Qwen3-8B \
--tensor-parallel-size 1 \
--dtype float16 \
--max-model-len 8192 \
--port 8000
Then set your model name and service URL in examples/eval.sh and run:
bash examples/eval.sh
The script writes logs to the folder logs/model_name/ and plots calibration curves in Figs/model_name.
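For readers who want to sanity-check numbers from the logs, the sketch below computes two of the metrics named above, ECE and Brier Score, using their standard textbook definitions. It is not taken from examples/eval.sh: the function names, the choice of 10 equal-width bins, and the half-open bin convention are assumptions for illustration, and the evaluation script may differ. PCE is left out because its exact definition is not stated here.

```python
def brier_score(confidences, labels):
    """Mean squared difference between stated confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(confidences, labels)) / len(labels)


def expected_calibration_error(confidences, labels, n_bins=10):
    """Standard ECE: bin-size-weighted average of |accuracy - mean confidence|.

    ASSUMPTION: equal-width bins [k/n_bins, (k+1)/n_bins), with a confidence
    of exactly 1.0 placed in the last bin.
    """
    total = len(labels)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        index = min(int(c * n_bins), n_bins - 1)
        bins[index].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical confidences and correctness labels for five answers.
    confs = [0.95, 0.90, 0.85, 0.60, 0.30]
    labels = [1, 0, 1, 1, 0]
    print("ECE:", expected_calibration_error(confs, labels))
    print("Brier:", brier_score(confs, labels))
```

Both functions take one confidence in [0, 1] and one 0/1 correctness label per evaluated answer; lower values indicate better calibration.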
This repository builds upon the following open-source projects, to which we are deeply grateful: verl, AR-Lopti, LogicRL, DeepScaleR, AdaRFT, CCGSPG

