confidence reward and accuracy reward are assigned the wrong way round #2

@ShiyuNee

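Bug: accuracy and calibration advantage signals are swapped in DCPO

Problem description

When HybridRewardManager (and ConfidenceRewardManager) is used together with compute_confidence_outcome_advantage, the accuracy signal and the calibration signal are swapped during advantage assignment: tokens in the answer region are driven by the calibration signal, and tokens in the confidence region are driven by the correctness signal, which is the opposite of the design intent.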

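Root cause

confidence_pos_list stores the position where the accuracy reward is written (conf_token_pos - 1), not the position of the confidence token. When compute_confidence_outcome_advantage reads from that position, the nominal confidence_scores actually holds the accuracy reward, and answer_scores actually holds the calibration penalty.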

Code involved

1. hybrid.py, lines 137-149: where the reward is written

# hybrid.py:137-149
if seperate_confidence:
    confidence_score, confidence_pos = token_reward_func(prompt=prompt, response=response, tokenizer=self.tokenizer)
    
    if confidence_pos is not None:
        pos = max(confidence_pos - 1, 0)       # ← accuracy reward is placed at conf_token_pos - 1
        reward_tensor[i, pos] = reward           # ← writes accuracy (0 or 1)
        confidence_pos_list.append(pos)          # ← stores conf_token_pos - 1
        extra_info["confidence_pos"] = confidence_pos
    else:
        pos = max(valid_response_length - 2, 0)  # ← accuracy reward is placed on the second-to-last token
        reward_tensor[i, pos] = reward
        confidence_pos_list.append(pos)
        extra_info["confidence_pos"] = valid_response_length - 1

2. hybrid.py, lines 209-216: where the calibration penalty is written

# hybrid.py:209-216
for i in range(len(data)):
    data_item = data[i]
    prompt_ids = data_item.batch['prompts']
    prompt_length = prompt_ids.shape[-1]
    valid_response_length = data_item.batch['attention_mask'][prompt_length:].sum()

    reward_tensor[i, valid_response_length - 1] -= weight * calibration_scores[i]
    # ↑ the calibration penalty is always placed on the last valid token
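
The two writes in items 1 and 2 can be replayed on a single response to see where each signal lands. This is only a minimal sketch: `build_reward_row` is a hypothetical helper, a plain Python list stands in for one row of `reward_tensor`, and `weight=0.5` is assumed from the `-0.5 × cal` comment in item 4.

```python
def build_reward_row(response_len, valid_len, conf_token_pos,
                     accuracy, calibration, weight=0.5):
    """Replay the two reward writes for one sample (hypothetical helper)."""
    row = [0.0] * response_len

    # Write 1 (hybrid.py:137-149): accuracy goes one token before the
    # confidence token, or on the second-to-last valid token as a fallback.
    if conf_token_pos is not None:
        pos = max(conf_token_pos - 1, 0)
    else:
        pos = max(valid_len - 2, 0)
    row[pos] = accuracy

    # Write 2 (hybrid.py:209-216): the calibration penalty is subtracted
    # from the last valid token.
    row[valid_len - 1] -= weight * calibration
    return row, pos


# Response padded to 6 tokens, 5 valid, confidence token at index 3.
row, pos = build_reward_row(6, 5, 3, accuracy=1.0, calibration=0.25)
print(row, pos)  # [0.0, 0.0, 1.0, 0.0, -0.125, 0.0] 2
```

The value that ends up in `confidence_pos_list` is `pos` (index 2 here), which is the slot holding the accuracy reward, not the slot holding the calibration penalty.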

3. hybrid.py, line 218: confidence_pos passed into non_tensor_batch

# hybrid.py:218
reward_extra_info["confidence_pos"] = pos  # pos = confidence_pos_list
# → stores [conf_token_pos - 1, ...], i.e. the positions holding the accuracy reward

4. core_algos.py, lines 208-211: extracting the two scores from reward_tensor

# core_algos.py:208-211
for i in range(bs):
    cpos = confidence_pos[i].item()    # = conf_token_pos - 1
    confidence_scores[i] = token_level_rewards[i, cpos]          # actually reads the accuracy reward!
    answer_scores[i] = token_level_rewards[i].sum() - confidence_scores[i]  # actually reads -0.5 × cal!
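
A small worked example makes the mislabelled read concrete. The sketch below assumes a reward row with accuracy 1.0 written at index 2 and a calibration penalty of 0.5 × 0.25 on the last valid token; `split_scores` is a hypothetical stand-in for the loop body above, operating on a plain list rather than a tensor.

```python
def split_scores(row, cpos):
    """Mirror the extraction in core_algos.py:208-211 for one sample."""
    confidence_score = row[cpos]               # nominally the confidence score
    answer_score = sum(row) - confidence_score # nominally the answer score
    return confidence_score, answer_score


# Accuracy (1.0) sits at index 2; the calibration penalty (-0.125) sits at index 4.
row = [0.0, 0.0, 1.0, 0.0, -0.125, 0.0]
cpos = 2  # the value stored in confidence_pos_list, i.e. conf_token_pos - 1

confidence_score, answer_score = split_scores(row, cpos)
print(confidence_score)  # 1.0    -> this is the accuracy reward
print(answer_score)      # -0.125 -> this is -weight * calibration
```

Under these assumptions, the variable named for confidence carries the accuracy reward and the variable named for the answer carries the calibration penalty.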

5. core_algos.py, lines 228-236: assigning advantages by token region

# core_algos.py:228-236
for i in range(bs):
    cpos = confidence_pos[i].item()
    for t in range(response_len):
        if not response_mask[i, t]:
            continue
        if t >= cpos:                    # confidence region
            advantages[i, t] = confidence_adv[i]   # what actually gets assigned is the accuracy advantage
        else:                            # answer region
            advantages[i, t] = answer_adv[i]        # what actually gets assigned is the calibration advantage
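
The per-token effect of this loop can be illustrated for one sample. In the sketch below, `assign_advantages` is a hypothetical helper that mirrors the loop on plain lists, and the two advantage values are placeholders set to the raw scores (accuracy-derived 1.0, calibration-derived -0.125) rather than real normalized advantages.

```python
def assign_advantages(response_mask, cpos, confidence_adv, answer_adv):
    """Mirror the region split in core_algos.py:228-236 for one sample."""
    advantages = [0.0] * len(response_mask)
    for t, valid in enumerate(response_mask):
        if not valid:
            continue  # padding tokens keep a zero advantage
        # Tokens at or after cpos are treated as the confidence region.
        advantages[t] = confidence_adv if t >= cpos else answer_adv
    return advantages


response_mask = [1, 1, 1, 1, 1, 0]  # 5 valid tokens, 1 padding token
cpos = 2                            # conf_token_pos - 1

# Placeholder values: confidence_adv is derived from the accuracy reward,
# answer_adv from the calibration penalty, because of the mislabelled read.
advantages = assign_advantages(response_mask, cpos,
                               confidence_adv=1.0, answer_adv=-0.125)
print(advantages)  # [-0.125, -0.125, 1.0, 1.0, 1.0, 0.0]
```

In this toy case the answer tokens (indices 0 and 1) receive the calibration-derived value, while the tokens from `cpos` onward receive the accuracy-derived value. Because `cpos` is `conf_token_pos - 1`, the token just before the confidence token also appears to fall into the confidence region.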
