20260514: Add new content
47
concepts/exponential-decay-reward.md
Normal file
@@ -0,0 +1,47 @@
---
title: "Exponential Decay Reward"
domain: "Reinforcement Learning / Reward Design"
tags: [reward, counting, grpo, exponential-decay]
sources: [[thinking-with-visual-primitives]]
---

# Exponential Decay Reward

> A smooth reward function for counting tasks: instead of a binary right/wrong signal, the reward decays exponentially with the relative error, so predictions closer to the true count earn a higher reward.

## Formula

$$R(\hat{y}, y) = \alpha \cdot \exp\left(-\beta \cdot \frac{|\hat{y} - y|}{|y| + 1}\right)$$
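
where:

- $\hat{y}$: predicted count
- $y$: ground-truth count
- $|y| + 1$: normalization term that makes the reward depend on the **relative error**; the $+1$ keeps the denominator nonzero when $y = 0$
- $\alpha = 0.7$: reward scaling coefficient, which is also the maximum reward (reached when $\hat{y} = y$)
- $\beta = 3$: decay rate
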
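The formula translates almost directly into code. The following is a minimal sketch, not the source's implementation: the function name `exp_decay_reward` and the constant names `ALPHA` and `BETA` are assumptions chosen for illustration, while the values 0.7 and 3 come from the definitions above.

```python
import math

ALPHA = 0.7  # reward scaling coefficient (value taken from the note)
BETA = 3.0   # decay rate (value taken from the note)


def exp_decay_reward(pred: float, target: float,
                     alpha: float = ALPHA, beta: float = BETA) -> float:
    """Reward that decays exponentially with the normalized counting error.

    The function name and signature are hypothetical; only the formula
    itself comes from the note.
    """
    # Normalize by |target| + 1 so the error is relative and the
    # denominator is never zero, even when the true count is 0.
    rel_err = abs(pred - target) / (abs(target) + 1)
    return alpha * math.exp(-beta * rel_err)


if __name__ == "__main__":
    # Print the reward for a few predictions against a true count of 10.
    for pred in (10, 9, 5, 0):
        print(pred, round(exp_decay_reward(pred, 10), 3))
```

Keeping `alpha` and `beta` as keyword arguments makes it easy to try other scaling or decay settings without touching the formula.
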
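## Design Motivation

Problems with a conventional binary (right/wrong) reward:

- Predicting 99 when the true count is 100 earns zero reward, the same as predicting 1.
- It gives no graded signal that would help the model move closer to the correct answer.

Advantages of the exponential decay reward:

- **Smooth gradient**: a prediction of 99 against a true count of 100 still earns a high reward (about 0.68 out of a maximum of 0.7).
- **Relative error**: at large counts, a small absolute deviation is penalized less.
- **Stable training**: it avoids the sparse-reward problem in RL.
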
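The contrast between the two reward styles can be checked numerically. This is an illustrative sketch only: `binary_reward` and `exp_decay_reward` are hypothetical helper names, and the 0/1 scale of the binary baseline is an assumption, since the note does not specify one.

```python
import math


def binary_reward(pred: int, target: int) -> float:
    """All-or-nothing baseline; the 0/1 scale is an assumption."""
    return 1.0 if pred == target else 0.0


def exp_decay_reward(pred: float, target: float,
                     alpha: float = 0.7, beta: float = 3.0) -> float:
    """Exponential decay reward using the alpha and beta values from the note."""
    rel_err = abs(pred - target) / (abs(target) + 1)
    return alpha * math.exp(-beta * rel_err)


if __name__ == "__main__":
    # A near miss and a far miss are indistinguishable under the binary reward.
    print("binary:", binary_reward(99, 100), binary_reward(1, 100))

    # The exponential decay reward separates them clearly.
    print("exp decay:", exp_decay_reward(99, 100), exp_decay_reward(1, 100))

    # The same absolute error (off by one) costs less at a larger true count.
    print("off by one:", exp_decay_reward(9, 10), exp_decay_reward(99, 100))
```
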
## Examples

| Prediction | Ground truth | Relative error | Reward |
|------------|--------------|----------------|--------|
| 10 | 10 | 0 | 0.7 |
| 9 | 10 | 0.091 | 0.53 |
| 5 | 10 | 0.455 | 0.18 |
| 0 | 10 | 0.909 | 0.046 |

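As a check on the second row of the table, here is the substitution worked out for $\hat{y} = 9$ and $y = 10$ with $\alpha = 0.7$ and $\beta = 3$:

$$R(9, 10) = 0.7 \cdot \exp\left(-3 \cdot \frac{|9 - 10|}{|10| + 1}\right) = 0.7 \cdot \exp\left(-\frac{3}{11}\right) \approx 0.7 \cdot 0.761 \approx 0.53$$

The relative error column is the fraction inside the exponent, here $1/11 \approx 0.091$.
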
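## Related Concepts

- [[group-relative-policy-optimization|Group Relative Policy Optimization]]: the RL algorithm that uses this reward
- [[coarse-grained-counting|Coarse-grained counting]] / [[fine-grained-counting|Fine-grained counting]]: application scenarios
- [[reward-model|Reward model]]: the broader reward design framework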