20260601

2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions
--- a/concepts/negative-sample-reinforcement.md
+++ b/concepts/negative-sample-reinforcement.md
@@ -0,0 +1,46 @@
+---
+title: "Negative Sample Reinforcement (NSR)"
+created: 2026-05-18
+type: concept
+tags: ["reinforcement-learning", "LLM", "GRPO", "reasoning"]
+sources: ["https://arxiv.org/abs/2604.14142"]
+---
+
+# Negative Sample Reinforcement (NSR)
+
+## Definition
+
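+NSR is an RL mechanism that acts on **negative samples** (samples that receive a negative advantage): it **suppresses** incorrect reasoning trajectories by minimizing log π(y|x). In the pre-train space P(y), NSR is far more effective than [[positive-sample-reinforcement|PSR]].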
+
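+## Key Findings
+
+### Effects of NSR-PreRL
+
+1. **Prunes incorrect paths**: effectively eliminates universal incorrect patterns
+2. **Elicits endogenous reasoning**: transition thoughts increase by **14.89×**, reflection thoughts by **6.54×**
+3. **Sample efficiency**: only 20 steps of NSR-PreRL reach the accuracy that standard RL needs 60+ steps to achieve (AMC23: 86%)
+4. **Double-edged sword**: excessive NSR makes outputs overly long, which hinders subsequent training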
+
+### Comparison with NSR-RL
+
+| Method | Avg@32 (Qwen3-4B) |
+|------|-------------------|
+| Vanilla | 41.26 |
+| GRPO | 55.79 |
+| NSR-RL Warmup | 54.38 |
+| **NSR-PreRL Warmup (DSRL)** | **57.54** |
+
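+As a warmup in the post-train space, NSR-RL even scores **below** the GRPO baseline (54.38 vs. 55.79), indicating that NSR's effectiveness depends on operating in the pre-train space.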
+
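+## Mechanism
+
+- In the pre-train space, NSR redistributes probability mass, moving it away from incorrect trajectories and toward correct reasoning directions
+- This redistribution preserves exploration ability (unlike directly sharpening the conditional distribution)
+- The initialization provided by NSR-PreRL lets subsequent RL focus on problem-specific, fine-grained optimization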
+
+## Related Concepts
+
+- [[positive-sample-reinforcement|PSR]]: the degeneration problem of positive sample reinforcement
+- [[pre-train-space-reinforcement-learning|PreRL]]
+- [[dual-space-rl|DSRL]]
+- [[endogenous-reasoning|endogenous reasoning]]
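
The note added in this file describes NSR as minimizing the log-probability of negatively-advantaged samples so that probability mass moves away from them. A minimal toy sketch of that idea follows. It is not the paper's implementation: the names `nsr_loss` and `nsr_step` are hypothetical, and a three-way categorical distribution stands in for the policy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nsr_loss(log_probs, advantages):
    """Mean log-probability over samples with negative advantage.

    Minimizing this value pushes down the likelihood of the negative
    samples. Samples with non-negative advantage are ignored
    (assumption: a plain mean, with no advantage weighting).
    """
    negatives = [lp for lp, a in zip(log_probs, advantages) if a < 0]
    return sum(negatives) / len(negatives) if negatives else 0.0


def nsr_step(logits, bad_index, lr=1.0):
    """One gradient-descent step on log p[bad_index] w.r.t. the logits.

    For a softmax, d log p_k / d z_j = 1[j == k] - p_j, so the penalized
    logit drops while every other logit rises in proportion to its
    current probability.
    """
    probs = softmax(logits)
    grad = [(1.0 if j == bad_index else 0.0) - p for j, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grad)]


# Toy example: outcome 0 is the "incorrect trajectory".
logits = [2.0, 1.0, 0.0]
before = softmax(logits)
after = softmax(nsr_step(logits, bad_index=0))
```

In this toy setting the penalized outcome loses probability and the remaining outcomes gain it in proportion to their current probability, so no single alternative is forced to dominate. This is one way to read the note's claim that the redistribution preserves exploration.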