20260601

2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions
--- a/papers/lou-autoharness-2026.md
+++ b/papers/lou-autoharness-2026.md
@@ -0,0 +1,61 @@
+---
+title: "AutoHarness: Automatic Code Harness Synthesis for LLM Agents"
+created: 2026-05-29
+updated: 2026-05-29
+type: paper
+arxiv: "2603.03329"
+authors: ["Xinghua Lou", "Miguel Lázaro-Gredilla", "Antoine Dedieu", "Carter Wendelken", "Wolfgang Lehrach", "Kevin P. Murphy"]
+venue: "arXiv cs.CL, February 2026"
+tags: ["agent", "code-synthesis", "game-playing", "harness", "LLM"]
+sources: ["https://arxiv.org/abs/2603.03329"]
+---
+
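+# AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness
+
+> **Paper**: Lou, Lázaro-Gredilla, Dedieu, Wendelken, Lehrach & Murphy (Google DeepMind, 2026), arXiv:2603.03329
+
+## Core Problem
+
+LLM agents frequently produce **illegal actions** in structured environments such as games: in the Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash's losses were caused by illegal moves. These are not strategic mistakes but **outright rule violations**.
+
+The traditional remedies both fall short: hand-written harnesses are brittle and labor-intensive, while fine-tuning is expensive and degrades general capabilities. **Can an LLM automatically synthesize protective code against its own illegal actions?**
+
+## Method: Code-as-Harness
+
+AutoHarness uses the LLM's own code-generation ability to close the gap between what the model proposes and what the environment's rules allow: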
+
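+### Search Mechanism
+- **Thompson Sampling-guided tree search**: balances exploration and exploitation over the space of harness code
+- The LLM acts as the mutation operator: it iteratively refines the code based on environment feedback
+- A critic supplies the feedback: action legality and environment reward
+
+### Three Harness Modes
+
+| Mode | Mechanism | LLM role |
+|------|-----------|----------|
+| **[[harness-as-action-verifier\|Verifier]]** | LLM proposes → code verifies → retry if illegal | Policy maker |
+| **Action Filter** | Code generates the set of legal actions → LLM ranks them | Ranker |
+| **[[harness-as-policy\|Policy]]** | Code selects the action directly → **no LLM inference needed** | Used only during training |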
+
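+## Key Results
+
+1. **100% legal action rate**: illegal actions are eliminated entirely across 145 TextArena games
+2. **Smaller model beats larger model**: Gemini-2.5-Flash + Harness outperforms Gemini-2.5-Pro
+3. **Code-as-Policy at its peak**: the generated pure-code policy reaches an average reward of **0.870** on 16 single-player (1P) games, exceeding GPT-5.2-High (0.844)
+4. **Near-zero inference cost**: Harness-as-Policy costs close to nothing at test time (vs. ~$640 for GPT-5.2)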
+
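+## Core Insight
+
+> Having a smaller model automatically synthesize protective code for its own weak spots can outperform a larger model running without any harness, and it is cheaper too.
+
+This reflects the core philosophy of [[code-as-harness]]: **the goal is not to make the LLM perfect, but to make it something code can constrain and protect.**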
+
+## Concept Network
+
+- [[autoharness]]: method overview
+- [[code-as-harness]]: framework philosophy
+- [[harness-as-action-verifier]]: verifier mode
+- [[harness-as-policy]]: code as policy
+- [[thompson-sampling-code-search]]: search algorithm
+- [[iterative-code-refinement]]: iterative refinement
+- [[action-applicability]]: the problem of deciding whether an action is legal
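
The Verifier mode and the Thompson Sampling-guided search summarized in the note above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the paper's implementation: the toy board, `is_legal`, `act_with_verifier`, `Candidate`, `pick_candidate`, `thompson_search`, the Beta-Bernoulli posterior, and the simulated legality rates that stand in for real environment rollouts. The `propose` callable stands in for an LLM call, and the search is shown as a flat choice among fixed candidates, whereas the paper describes a tree search in which the LLM mutates harness code.

```python
import random


def is_legal(board, move):
    """Toy verifier (hypothetical): a move is legal if it indexes an empty cell."""
    return isinstance(move, int) and 0 <= move < len(board) and board[move] == "."


def act_with_verifier(propose, board, max_retries=5):
    """Verifier mode: propose -> check in code -> retry with feedback if illegal.

    `propose(board, feedback)` stands in for an LLM call. Returns the first
    legal move, or None if every attempt was rejected.
    """
    feedback = None
    for _ in range(max_retries):
        move = propose(board, feedback)
        if is_legal(board, move):
            return move
        feedback = f"move {move!r} is illegal"  # passed back to the proposer
    return None


class Candidate:
    """One candidate harness, scored by how often its rollouts stay legal."""

    def __init__(self, name):
        self.name = name
        self.successes = 0
        self.failures = 0

    def sample(self, rng):
        # Assumed Beta(1 + successes, 1 + failures) posterior over the legal-move rate.
        return rng.betavariate(1 + self.successes, 1 + self.failures)


def pick_candidate(candidates, rng):
    """Thompson sampling: draw once from each posterior and take the best draw."""
    return max(candidates, key=lambda c: c.sample(rng))


def thompson_search(true_rates, rounds, seed=0):
    """Repeatedly pick a candidate, simulate one rollout, update its posterior."""
    rng = random.Random(seed)
    candidates = [Candidate(name) for name in true_rates]
    for _ in range(rounds):
        chosen = pick_candidate(candidates, rng)
        # Simulated critic signal; a real system would run the environment here.
        if rng.random() < true_rates[chosen.name]:
            chosen.successes += 1
        else:
            chosen.failures += 1
    return candidates
```

With two candidates whose simulated legality rates differ widely, `thompson_search` should spend most of its rounds on the better one while still occasionally trying the other, which is the exploration and exploitation balance the note refers to.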