20260601
54
concepts/harness-as-policy.md
Normal file
@@ -0,0 +1,54 @@
---
title: "Harness-as-Policy (Code as Policy)"
created: 2026-05-29
updated: 2026-05-29
type: concept
tags: ["agent", "code-synthesis", "policy", "LLM"]
sources: ["https://arxiv.org/abs/2603.03329"]
---

# Harness-as-Policy (Code as Policy)
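
**Harness-as-Policy** is the end state of [[autoharness|AutoHarness]]: code generated automatically by an LLM **decides the action directly**, and no LLM is called at inference time. Of the modes in the [[code-as-harness]] framework, it is the least constrained, the most flexible, and the boldest.
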
## How it differs fundamentally from the other modes

| Mode | LLM call at inference | Role of the code |
|------|:---:|------|
| [[harness-as-action-verifier|Verifier]] | ✅ | Legality guard |
| Action Filter | ✅ | Candidate generator |
| **Policy** | ❌ | **Decision maker** |

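The three roles in the table can be contrasted in a minimal sketch. Everything here is an illustrative assumption rather than the paper's code: the toy counting game, the `LLM` callable type, the retry count, the fallback action, and the greedy rule inside `act_with_policy` are all hypothetical. The only point is where the LLM call sits in each mode.

```python
from typing import Callable, List

# Hypothetical toy game: the state is a counter, and a move adds 1, 2, or 3
# without overshooting TARGET. Callers must pass a non-terminal state
# (state < TARGET), so at least one legal action always exists.
TARGET = 10

# An "LLM" is modeled as any callable from a prompt string to a reply string.
LLM = Callable[[str], str]


def legal_actions(state: int) -> List[int]:
    """Harness code: enumerate the legal moves for a state."""
    return [a for a in (1, 2, 3) if state + a <= TARGET]


def parse_action(reply: str, legal: List[int]) -> int:
    """Return the parsed action if it is legal, otherwise -1."""
    try:
        choice = int(reply)
    except ValueError:
        return -1
    return choice if choice in legal else -1


def act_with_verifier(state: int, llm: LLM, max_retries: int = 3) -> int:
    """Verifier mode: the LLM proposes, the code only guards legality."""
    legal = legal_actions(state)
    for _ in range(max_retries):
        action = parse_action(llm(f"state={state}; choose an action"), legal)
        if action != -1:
            return action
    return legal[0]  # assumed fallback after repeated illegal proposals


def act_with_filter(state: int, llm: LLM) -> int:
    """Action Filter mode: the code generates candidates, the LLM picks one."""
    legal = legal_actions(state)
    action = parse_action(llm(f"state={state}; pick one of {legal}"), legal)
    return action if action != -1 else legal[0]


def act_with_policy(state: int) -> int:
    """Policy mode: the code decides by itself; no LLM parameter exists."""
    return max(legal_actions(state))  # hypothetical greedy rule
```

Note that `act_with_policy` takes no `llm` argument at all, which is what makes its inference cost independent of any model.
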
## Training

- The heuristic value is modified to include the reward: `H = 0` (illegal) / `H = 0.5 + 0.5r` (legal)
- Uses Gemini-2.5-Flash, with at most 256 iterations
- 89.4 iterations on average, reaching a heuristic value of 0.939

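The heuristic and the iteration cap above can be sketched as follows. This is a sketch under stated assumptions, not the paper's implementation: it assumes the reward `r` lies in `[0, 1]`, and the `refine` loop is a plain keep-the-best search with hypothetical `propose` and `evaluate` callables standing in for the LLM code generator and the game rollouts. The actual AutoHarness search procedure may differ.

```python
from typing import Callable, Optional, Tuple


def heuristic_value(legal: bool, reward: float = 0.0) -> float:
    """H = 0 if the play was illegal, otherwise H = 0.5 + 0.5 * r."""
    return 0.5 + 0.5 * reward if legal else 0.0


def refine(
    propose: Callable[[Optional[str], float], str],
    evaluate: Callable[[str], Tuple[bool, float]],
    max_iters: int = 256,   # iteration cap stated in the note
    target: float = 1.0,    # assumed early-stop threshold
) -> Tuple[Optional[str], float, int]:
    """Keep-the-best loop (an assumption, not the paper's search).

    propose(best_code, best_h) returns new candidate policy code;
    evaluate(code) returns (legal, reward) for that candidate.
    Returns (best_code, best_h, iterations_used).
    """
    best_code: Optional[str] = None
    best_h = -1.0
    for i in range(1, max_iters + 1):
        candidate = propose(best_code, best_h)
        legal, reward = evaluate(candidate)
        h = heuristic_value(legal, reward)
        if h > best_h:
            best_code, best_h = candidate, h
        if best_h >= target:
            return best_code, best_h, i
    return best_code, best_h, max_iters
```

Under this scoring, any legal policy scores at least 0.5, so legality dominates the signal first and reward only separates candidates that already play legally.
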
## Results

Across 16 TextArena 1P (single-player) games:

| Agent | Mean reward | Test-time cost |
|-------|:---:|------|
| **Harness-as-Policy** (ours) | **0.870** | ~$0 |
| GPT-5.2-High | 0.844 | ~$640 |
| Gemini-2.5-Pro | 0.707 | moderate |
| GPT-5.2 | 0.635 | ~$640 |

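## Core advantages

1. **Zero inference cost**: the policy runs as pure Python code and needs no GPU
2. **Beats larger models**: a code policy trained with a small model (Flash) outperforms GPT-5.2-High
3. **Deployability**: the code can run directly in a production environment
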
## Limitations

- 2P games require opponent modeling + MCTS, which is harder to handle with pure code
- Currently, a separate policy has to be trained for each game

## Related

- [[code-as-harness]]: framework philosophy
- [[autoharness]]: the full method
- [[lou-autoharness-2026]]: the original paper