SidneyZhang/myWiki

Sidney Zhang e96b955fda

20260601

2026-06-01 10:46:01 +08:00

title: Harness-as-Policy (Code as Policy)
created: 2026-05-29
updated: 2026-05-29
type: concept
tags: agent, code-synthesis, policy, LLM
sources: https://arxiv.org/abs/2603.03329

Harness-as-Policy (Code as Policy)

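Harness-as-Policy is the ultimate form of autoharness: code generated automatically by an LLM decides the action directly, and no LLM is called at inference time. Within the code-as-harness framework it is the least constrained, most flexible, and boldest mode.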

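How it fundamentally differs from the other modes

Mode	LLM call at inference	Role of the code
[[harness-as-action-verifier|Verifier]]	✅	Action verifier
Action Filter	✅	Candidate generator
Policy	❌	Decision maker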
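The contrast in the table can be sketched in a few lines of Python. This is only one plausible reading of the three modes, not the paper's implementation: `LLMFn`, `legal_actions`, `ALL_ACTIONS`, and the three `*_step` functions are hypothetical names, and the toy legality rule (an action is legal if the observation mentions it) is an assumption made purely for illustration.

```python
from typing import Callable, List, Optional

# Hypothetical LLM stand-in: takes an observation and candidate actions,
# returns one action. Not a real API.
LLMFn = Callable[[str, List[str]], str]

ALL_ACTIONS = ["left", "right", "up", "down"]


def legal_actions(observation: str) -> List[str]:
    """Toy harness code (assumed rule): an action is legal if the observation mentions it."""
    words = observation.split()
    return [a for a in ALL_ACTIONS if a in words]


def verifier_step(observation: str, llm: LLMFn, retries: int = 3) -> Optional[str]:
    """Verifier mode: the LLM proposes an action, the code only checks it."""
    legal = legal_actions(observation)
    for _ in range(retries):
        action = llm(observation, ALL_ACTIONS)
        if action in legal:
            return action
    return None  # no legal proposal within the retry budget


def filter_step(observation: str, llm: LLMFn) -> str:
    """Action Filter mode: the code generates candidates, the LLM picks one."""
    return llm(observation, legal_actions(observation))


def policy_step(observation: str) -> str:
    """Policy mode: the code decides alone; note the absence of an llm parameter."""
    return legal_actions(observation)[0]
```

Only `policy_step` has no `llm` argument, which is the property the last table row describes: at inference time the generated code is the whole agent.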

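Training

The heuristic value is modified to include the reward: H = 0 for an illegal action, H = 0.5 + 0.5r for a legal one
Uses Gemini-2.5-Flash, with at most 256 iterations
Takes 89.4 iterations on average and reaches a heuristic value of 0.939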
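The heuristic and the iteration budget above can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the source: that r lies in [0, 1] (so a legal action scores between 0.5 and 1), that search stops early once a target value is reached, and that refinement is a simple greedy loop. The greedy loop is a simplification swapped in for illustration and is not claimed to be the search procedure of autoharness; `propose`, `evaluate`, and `refine` are hypothetical names.

```python
from typing import Callable, Tuple


def heuristic_value(legal: bool, reward: float) -> float:
    """H = 0 for an illegal action, H = 0.5 + 0.5 * r for a legal one."""
    if not legal:
        return 0.0
    return 0.5 + 0.5 * reward


def refine(
    propose: Callable[[str, float], str],   # hypothetical: LLM rewrites the code
    evaluate: Callable[[str], float],       # hypothetical: heuristic value of the code
    initial_code: str,
    max_iters: int = 256,                   # iteration cap taken from the text
    target: float = 1.0,                    # assumed early-stopping threshold
) -> Tuple[str, float, int]:
    """Greedy refinement (an assumed simplification): keep a candidate only if it scores higher."""
    best_code, best_h = initial_code, evaluate(initial_code)
    iters = 0
    while iters < max_iters and best_h < target:
        candidate = propose(best_code, best_h)
        h = evaluate(candidate)
        iters += 1
        if h > best_h:
            best_code, best_h = candidate, h
    return best_code, best_h, iters
```

Under the assumed reward range, any legal action scores at least 0.5 and any illegal one scores 0, so legality is learned first and reward then separates the legal candidates.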

Results

On 16 TextArena 1P games:

Agent	Mean reward	Test cost
Harness-as-Policy (ours)	0.870	~$0
GPT-5.2-High	0.844	~$640
Gemini-2.5-Pro	0.707	moderate
GPT-5.2	0.635	~$640

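Key advantages

Zero inference cost: the policy runs as pure Python code and needs no GPU
Beats larger models: the code policy trained with a small model (Flash) outperforms GPT-5.2-High
Deployability: the code can run directly in a production environment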

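Limitations

2P games require opponent modeling plus MCTS, which is harder to handle in pure code
At present a separate policy has to be trained for each game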

Related

code-as-harness: the framework philosophy
autoharness: the full method
lou-autoharness-2026: the original paper