---
title: "AutoHarness: improving LLM agents by automatically synthesizing a code harness"
created: 2026-05-29
type: paper-raw
arxiv: "2603.03329"
authors: ["Xinghua Lou", "Miguel Lázaro-Gredilla", "Antoine Dedieu", "Carter Wendelken", "Wolfgang Lehrach", "Kevin P. Murphy"]
venue: "arXiv preprint (cs.CL), February 2026"
affiliation: "Google DeepMind"
tags: ["agent", "code-synthesis", "game-playing", "harness", "LLM"]
---

# AutoHarness: improving LLM agents by automatically synthesizing a code harness

**Authors:** Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy
**Affiliation:** Google DeepMind
**arXiv:** [2603.03329](https://arxiv.org/abs/2603.03329) (v1, 10 February 2026)
**Category:** cs.CL (Computation and Language)

## Abstract
Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games.
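
As a rough illustration of the kind of harness the abstract describes, the sketch below wraps a move proposer in a legality check with bounded retries. Everything in it is an assumption made for illustration: the toy tic-tac-toe board, the names `is_legal_move`, `harnessed_move`, and `careless_proposer`, and the fallback to the first legal cell. It is not the code that Gemini-2.5-Flash synthesizes in the paper.

```python
from typing import Callable, List, Optional

Board = List[Optional[str]]  # 9 cells of a toy tic-tac-toe board; None means empty


def is_legal_move(board: Board, move: int) -> bool:
    """Hypothetical legality check: the cell index must exist and be empty."""
    return 0 <= move < len(board) and board[move] is None


def harnessed_move(
    board: Board,
    propose_move: Callable[[Board, List[str]], int],
    max_retries: int = 3,
) -> int:
    """Ask the proposer for a move, rejecting illegal ones with feedback.

    `propose_move` stands in for an LLM call. If it produces no legal move
    within `max_retries` attempts, fall back to the first legal cell
    (an assumed fallback policy, chosen only to keep the sketch simple).
    """
    feedback: List[str] = []
    for _ in range(max_retries):
        move = propose_move(board, feedback)
        if is_legal_move(board, move):
            return move
        feedback.append(f"move {move} is illegal")
    legal = [i for i in range(len(board)) if is_legal_move(board, i)]
    if not legal:
        raise ValueError("no legal moves available")
    return legal[0]


def careless_proposer(board: Board, feedback: List[str]) -> int:
    """Toy stand-in for an LLM that always picks the centre cell."""
    return 4


board: Board = [None] * 9
board[4] = "X"  # the centre is already taken, so cell 4 is illegal
chosen = harnessed_move(board, careless_proposer)
print(chosen)  # 0, because all three proposals were rejected
```

The point of the wrapper is that the environment never sees an illegal action, regardless of how the proposer behaves. A proposer that reads the `feedback` list can correct itself before the fallback is needed.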
## Key Contributions

1. **Code-as-Harness framework**: the LLM synthesizes its own harness, turning the agent from an LLM plus hand-coded plumbing into an LLM plus auto-generated code.
2. **Thompson Sampling tree search**: structured exploration of the space of candidate code harnesses.
3. **Three harness modes**: action-filter, action-verifier, and code-as-policy (no LLM calls at inference time).
4. **100% legal moves** across 145 TextArena games; Gemini-2.5-Flash with the synthesized harness outperforms Gemini-2.5-Pro.
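
To make the Thompson Sampling item concrete, here is a generic Beta-Bernoulli sketch of choosing which candidate harness to evaluate or refine next. These notes do not record the paper's tree structure, reward definition, or refinement step, so the sketch flattens the tree into a list of candidates and assumes a binary reward (an episode with no illegal move counts as a success). `Candidate`, `select_candidate`, `update`, `run_search`, and the success rates are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One candidate harness plus the counts behind its Beta posterior."""
    name: str
    successes: int = 0  # assumed reward signal: episodes with no illegal move
    failures: int = 0


def select_candidate(nodes: List[Candidate], rng: random.Random) -> Candidate:
    """Thompson sampling: sample each Beta(1 + s, 1 + f) posterior, take the max."""
    return max(
        nodes,
        key=lambda n: rng.betavariate(1 + n.successes, 1 + n.failures),
    )


def update(node: Candidate, success: bool) -> None:
    """Record the outcome of one evaluation episode for the chosen candidate."""
    if success:
        node.successes += 1
    else:
        node.failures += 1


def run_search(true_rates: List[float], rounds: int, seed: int = 0) -> List[Candidate]:
    """Simulate the loop; made-up success rates stand in for real game rollouts."""
    rng = random.Random(seed)
    nodes = [Candidate(name=f"harness_{i}") for i in range(len(true_rates))]
    rate_of = {node.name: rate for node, rate in zip(nodes, true_rates)}
    for _ in range(rounds):
        node = select_candidate(nodes, rng)
        update(node, rng.random() < rate_of[node.name])
    return nodes


for node in run_search(true_rates=[0.9, 0.3], rounds=300):
    print(node.name, node.successes, node.failures)
```

In a tree search, the selected node would additionally be expanded with a refined child program. That step is omitted here because it depends on details not captured in these notes.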