---
title: "AutoHarness: improving LLM agents by automatically synthesizing a code harness"
created: 2026-05-29
type: paper-raw
arxiv: "2603.03329"
authors: ["Xinghua Lou", "Miguel Lázaro-Gredilla", "Antoine Dedieu", "Carter Wendelken", "Wolfgang Lehrach", "Kevin P. Murphy"]
venue: "arXiv preprint (cs.CL), February 2026"
affiliation: "Google DeepMind"
tags: ["agent", "code-synthesis", "game-playing", "harness", "LLM"]
---

# AutoHarness: improving LLM agents by automatically synthesizing a code harness

**Authors:** Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy
**Affiliation:** Google DeepMind
**arXiv:** [2603.03329](https://arxiv.org/abs/2603.03329) (v1, 10 February 2026)
**Category:** cs.CL (Computation and Language)

## Abstract

Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games.
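
To make the idea of a harness concrete, here is a minimal sketch on a toy tic-tac-toe board. It is not the code the paper synthesizes. Every name (`legal_moves`, `noisy_proposer`, `harnessed_move`, `code_policy`), the retry limit, and the fallback rule are assumptions made for illustration, and `noisy_proposer` is a random stand-in for an LLM call.

```python
import random
from typing import Callable, List

Board = List[str]  # 9 cells, each "X", "O", or " " (empty)


def legal_moves(board: Board) -> List[int]:
    """Harness code: on this toy board, the legal moves are the empty cells."""
    return [i for i, cell in enumerate(board) if cell == " "]


def noisy_proposer(board: Board, rng: random.Random) -> int:
    """Hypothetical stand-in for an LLM: proposes any cell, legal or not."""
    return rng.randrange(9)


def harnessed_move(
    board: Board,
    propose: Callable[[Board, random.Random], int],
    rng: random.Random,
    max_retries: int = 3,
) -> int:
    """Ask the proposer for a move, but only ever return a legal one."""
    legal = legal_moves(board)
    if not legal:
        raise ValueError("no legal moves on a full board")
    for _ in range(max_retries):
        move = propose(board, rng)
        if move in legal:
            return move
    # Assumed fallback: after repeated illegal proposals, let code decide.
    return legal[0]


def code_policy(board: Board) -> int:
    """Policy written entirely in code: no model call at decision time."""
    legal = legal_moves(board)
    if not legal:
        raise ValueError("no legal moves on a full board")
    # Assumed heuristic: prefer the centre, then corners, then edges.
    for preferred in (4, 0, 2, 6, 8, 1, 3, 5, 7):
        if preferred in legal:
            return preferred
    raise AssertionError("unreachable: legal is non-empty")
```

In this sketch the proposer is free to suggest an occupied cell, but `harnessed_move` never passes one on to the environment. `code_policy` shows the limiting case in which the proposer is dropped altogether.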

## Key Contributions

1. **Code-as-Harness framework**: the LLM synthesizes its own harness, so the agent changes from an LLM wrapped in hand-written plumbing to an LLM wrapped in automatically generated code.
2. **Thompson Sampling tree search**: structured exploration of the space of candidate code harnesses.
3. **Three harness modes**: action-filter, action-verifier, and code-as-policy (no LLM calls at inference time).
4. **100% legal moves** across 145 TextArena games; Gemini-2.5-Flash with the synthesized harness outperforms Gemini-2.5-Pro.
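
For readers unfamiliar with Thompson sampling, the sketch below shows the basic Beta-Bernoulli version over a flat set of candidates. It is not the paper's tree search, whose details are not recorded in this note. The sketch assumes that each arm is a candidate harness and that a "success" is an evaluation episode with no illegal move. The names `thompson_select` and `run_bandit` and the simulated success rates are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def thompson_select(
    successes: Sequence[int], failures: Sequence[int], rng: random.Random
) -> int:
    """Draw one sample per arm from its Beta posterior and pick the largest."""
    samples = [
        rng.betavariate(s + 1, f + 1)  # Beta(1, 1) prior, i.e. uniform
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(
    true_rates: Sequence[float], rounds: int, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Simulate evaluating candidates whose success rates are `true_rates`.

    Assumption: in a real setting the Bernoulli draw below would be replaced
    by running a candidate harness in the environment for one episode.
    """
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        arm = thompson_select(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```

Because arms are chosen by posterior sampling, weak candidates are still tried occasionally early on, while most of the evaluation budget drifts toward the candidates that have succeeded most often.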