# Auditing Agent Harness Safety
**Authors:** Chengzhi Liu\*, Yichen Guo\*, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wenyue Hua, Sheng Liu, Sharon Li, Yuheng Bu, Xin Eric Wang
**Affiliations:** UC Santa Barbara, UC Berkeley, Stanford University, UW–Madison, Microsoft Research
**arXiv:** [2605.14271](https://arxiv.org/abs/2605.14271) (v2, May 2026)
**Venue:** cs.CL
**Project Page:** [harnessaudit.github.io](https://harnessaudit.github.io)

---
## Abstract
LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose **HarnessAudit**, a framework that audits full execution trajectories across **boundary compliance**, **execution fidelity**, and **system stability**, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce **HarnessAudit-Bench**, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
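
The gap between output-level and trajectory-level evaluation can be made concrete with a minimal Python sketch. This is not HarnessAudit's implementation: the `Step` record, the `ALLOWED_RESOURCES` policy, and the `audit_trajectory` function are hypothetical names introduced only for illustration, and the resource paths are made up.

```python
from dataclasses import dataclass


# Hypothetical step record; field names are assumptions, not the paper's schema.
@dataclass(frozen=True)
class Step:
    agent: str
    tool: str
    resource: str


# Hypothetical permission policy: resources each agent may touch.
ALLOWED_RESOURCES = {
    "planner": {"tickets/123"},
    "executor": {"tickets/123", "kb/public"},
}


def output_is_correct(final_answer: str, expected: str) -> bool:
    """Output-level check: looks only at the final answer."""
    return final_answer.strip() == expected.strip()


def audit_trajectory(steps: list[Step]) -> list[Step]:
    """Trajectory-level check: return every step whose resource lies
    outside the acting agent's allowed set."""
    return [
        s for s in steps
        if s.resource not in ALLOWED_RESOURCES.get(s.agent, set())
    ]


trajectory = [
    Step("planner", "read", "tickets/123"),
    Step("executor", "read", "hr/salaries"),  # outside the executor's scope
    Step("executor", "read", "kb/public"),
]
final_answer = "Ticket 123 resolved"

passed_output = output_is_correct(final_answer, "Ticket 123 resolved")
violations = audit_trajectory(trajectory)

print(passed_output)    # True
print(len(violations))  # 1
```

The run passes the output-level check while the trajectory still contains one out-of-scope read, which is the kind of mid-trajectory failure the abstract says final-output scoring cannot see.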
## Key Concepts
- [[agent-harness-safety]] — the core paradigm
- [[harnessaudit]] — the auditing framework
- [[boundary-compliance]] — L1: tool, resource, information-flow violations
- [[execution-fidelity]] — L2: action validity, checkpointed completion
- [[system-stability]] — L3: perturbation resilience
- [[trajectory-auditing]] — trajectory-level evidence collection
- [[multi-agent-safety]] — multi-agent coordination safety risks
- [[information-flow-control]] — inter-agent communication constraints
- [[resource-access-control]] — resource scope enforcement
- [[safety-adherence-rate]] — SAR scoring metric
- [[policy-constrained-execution]] — formal harness model
- [[execution-harness]] — harness as policy-constrained execution system
- [[hidden-audit-channel]] — agent-independent evidence recording
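
Two of the concepts above, inter-agent information-flow constraints and agent-independent evidence recording, can be illustrated together. The sketch below is an assumption-laden toy, not the paper's design: `ALLOWED_FLOWS`, `Message`, `AuditLog`, and `route` are hypothetical names, the sensitivity labels are invented, and `clean_fraction` is a simple stand-in ratio rather than the paper's SAR definition, which is not reproduced here.

```python
from dataclasses import dataclass, field

# Hypothetical flow policy: (sender, receiver) -> payload labels allowed on that channel.
ALLOWED_FLOWS = {
    ("planner", "executor"): {"task", "public"},
    ("executor", "reviewer"): {"public"},
}


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    label: str  # sensitivity label of the payload (assumed labelling scheme)


@dataclass
class AuditLog:
    """Written by the harness rather than by agents, so agents cannot
    edit or omit entries."""
    events: list = field(default_factory=list)

    def record(self, message: Message) -> bool:
        channel = (message.sender, message.receiver)
        allowed = message.label in ALLOWED_FLOWS.get(channel, set())
        self.events.append((message, allowed))
        return allowed

    def violations(self) -> list:
        return [m for m, ok in self.events if not ok]

    def clean_fraction(self) -> float:
        """Share of recorded messages that respected the policy (1.0 if none recorded)."""
        if not self.events:
            return 1.0
        return sum(ok for _, ok in self.events) / len(self.events)


def route(log: AuditLog, message: Message) -> None:
    """Harness-side routing hook: every message is recorded before delivery."""
    log.record(message)
    # Delivery to the receiving agent would happen here.


log = AuditLog()
route(log, Message("planner", "executor", "task"))
route(log, Message("executor", "reviewer", "private"))  # label not allowed on this channel
route(log, Message("executor", "reviewer", "public"))
route(log, Message("reviewer", "planner", "public"))    # channel not in the policy

print(len(log.violations()))  # 2
print(log.clean_fraction())   # 0.5
```

Recording in the routing hook, rather than asking agents to self-report, is the design choice that keeps the evidence independent of agent behavior in this toy setup.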