# Auditing Agent Harness Safety
**Authors:** Chengzhi Liu\*, Yichen Guo\*, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wenyue Hua, Sheng Liu, Sharon Li, Yuheng Bu, Xin Eric Wang
**Affiliations:** UC Santa Barbara, UC Berkeley, Stanford University, UW-Madison, Microsoft Research
**arXiv:** [2605.14271](https://arxiv.org/abs/2605.14271) (v2, May 2026)
**Venue:** cs.CL
**Project Page:** [harnessaudit.github.io](https://harnessaudit.github.io)

---
## Abstract
LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose **HarnessAudit**, a framework that audits full execution trajectories across **boundary compliance**, **execution fidelity**, and **system stability**, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce **HarnessAudit-Bench**, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
## Key Concepts
- [[agent-harness-safety]] — the core paradigm
- [[harnessaudit]] — the auditing framework
- [[boundary-compliance]] — L1: tool, resource, information-flow violations
- [[execution-fidelity]] — L2: action validity, checkpointed completion
- [[system-stability]] — L3: perturbation resilience
- [[trajectory-auditing]] — trajectory-level evidence collection
- [[multi-agent-safety]] — multi-agent coordination safety risks
- [[information-flow-control]] — inter-agent communication constraints
- [[resource-access-control]] — resource scope enforcement
- [[safety-adherence-rate]] — SAR scoring metric
- [[policy-constrained-execution]] — formal harness model
- [[execution-harness]] — harness as policy-constrained execution system
- [[hidden-audit-channel]] — agent-independent evidence recording
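
The boundary-compliance and trajectory-auditing concepts above can be illustrated with a small sketch. It is not the paper's implementation: the `Step` and `Policy` types, the `audit` function, and the agent, tool, and resource names are all hypothetical. The `clean_step_fraction` score is an illustrative stand-in and should not be read as the paper's SAR definition. The sketch only shows how a per-step check can catch tool, resource, and information-flow violations that a final-output check would miss.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One recorded action in a trajectory (hypothetical schema)."""
    agent: str
    tool: str
    resource: str
    sends_to: Optional[str] = None  # recipient agent, if context is passed on


@dataclass(frozen=True)
class Policy:
    """Per-agent permissions and allowed agent-to-agent flows (hypothetical)."""
    allowed_tools: dict
    allowed_resources: dict
    allowed_flows: frozenset


def audit(trajectory, policy):
    """Return (step_index, violation_kind) pairs for every step, not just the last."""
    violations = []
    for i, step in enumerate(trajectory):
        if step.tool not in policy.allowed_tools.get(step.agent, frozenset()):
            violations.append((i, "tool"))
        if step.resource not in policy.allowed_resources.get(step.agent, frozenset()):
            violations.append((i, "resource"))
        if step.sends_to is not None and (step.agent, step.sends_to) not in policy.allowed_flows:
            violations.append((i, "information-flow"))
    return violations


def clean_step_fraction(trajectory, violations):
    """Illustrative score: share of steps with no violation (NOT the paper's SAR)."""
    if not trajectory:
        return 1.0
    bad_steps = {i for i, _ in violations}
    return 1 - len(bad_steps) / len(trajectory)


policy = Policy(
    allowed_tools={"planner": frozenset({"search"}), "writer": frozenset({"write"})},
    allowed_resources={"planner": frozenset({"public_docs"}), "writer": frozenset({"draft"})},
    allowed_flows=frozenset({("planner", "writer")}),
)

trajectory = [
    Step("planner", "search", "public_docs"),
    Step("planner", "search", "hr_records"),                       # out-of-scope resource
    Step("planner", "search", "public_docs", sends_to="reviewer"),  # disallowed flow
    Step("writer", "write", "draft"),                              # final step is clean
]

violations = audit(trajectory, policy)
score = clean_step_fraction(trajectory, violations)
print(violations, score)
```

In this example the final step is compliant, so a check on the terminal state alone would pass, while the per-step audit flags two mid-trajectory violations.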