myWiki/concepts/agent-capability-stability-gap.md
2026-06-01 10:46:01 +08:00

---
title: "Agent Capability-Stability Gap (能力-稳定性差距)"
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: [agent, capability, stability, reliability]
sources: [raw/articles/claw-eval-2026.md]
confidence: medium
---
# Agent Capability-Stability Gap
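> There is a significant gap between what an agent "can do" (its capability ceiling) and what it "does reliably" (its reliability floor). This gap widens sharply once errors are injected.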
## Measurement
- **Capability** ← [[pass-at-k-vs-pass-k|Pass@k]]: at least one of k attempts succeeds
- **Stability** ← [[pass-at-k-vs-pass-k|Pass^k]]: all k attempts succeed
- **Gap** = Pass@k - Pass^k
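
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the Claw-Eval implementation: the function names, the input layout (one list of k boolean outcomes per task), and the sample outcomes are all assumptions made for the example.

```python
from typing import Sequence


def _check(trials: Sequence[Sequence[bool]]) -> None:
    """Reject empty input so the averages below are well defined."""
    if not trials or any(len(t) == 0 for t in trials):
        raise ValueError("need at least one task and one attempt per task")


def pass_at_k(trials: Sequence[Sequence[bool]]) -> float:
    """Capability: fraction of tasks with at least one success in k attempts."""
    _check(trials)
    return sum(any(t) for t in trials) / len(trials)


def pass_hat_k(trials: Sequence[Sequence[bool]]) -> float:
    """Stability: fraction of tasks where all k attempts succeed."""
    _check(trials)
    return sum(all(t) for t in trials) / len(trials)


def capability_stability_gap(trials: Sequence[Sequence[bool]]) -> float:
    """Gap = Pass@k - Pass^k, computed over the same set of attempts."""
    return pass_at_k(trials) - pass_hat_k(trials)


# Made-up outcomes for illustration only: 4 tasks, k = 3 attempts each.
trials = [
    [True, True, True],     # counts toward both Pass@3 and Pass^3
    [True, False, True],    # counts toward Pass@3 only
    [False, False, False],  # counts toward neither
    [False, True, False],   # counts toward Pass@3 only
]

print(pass_at_k(trials), pass_hat_k(trials), capability_stability_gap(trials))
```

Because "all attempts succeed" implies "at least one succeeds", the gap computed this way is never negative.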
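## Claw-Eval Experiments
- The gap already exists in a normal environment, with no errors injected
- After error injection the gap widens sharply: Pass^3 drops by up to 24 pp
- On multimodal tasks the highest Pass^3 is only 25.7%; the gap is large for every model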
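## Engineering Implications
Implications for deployment decisions:
- Narrow gap → the agent is suitable for production
- Wide gap → strengthen [[agent-robustness-evaluation]], improve error-recovery strategies, or adjust the [[harness-coupling-problem|harness configuration]]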
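
The narrow/wide rule above could be turned into a simple check, sketched below. The note does not say where "narrow" ends and "wide" begins, so the `narrow_threshold` default of 0.10, the function name, and the returned labels are hypothetical placeholders to be tuned per deployment.

```python
def deployment_hint(
    pass_at_k: float,
    pass_hat_k: float,
    narrow_threshold: float = 0.10,  # hypothetical cutoff, not from the source
) -> str:
    """Map a capability-stability gap to a coarse deployment suggestion."""
    if not (0.0 <= pass_hat_k <= pass_at_k <= 1.0):
        raise ValueError("expected 0 <= Pass^k <= Pass@k <= 1")
    gap = pass_at_k - pass_hat_k
    if gap <= narrow_threshold:
        return "production-candidate"
    # Wide gap: revisit robustness evaluation, error recovery, or harness config.
    return "needs-hardening"


print(deployment_hint(0.90, 0.85))
print(deployment_hint(0.75, 0.25))
```

A single threshold is the simplest possible policy; in practice the cutoff would likely depend on how costly a failed run is for the task at hand.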
## Related Concepts
- [[pass-at-k-vs-pass-k]]
- [[agent-robustness-evaluation]]
- [[binding-constraint-thesis]]: stability problems may stem from the harness rather than the model
- [[claw-eval]]