20260601
38
concepts/agent-capability-stability-gap.md
Normal file
@@ -0,0 +1,38 @@
---
title: "Agent Capability-Stability Gap"
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: [agent, capability, stability, reliability]
sources: [raw/articles/claw-eval-2026.md]
confidence: medium
---

# Agent Capability-Stability Gap
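
> There is a significant gap between what an agent *can do* (its capability ceiling) and what it *does reliably* (its reliability floor). This gap widens sharply after error injection.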

## Measurement

- **Capability** ← [[pass-at-k-vs-pass-k|Pass@k]]: at least one of k attempts succeeds
- **Stability** ← [[pass-at-k-vs-pass-k|Pass^k]]: all k attempts succeed
- **Gap** = Pass@k − Pass^k
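
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code of any benchmark: it assumes each task has been run at least k times, that results are stored as a list of booleans per task, and that only the first k outcomes are scored. The function names and the toy data are hypothetical.

```python
from typing import Dict, List


def _validate(trials: Dict[str, List[bool]], k: int) -> None:
    """Reject inputs for which the metrics below are undefined."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not trials:
        raise ValueError("trials must contain at least one task")
    for task, outcomes in trials.items():
        if len(outcomes) < k:
            raise ValueError(f"task {task!r} has fewer than {k} trials")


def pass_at_k(trials: Dict[str, List[bool]], k: int) -> float:
    """Capability: share of tasks with at least one success in the first k trials."""
    _validate(trials, k)
    solved = sum(any(outcomes[:k]) for outcomes in trials.values())
    return solved / len(trials)


def pass_hat_k(trials: Dict[str, List[bool]], k: int) -> float:
    """Stability: share of tasks whose first k trials all succeed."""
    _validate(trials, k)
    solved = sum(all(outcomes[:k]) for outcomes in trials.values())
    return solved / len(trials)


def capability_stability_gap(trials: Dict[str, List[bool]], k: int) -> float:
    """Gap = Pass@k - Pass^k, computed on the same set of trials."""
    return pass_at_k(trials, k) - pass_hat_k(trials, k)


# Invented toy outcomes (three trials per task), for illustration only.
toy_trials = {
    "task_a": [True, True, True],     # counts toward both metrics
    "task_b": [True, False, True],    # counts toward Pass@3 only
    "task_c": [False, False, False],  # counts toward neither
    "task_d": [True, True, True],     # counts toward both metrics
}

if __name__ == "__main__":
    print("Pass@3:", pass_at_k(toy_trials, 3))
    print("Pass^3:", pass_hat_k(toy_trials, 3))
    print("Gap:   ", capability_stability_gap(toy_trials, 3))
```

Tasks like `task_b`, which succeed on some attempts but not all, are exactly the ones that contribute to the gap: they raise Pass@k without raising Pass^k.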

## Claw-Eval Experiments

- A gap is already present under normal conditions
- After error injection the gap widens sharply (Pass^3 drops by as much as 24 pp)
- On multimodal tasks the highest Pass^3 is only 25.7%; the gap is large for every model
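
One way to build intuition for why Pass^k is the more fragile of the two metrics is a toy model in which every attempt succeeds independently with the same probability `p`. That independence assumption is ours, not something Claw-Eval states, and the numbers it produces are not Claw-Eval results. Under it, Pass@k = 1 − (1 − p)^k and Pass^k = p^k.

```python
def _check(p: float, k: int) -> None:
    """Validate the inputs of the toy model."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")


def pass_at_k_iid(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    _check(p, k)
    return 1.0 - (1.0 - p) ** k


def pass_hat_k_iid(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    _check(p, k)
    return p ** k


def gap_iid(p: float, k: int) -> float:
    """Capability-stability gap in the independent-attempts toy model."""
    return pass_at_k_iid(p, k) - pass_hat_k_iid(p, k)


if __name__ == "__main__":
    # Hypothetical per-attempt success rates, chosen only to show the trend.
    for p in (0.9, 0.8):
        print(
            f"p={p}: Pass@3={pass_at_k_iid(p, 3):.3f}, "
            f"Pass^3={pass_hat_k_iid(p, 3):.3f}, gap={gap_iid(p, 3):.3f}"
        )
```

In this model, lowering the per-attempt success rate from 0.9 to 0.8 barely moves Pass@3 (roughly 0.999 to 0.992) while Pass^3 falls from about 0.729 to about 0.512, so the gap grows from about 0.27 to about 0.48. The widening holds only while `p` stays above 0.5; below that both metrics collapse and the gap shrinks again.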

## Engineering Implications

Implications for deployment decisions:
- Narrow gap → the agent is suitable for production
- Wide gap → strengthen [[agent-robustness-evaluation]], improve the error-recovery strategy, or adjust the [[harness-coupling-problem|harness configuration]]
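
The rule above could be turned into a simple deployment gate, sketched below. The note does not say how narrow is narrow enough, so the `max_gap` threshold of 0.10 is a placeholder assumption, and `GapAssessment` and `assess_gap` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple

# Follow-up actions listed in the note for a wide gap.
WIDE_GAP_ACTIONS: Tuple[str, ...] = (
    "strengthen robustness evaluation",
    "improve the error-recovery strategy",
    "adjust the harness configuration",
)


@dataclass(frozen=True)
class GapAssessment:
    gap: float
    production_ready: bool
    next_steps: Tuple[str, ...]


def assess_gap(pass_at_k: float, pass_hat_k: float, max_gap: float = 0.10) -> GapAssessment:
    """Classify a measured gap as narrow or wide.

    max_gap is an assumed threshold, not a value taken from the note.
    """
    # Pass^k can never exceed Pass@k when both come from the same trials.
    if not 0.0 <= pass_hat_k <= pass_at_k <= 1.0:
        raise ValueError("expected 0 <= Pass^k <= Pass@k <= 1")
    gap = pass_at_k - pass_hat_k
    if gap <= max_gap:
        return GapAssessment(gap, True, ())
    return GapAssessment(gap, False, WIDE_GAP_ACTIONS)


if __name__ == "__main__":
    print(assess_gap(0.95, 0.90))  # narrow gap
    print(assess_gap(0.90, 0.50))  # wide gap
```

The sketch applies the rule literally. Whether the absolute Pass^k level is high enough for the use case is a separate question that it ignores.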

## Related Concepts

- [[pass-at-k-vs-pass-k]]
- [[agent-robustness-evaluation]]
- [[binding-constraint-thesis]]: stability problems may stem from the harness rather than the model
- [[claw-eval]]