20260601
38
concepts/pass-at-k-vs-pass-k.md
Normal file
@@ -0,0 +1,38 @@
---
title: "Pass@k vs Pass^k (capability ceiling vs reliability floor)"
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: [agent, evaluation, reliability, metric]
sources: [raw/articles/claw-eval-2026.md]
confidence: high
---

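# Pass@k vs Pass^k

> Evaluation metrics that separate capability from stability: Pass@k measures the capability ceiling, Pass^k measures the reliability floor. The gap between the two shows how unstable the agent is.
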
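## Definitions

- **Pass@k**: at least one of k attempts succeeds → approximates the **capability ceiling** (what the model is able to do)
- **Pass^k**: all k attempts succeed → approximates the **reliability floor** (what the model can do consistently)
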
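The two definitions can be sketched in a few lines of Python. This is a minimal illustration, not Claw-Eval's scoring code: the function names, the `results` layout (task id mapped to a list of k boolean outcomes), and the sample data are all assumptions made for the example.

```python
from typing import Dict, List


def pass_at_k(trials: List[bool]) -> bool:
    """Pass@k for one task: at least one of the k trials succeeded."""
    if not trials:
        raise ValueError("need at least one trial")
    return any(trials)


def pass_hat_k(trials: List[bool]) -> bool:
    """Pass^k for one task: every one of the k trials succeeded."""
    if not trials:
        raise ValueError("need at least one trial")
    return all(trials)


def benchmark_scores(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Average both metrics over tasks and report the gap between them."""
    n = len(results)
    at_k = sum(pass_at_k(t) for t in results.values()) / n
    hat_k = sum(pass_hat_k(t) for t in results.values()) / n
    return {"pass_at_k": at_k, "pass_hat_k": hat_k, "gap": at_k - hat_k}


# Hypothetical outcomes for k = 3 trials per task (illustrative only).
example = {
    "task_a": [True, True, True],     # counts for both metrics
    "task_b": [True, False, True],    # counts for Pass@3 only
    "task_c": [False, False, False],  # counts for neither
    "task_d": [False, True, False],   # counts for Pass@3 only
}
scores = benchmark_scores(example)
print(scores)
```

With this made-up data, three of the four tasks pass at least once (Pass@3 = 0.75) but only one passes every time (Pass^3 = 0.25), so the gap is 0.5.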
## Key findings from Claw-Eval

In the error-injection experiments (HTTP 429, HTTP 500, latency spikes):
- Pass@3 stayed relatively stable
- **Pass^3 dropped by up to 24 percentage points**

→ A single successful run does not show that the agent is stably usable.

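A simplified model helps explain why the two metrics react so differently. Assuming, purely for illustration, that each attempt succeeds independently with the same probability `p` (a simplification that real agents under injected errors need not satisfy), Pass@k becomes `1 - (1 - p)^k` and Pass^k becomes `p^k`. The probabilities below are invented and are not Claw-Eval measurements.

```python
def pass_at_k_iid(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k_iid(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k


# Hypothetical per-attempt success rates before and after error injection.
for p in (0.9, 0.8):
    print(
        f"p={p:.1f}  "
        f"Pass@3={pass_at_k_iid(p, 3):.3f}  "
        f"Pass^3={pass_hat_k_iid(p, 3):.3f}"
    )
```

Under this model, lowering `p` from 0.9 to 0.8 moves Pass@3 from 0.999 to 0.992 (under one percentage point), while Pass^3 falls from 0.729 to 0.512 (about 22 percentage points). A modest loss in per-attempt reliability is almost invisible in Pass@k but is compounded k times in Pass^k.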
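## Engineering implications

The **gap** between Pass@k and Pass^k is a key indicator of agent robustness:
- Small gap → the agent behaves consistently across runs; when Pass^k is also high, it is stable and reliable enough for production deployment
- Large gap → the agent's behavior fluctuates widely and needs improvement guided by [[agent-robustness-evaluation]] and [[agent-safety-evaluation]]
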
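When both metrics are computed on the same k trials per task, the gap is exactly the share of tasks that succeed at least once but not every time. The sketch below uses that observation to list the unstable tasks behind a gap. The labels `stable_pass`, `flaky`, and `stable_fail`, the function names, and the sample data are hypothetical choices for this example.

```python
from typing import Dict, List


def classify_task(trials: List[bool]) -> str:
    """Label a task by how consistent its k trial outcomes are."""
    if not trials:
        raise ValueError("need at least one trial")
    if all(trials):
        return "stable_pass"
    if any(trials):
        return "flaky"  # passes sometimes: counted by Pass@k but not Pass^k
    return "stable_fail"


def flaky_tasks(results: Dict[str, List[bool]]) -> List[str]:
    """Return the task ids that account for the Pass@k / Pass^k gap."""
    return sorted(t for t, trials in results.items() if classify_task(trials) == "flaky")


# Hypothetical outcomes for k = 3 trials per task (illustrative only).
runs = {
    "send_email": [True, True, True],
    "book_flight": [True, False, True],
    "refund_order": [False, False, False],
    "sync_calendar": [False, True, True],
}
unstable = flaky_tasks(runs)
gap = len(unstable) / len(runs)
print(unstable, gap)
```

Listing the flaky tasks, instead of reporting only the aggregate gap, points directly at the cases worth rerunning under error injection.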
## Related concepts

- [[agent-capability-stability-gap]]
- [[agent-robustness-evaluation]]
- [[claw-eval]]