20260601
concepts/agent-robustness-evaluation.md

---
title: "Agent Robustness Evaluation"
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: [agent, robustness, evaluation, fault-tolerance]
sources: [raw/articles/claw-eval-2026.md]
confidence: medium
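---

# Agent Robustness Evaluation

> Evaluates whether an agent can recover and keep executing when it encounters API failures, service latency, and transient errors. Robustness is the key dimension that separates "can do it" from "can do it reliably".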

## Robustness Testing in Claw-Eval

Claw-Eval uses **error injection** to simulate the instability of real production environments:

- HTTP 429 (rate limiting)
- HTTP 500 (server error)
- Latency spikes
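
The three fault types above can be sketched as a wrapper around a tool call. This is a minimal illustration, not Claw-Eval's actual harness: `FaultInjector`, `InjectedHTTPError`, the injection rates, and the latency value are all hypothetical names and numbers chosen for the example.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable


class InjectedHTTPError(Exception):
    """Simulated HTTP failure raised in place of a real tool response."""

    def __init__(self, status: int):
        super().__init__(f"injected HTTP {status}")
        self.status = status


@dataclass
class FaultInjector:
    # All rates and the latency value are illustrative assumptions.
    p_429: float = 0.1      # probability of a simulated rate limit
    p_500: float = 0.1      # probability of a simulated server error
    p_latency: float = 0.1  # probability of a simulated latency spike
    latency_s: float = 2.0
    seed: int = 0           # fixed seed so a run can be reproduced
    sleep: Callable[[float], None] = time.sleep

    def __post_init__(self) -> None:
        self._rng = random.Random(self.seed)

    def wrap(self, tool: Callable) -> Callable:
        """Return a version of `tool` that sometimes fails or stalls."""

        def wrapped(*args, **kwargs):
            roll = self._rng.random()
            if roll < self.p_429:
                raise InjectedHTTPError(429)
            if roll < self.p_429 + self.p_500:
                raise InjectedHTTPError(500)
            if roll < self.p_429 + self.p_500 + self.p_latency:
                self.sleep(self.latency_s)  # delay, then answer normally
            return tool(*args, **kwargs)

        return wrapped
```

Seeding the random generator keeps the injected fault sequence identical across runs, which makes it easier to compare agents under the same disturbance.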

## Key Findings

- Pass@3 remains relatively stable after error injection (the model "can still do it")
- Pass^3 drops by up to 24 percentage points (the model no longer "does it reliably")
- → [[agent-capability-stability-gap|Capability ≠ stability]]
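
Why the two metrics can diverge is easier to see in code. The sketch below assumes the usual definitions (Pass@k: at least one of k trials succeeds; Pass^k: all k trials succeed). The function names and the trial outcomes are invented for illustration and are not Claw-Eval data.

```python
from typing import Sequence


def pass_at_k(trials: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks solved in at least one of their k trials."""
    return sum(any(t) for t in trials) / len(trials)


def pass_hat_k(trials: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks solved in every one of their k trials."""
    return sum(all(t) for t in trials) / len(trials)


# Toy outcomes (4 tasks x 3 trials each), invented purely for illustration.
clean = [
    [True, True, True],
    [True, True, True],
    [True, True, False],
    [False, False, False],
]
injected = [
    [True, False, True],
    [True, True, True],
    [False, True, False],
    [False, False, False],
]

for name, runs in (("clean", clean), ("injected", injected)):
    print(f"{name}: Pass@3={pass_at_k(runs):.2f}  Pass^3={pass_hat_k(runs):.2f}")
```

In this toy data Pass@3 stays at 0.75 in both conditions while Pass^3 falls from 0.50 to 0.25: a single failed trial leaves Pass@3 unchanged as long as another trial succeeds, but it removes the task from Pass^3.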
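
## Dimensions of Robustness

- **Retry strategy**: whether the agent attempts to recover from transient failures
- **Degradation strategy**: whether the agent degrades gracefully when a failure is unrecoverable
- **Error awareness**: whether the agent can recognize an abnormal state and adjust its behavior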
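
As a rough illustration of how the three dimensions fit together, the sketch below wraps a single tool call. Everything in it is an assumption made for the example: `ToolError`, `Outcome`, and `call_with_recovery` are hypothetical names, and treating 429 and 500 as the retryable statuses is one possible policy, not something the source prescribes.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable

TRANSIENT_STATUSES = {429, 500}  # assumption: statuses considered retryable


class ToolError(Exception):
    def __init__(self, status: int):
        super().__init__(f"tool failed with HTTP {status}")
        self.status = status


@dataclass
class Outcome:
    value: Any
    degraded: bool                              # True if the fallback was used
    events: list = field(default_factory=list)  # (attempt, status, kind)


def call_with_recovery(
    tool: Callable[[], Any],
    fallback: Callable[[], Any],
    max_retries: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Outcome:
    events = []
    for attempt in range(max_retries + 1):
        try:
            return Outcome(tool(), degraded=False, events=events)
        except ToolError as err:
            # Error awareness: classify the failure and record it.
            transient = err.status in TRANSIENT_STATUSES
            kind = "transient" if transient else "permanent"
            events.append((attempt, err.status, kind))
            if not transient or attempt == max_retries:
                break
            # Retry strategy: exponential backoff before the next attempt.
            sleep(base_delay_s * 2 ** attempt)
    # Degradation strategy: return a reduced result instead of crashing.
    return Outcome(fallback(), degraded=True, events=events)
```

An evaluator could inspect `events` and `degraded` to score each dimension separately rather than looking only at the final task result.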
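
## Relationship to ETCLOVG

Robustness evaluation directly tests the failure modes of the [[execution-environment]] (E layer) sandbox, the recovery strategies of [[lifecycle-orchestration]] (L layer), and the quality of the failure signals from [[observability]] (O layer).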

## Related Concepts

- [[pass-at-k-vs-pass-k]]
- [[agent-capability-stability-gap]]
- [[agent-process-evaluation]]
- [[claw-eval]]