---
title: "Agent Robustness Evaluation"
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: [agent, robustness, evaluation, fault-tolerance]
sources: [raw/articles/claw-eval-2026.md]
confidence: medium
---

# Agent Robustness Evaluation
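
> Evaluates whether an agent can recover and keep executing when it faces API failures, service latency, and transient errors. Robustness is the key dimension that separates "can do it" from "can do it reliably".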

## Robustness testing in Claw-Eval

Claw-Eval uses **error injection** to simulate the instability of real production environments:
- HTTP 429 (rate limiting)
- HTTP 500 (server error)
- Latency spikes
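
The three fault types above can be sketched as a thin wrapper around a tool call. This is a minimal illustration, not Claw-Eval's actual harness: the `FaultInjector` class, its probabilities, and the dict-shaped response are all assumptions made for the example.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class FaultInjector:
    """Hypothetical wrapper that injects HTTP 429/500 responses or extra latency."""

    p_429: float = 0.1          # chance of a simulated rate-limit response
    p_500: float = 0.1          # chance of a simulated server error
    p_delay: float = 0.1        # chance of a latency spike before the real call
    delay_seconds: float = 2.0  # size of the injected latency spike
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    sleep: Callable[[float], None] = time.sleep  # injectable so tests need not wait

    def call(self, tool: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
        roll = self.rng.random()  # uniform in [0, 1)
        if roll < self.p_429:
            return {"status": 429, "body": "rate limited (injected)"}
        if roll < self.p_429 + self.p_500:
            return {"status": 500, "body": "server error (injected)"}
        if roll < self.p_429 + self.p_500 + self.p_delay:
            self.sleep(self.delay_seconds)  # latency spike, then the real call
        return tool()


def search_tool() -> Dict[str, Any]:
    # Stand-in for a real tool endpoint (assumption for illustration).
    return {"status": 200, "body": "ok"}


injector = FaultInjector(p_429=0.2, p_500=0.2, p_delay=0.2, sleep=lambda s: None)
statuses = [injector.call(search_tool)["status"] for _ in range(10)]
print(statuses)
```

Seeding the random generator keeps a run reproducible, which matters when the same task is repeated several times and the results are compared.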
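
## Key findings

- Pass@3 stays relatively stable after error injection (the model still "can do it")
- Pass^3 drops by up to 24 percentage points (but the model no longer "does it reliably")
- → [[agent-capability-stability-gap|Capability ≠ stability]]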
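
To make the gap between the two metrics concrete, here is a small sketch. It assumes the usual definitions, which this page does not spell out: Pass@k counts a task as solved if at least one of k trials succeeds, while Pass^k requires all k trials to succeed. The function names and the trial outcomes are invented toy data, not Claw-Eval results.

```python
from typing import Dict, List


def pass_at_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where at least one trial succeeded."""
    return sum(any(results) for results in trials.values()) / len(trials)


def pass_hat_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where every trial succeeded."""
    return sum(all(results) for results in trials.values()) / len(trials)


# Toy outcomes for 4 tasks x 3 trials each (illustrative only).
clean = {
    "t1": [True, True, True],
    "t2": [True, True, True],
    "t3": [True, True, False],
    "t4": [False, False, False],
}
injected = {
    "t1": [True, True, True],
    "t2": [True, False, True],   # one trial lost to an injected error
    "t3": [False, True, False],
    "t4": [False, False, False],
}

print(pass_at_k(clean), pass_hat_k(clean))        # 0.75 0.5
print(pass_at_k(injected), pass_hat_k(injected))  # 0.75 0.25
```

In this toy data, Pass@3 does not move because every solvable task still succeeds at least once, while Pass^3 falls because a single failed trial is enough to disqualify a task.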
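
## Dimensions of robustness

- **Retry strategy**: whether the agent attempts to recover from transient failures
- **Degradation strategy**: whether it degrades gracefully when a failure is unrecoverable
- **Error awareness**: whether it can recognize an abnormal state and adjust its behavior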
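
One way these three dimensions might show up in agent-side code is sketched below. It is an illustration under assumptions, not a prescribed design: `call_with_recovery`, the set of retryable statuses, and the `"degraded"` marker are hypothetical, and treating every 500 as transient is a simplification.

```python
import time
from typing import Any, Callable, Dict

# Assumption: these statuses are treated as transient and worth retrying.
RETRYABLE = {429, 500}


def call_with_recovery(
    tool: Callable[[], Dict[str, Any]],
    fallback: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    for attempt in range(max_attempts):
        response = tool()
        status = response.get("status")
        if status == 200:
            return response
        # Error awareness: a non-transient status will not be fixed by retrying.
        if status not in RETRYABLE:
            break
        # Retry strategy: exponential backoff between attempts.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Degradation strategy: return a clearly marked fallback instead of failing.
    return {"status": "degraded", "body": fallback()}


# Usage: a flaky tool that fails twice, then succeeds.
responses = iter([{"status": 429}, {"status": 500}, {"status": 200, "body": "ok"}])
result = call_with_recovery(lambda: next(responses), lambda: "cached answer",
                            sleep=lambda s: None)
print(result)  # {'status': 200, 'body': 'ok'}
```

Marking the fallback result explicitly lets downstream steps, and an evaluator, tell a degraded answer apart from a fully successful one.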
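
## Relationship to ETCLOVG

Robustness evaluation directly tests three layers: the failure modes of the [[execution-environment]] (E layer) sandbox, the recovery strategies of [[lifecycle-orchestration]] (L layer), and the quality of the failure signals from [[observability]] (O layer).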

## Related concepts

- [[pass-at-k-vs-pass-k]]
- [[agent-capability-stability-gap]]
- [[agent-process-evaluation]]
- [[claw-eval]]