---
title: "Agent Robustness Evaluation"
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: [agent, robustness, evaluation, fault-tolerance]
sources: [raw/articles/claw-eval-2026.md]
confidence: medium
---
# Agent Robustness Evaluation
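> Evaluates whether an agent can recover and keep executing when it hits API failures, service latency, or transient errors. Robustness is the key dimension that separates "can do it" from "can do it reliably".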
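## Claw-Eval's Robustness Tests
Claw-Eval uses **error injection** to simulate the instability of real production environments:
- HTTP 429 (rate limiting)
- HTTP 500 (server error)
- Latency spikes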
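
The three fault types above can be sketched as a wrapper around a tool call. This is a minimal illustration, not Claw-Eval's actual harness: `FaultConfig`, `Response`, `with_fault_injection`, and all probabilities and delays are hypothetical names and values chosen for the example.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FaultConfig:
    # Hypothetical defaults; the source does not state Claw-Eval's real rates.
    rate_limit_prob: float = 0.1    # chance of answering with HTTP 429
    server_error_prob: float = 0.1  # chance of answering with HTTP 500
    latency_prob: float = 0.1       # chance of delaying a successful call
    latency_seconds: float = 2.0    # size of the injected latency spike


@dataclass
class Response:
    status: int
    body: str


def with_fault_injection(
    tool: Callable[[str], Response],
    config: FaultConfig,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str], Response]:
    """Wrap `tool` so that some calls fail or slow down at random."""
    rng = rng or random.Random()

    def wrapped(request: str) -> Response:
        roll = rng.random()
        if roll < config.rate_limit_prob:
            return Response(429, "Too Many Requests")
        if roll < config.rate_limit_prob + config.server_error_prob:
            return Response(500, "Internal Server Error")
        # The call goes through, but possibly after a latency spike.
        if rng.random() < config.latency_prob:
            sleep(config.latency_seconds)
        return tool(request)

    return wrapped
```

Passing a seeded `random.Random` makes a fault schedule reproducible across runs, which matters when comparing agents under the same injected conditions.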
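## Key Findings
- Pass@3 stays relatively stable after error injection (the model still "can do it")
- Pass^3 drops by up to 24 percentage points (but it no longer "does it reliably")
- → [[agent-capability-stability-gap|Capability ≠ stability]]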
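
To see why the two metrics can diverge, here is a small sketch. It assumes Pass@k counts a task as solved when at least one of k trials succeeds and Pass^k only when all k trials succeed (see [[pass-at-k-vs-pass-k]]). The trial outcomes are made up for illustration and are not Claw-Eval results.

```python
from typing import Sequence


def pass_at_k(trials: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where at least one trial succeeded."""
    return sum(any(task) for task in trials) / len(trials)


def pass_hat_k(trials: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where every trial succeeded."""
    return sum(all(task) for task in trials) / len(trials)


# Made-up outcomes: 4 tasks, 3 trials each.
clean = [
    [True, True, True],
    [True, True, True],
    [True, True, False],
    [False, False, False],
]
# Same tasks with injected faults: some trials now fail intermittently.
faulty = [
    [True, True, True],
    [True, False, True],
    [False, True, False],
    [False, False, False],
]

print(pass_at_k(clean), pass_at_k(faulty))    # 0.75 0.75
print(pass_hat_k(clean), pass_hat_k(faulty))  # 0.5 0.25
```

In this toy data an occasional failed trial leaves Pass@3 unchanged, because one success is enough, while Pass^3 falls, because a single failure disqualifies the task.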
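## Dimensions of Robustness
- **Retry strategy**: whether the agent attempts to recover from transient failures
- **Degradation strategy**: whether the agent degrades gracefully when recovery is impossible
- **Error awareness**: whether the agent can recognize abnormal states and adjust its behavior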
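
The three dimensions above can be sketched in a single helper. This is an illustrative assumption about what robust agent-side handling might look like, not a description of any evaluated agent: `call_with_recovery`, the `RETRYABLE` set, and the backoff parameters are all hypothetical.

```python
import time
from typing import Callable, Optional, Tuple

# Assumption: 429 and 500 are treated as transient and worth retrying.
RETRYABLE = {429, 500}


def call_with_recovery(
    call: Callable[[], Tuple[int, str]],
    fallback: Optional[Callable[[], str]] = None,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry transient failures with backoff, then degrade if possible."""
    errors = []
    for attempt in range(max_attempts):
        status, body = call()
        if status == 200:
            return {"ok": True, "degraded": False, "result": body, "errors": errors}
        # Error awareness: record what went wrong instead of ignoring it.
        errors.append(status)
        if status not in RETRYABLE:
            break  # retrying a non-transient error will not help
        if attempt < max_attempts - 1:
            # Retry strategy: exponential backoff between attempts.
            sleep(base_delay * 2 ** attempt)
    if fallback is not None:
        # Degradation strategy: return a reduced result, flagged as such.
        return {"ok": True, "degraded": True, "result": fallback(), "errors": errors}
    return {"ok": False, "degraded": False, "result": None, "errors": errors}
```

Returning the `errors` list and the `degraded` flag alongside the result keeps the failure visible to whatever evaluates the run, rather than hiding it behind a successful-looking answer.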
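## Relationship to ETCLOVG
Robustness evaluation directly tests the failure modes of the [[execution-environment]] (E layer) sandbox, the recovery strategies of [[lifecycle-orchestration]] (L layer), and the quality of the failure signals from [[observability]] (O layer).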
## Related Concepts
- [[pass-at-k-vs-pass-k]]
- [[agent-capability-stability-gap]]
- [[agent-process-evaluation]]
- [[claw-eval]]