20260601

2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions
--- a/concepts/agent-robustness-evaluation.md
+++ b/concepts/agent-robustness-evaluation.md
@@ -0,0 +1,43 @@
+---
+title: "Agent Robustness Evaluation"
+created: 2026-05-23
+updated: 2026-05-23
+type: concept
+tags: [agent, robustness, evaluation, fault-tolerance]
+sources: [raw/articles/claw-eval-2026.md]
+confidence: medium
+---
+
+# Agent Robustness Evaluation
+
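+> Evaluates whether an agent can recover and keep executing when it hits API failures, service latency, or transient errors. Robustness is the key dimension that separates "can do it" from "can do it reliably."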
+
+## Robustness Testing in Claw-Eval
+
+**Error injection** simulates the instability of a real production environment:
+- HTTP 429 (rate limiting)
+- HTTP 500 (server error)
+- Latency spikes
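The three fault types above can be sketched as a wrapper around a tool handler. This is a minimal illustration, not Claw-Eval's actual harness: `FaultInjector`, `Response`, `echo`, the probabilities, and the 5-second spike are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Response:
    status: int            # HTTP-style status code
    body: str
    delay_s: float = 0.0   # simulated latency (recorded, never actually slept)


class FaultInjector:
    """Wraps a handler and replaces a fraction of its responses with faults."""

    def __init__(self, handler, p_429=0.1, p_500=0.1, p_slow=0.1, seed=0):
        self.handler = handler
        self.p_429, self.p_500, self.p_slow = p_429, p_500, p_slow
        self.rng = random.Random(seed)  # seeded so a run is reproducible

    def call(self, request: str) -> Response:
        roll = self.rng.random()  # uniform in [0, 1)
        if roll < self.p_429:
            return Response(429, "rate limited")
        if roll < self.p_429 + self.p_500:
            return Response(500, "internal server error")
        resp = self.handler(request)
        if roll < self.p_429 + self.p_500 + self.p_slow:
            resp.delay_s = 5.0  # latency spike on an otherwise valid response
        return resp


def echo(request: str) -> Response:
    """Stand-in for a real tool or API backend."""
    return Response(200, f"ok: {request}")


injector = FaultInjector(echo, p_429=0.2, p_500=0.2, p_slow=0.2, seed=42)
statuses = [injector.call("ping").status for _ in range(1000)]
```

Seeding the generator keeps the fault sequence identical across runs, so a clean run and a faulted run of the same task stay comparable.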
+
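+## Key Findings
+
+- Pass@3 stays relatively stable under error injection (the model "can still do it")
+- Pass^3 drops by up to 24 percentage points (but it no longer "does it reliably")
+- → [[agent-capability-stability-gap|Capability ≠ stability]]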
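To see why the two metrics can diverge, here is a small sketch that assumes the common definitions: Pass@k counts a task as solved if at least one of k trials succeeds, and Pass^k only if all k trials succeed. The trial outcomes are made-up illustrative data, not Claw-Eval results, and `summarize` is a hypothetical helper.

```python
def summarize(results):
    """results maps task id -> list of k booleans (one per trial).

    Returns (pass_at_k, pass_hat_k) as fractions of tasks.
    """
    n = len(results)
    pass_at_k = sum(any(trials) for trials in results.values()) / n
    pass_hat_k = sum(all(trials) for trials in results.values()) / n
    return pass_at_k, pass_hat_k


# Hypothetical outcomes for 4 tasks, 3 trials each.
clean = {
    "t1": [True, True, True],
    "t2": [True, True, True],
    "t3": [True, True, True],
    "t4": [False, False, False],
}
faulted = {
    "t1": [True, True, True],
    "t2": [True, False, True],   # one trial lost to an injected fault
    "t3": [False, True, True],   # one trial lost to an injected fault
    "t4": [False, False, False],
}

clean_scores = summarize(clean)      # (0.75, 0.75)
faulted_scores = summarize(faulted)  # (0.75, 0.25)
```

In this toy data a single failed trial per task leaves Pass@3 unchanged but removes the task from Pass^3, which is the shape of the gap described above.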
+
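+## Dimensions of Robustness
+
+- **Retry strategy**: whether the agent attempts to recover from transient failures
+- **Degradation strategy**: whether it degrades gracefully when a failure is unrecoverable
+- **Error awareness**: whether it can recognize abnormal states and adjust its behavior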
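The three dimensions can be illustrated together in one small sketch. It assumes that 429 and 500 are the transient statuses worth retrying and that a caller-supplied `fallback` value is an acceptable degraded result; `call_with_recovery` and its parameters are hypothetical, and `max_attempts` is assumed to be at least 1.

```python
import time

RETRYABLE = {429, 500}  # assumption: these statuses are treated as transient


def call_with_recovery(call, max_attempts=3, base_delay_s=0.5,
                       sleep=time.sleep, fallback=None):
    """call() returns (status, body). Retry transient errors, else degrade."""
    for attempt in range(max_attempts):
        status, body = call()
        if status == 200:
            return {"ok": True, "value": body, "attempts": attempt + 1}
        if status not in RETRYABLE:
            break  # error awareness: not transient, retrying will not help
        if attempt < max_attempts - 1:
            sleep(base_delay_s * 2 ** attempt)  # retry with exponential backoff
    # Graceful degradation: return the fallback and report what went wrong.
    return {"ok": False, "value": fallback,
            "attempts": attempt + 1, "last_status": status}


# Usage: two transient failures, then success. Sleeps are recorded, not run.
delays = []
responses = iter([(429, ""), (500, ""), (200, "data")])
result = call_with_recovery(lambda: next(responses), sleep=delays.append)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; an evaluation harness could use the same hook to log how long an agent backs off.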
+
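+## Relationship to ETCLOVG
+
+Robustness evaluation directly tests the sandbox failure modes of [[execution-environment]] (the E layer), the recovery strategies of [[lifecycle-orchestration]] (the L layer), and the quality of the failure signals from [[observability]] (the O layer).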
+
+## Related Concepts
+
+- [[pass-at-k-vs-pass-k]]
+- [[agent-capability-stability-gap]]
+- [[agent-process-evaluation]]
+- [[claw-eval]]