20260601
37
concepts/agent-process-evaluation.md
Normal file
@@ -0,0 +1,37 @@
---
title: "Agent Process Evaluation"
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: [agent, evaluation, process, trace]
sources: [raw/articles/claw-eval-2026.md]
confidence: high
---
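
# Agent Process Evaluation

> Evaluate not only an agent's final output but also its entire execution process: whether intermediate steps are reasonable, whether tool calls are correct, and whether constraints are respected.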
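
## Why the final answer alone is not enough

- An agent can produce a plausible-looking result while skipping critical steps during execution.
- Claw-Eval experiment: an ordinary LLM judge, even when shown the complete conversation transcript, still **missed 44% of safety violations** and **13% of robustness issues**.
- Catching these violations requires combining **server-side logs** with **environment snapshots**.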
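
A minimal sketch of why transcript-only judging can miss violations. It compares what the agent reported against what a server-side log recorded and what changed between two environment snapshots. The `LogEntry` schema, the snapshot format (path mapped to a content hash), and both function names are illustrative assumptions, not part of Claw-Eval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """One server-side record of an action the agent performed (hypothetical schema)."""
    action: str
    target: str


def unreported_actions(transcript_actions, server_log):
    """Return logged actions that never appear in the conversation transcript."""
    reported = set(transcript_actions)
    return [e for e in server_log if (e.action, e.target) not in reported]


def changed_protected_paths(before, after, protected):
    """Return protected paths whose content differs between two snapshots."""
    return sorted(p for p in protected if before.get(p) != after.get(p))


# The transcript mentions only a read; the server log also shows a delete.
transcript_actions = [("read", "/data/report.csv")]
server_log = [
    LogEntry("read", "/data/report.csv"),
    LogEntry("delete", "/etc/app.conf"),
]
before = {"/etc/app.conf": "hash-a", "/data/report.csv": "hash-b"}
after = {"/data/report.csv": "hash-b"}

hidden = unreported_actions(transcript_actions, server_log)
changed = changed_protected_paths(before, after, protected={"/etc/app.conf"})
```

Here a judge that reads only the transcript sees a harmless read, while the log and the snapshot diff both expose the deletion of a protected file.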
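
## Key elements of process evaluation

- **Tool-call audit**: whether each tool call matches what is expected at that step.
- **Constraint adherence**: whether behavior stays within safety boundaries and task constraints.
- **Error recovery**: whether the agent attempts to recover after an exception occurs.
- **Trajectory completeness**: the full lifecycle, Setup → Execution → Judge, is recorded.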
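
The four elements above can be sketched as checks over a recorded trace. This is an illustration under stated assumptions: the `Step` record, the `evaluate_process` function, and the rule that an error counts as recovered when a later execution step succeeds are all hypothetical simplifications, not a description of how any particular benchmark implements them.

```python
from dataclasses import dataclass
from typing import Optional

REQUIRED_PHASES = ("setup", "execution", "judge")


@dataclass
class Step:
    """One recorded step of an agent run (hypothetical schema)."""
    phase: str                  # "setup", "execution", or "judge"
    tool: Optional[str] = None  # name of the tool called, if any
    error: bool = False         # True if this step raised an exception


def evaluate_process(trace, expected_tools, forbidden_tools):
    tools = [s.tool for s in trace if s.tool is not None]

    # Tool-call audit: calls that fall outside the expected set.
    unexpected = [t for t in tools if t not in expected_tools]

    # Constraint adherence: calls that cross a safety boundary.
    violations = [t for t in tools if t in forbidden_tools]

    # Error recovery: an error is unrecovered if no later
    # execution step completes without error.
    unrecovered = [
        i for i, s in enumerate(trace)
        if s.error and not any(
            n.phase == "execution" and not n.error for n in trace[i + 1:]
        )
    ]

    # Trajectory completeness: phases appear once each, in lifecycle order.
    phases = []
    for s in trace:
        if not phases or phases[-1] != s.phase:
            phases.append(s.phase)

    return {
        "unexpected_calls": unexpected,
        "constraint_violations": violations,
        "unrecovered_errors": unrecovered,
        "trajectory_complete": tuple(phases) == REQUIRED_PHASES,
    }


trace = [
    Step("setup"),
    Step("execution", tool="search"),
    Step("execution", tool="fetch", error=True),
    Step("execution", tool="fetch"),   # retry after the failure
    Step("execution", tool="shell"),   # not expected, and forbidden
    Step("judge"),
]
report = evaluate_process(
    trace, expected_tools={"search", "fetch"}, forbidden_tools={"shell"}
)
```

Returning a per-element report rather than a single score keeps each failure mode visible, which matches the idea of treating the trace itself as the thing being evaluated.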

## Relationship to Trace-Native Evaluation

Process evaluation is a concrete practice of [[trace-native-evaluation]]: the agent's complete execution trace, rather than the final score, is the primary object of evaluation.

## Related concepts

- [[agent-evaluation-paradigm-shift]]
- [[agent-safety-evaluation]]
- [[agent-robustness-evaluation]]
- [[claw-eval]]