20260601
51
concepts/agent-eval-grader.md
Normal file
@@ -0,0 +1,51 @@
---
title: "Agent Eval Grader"
created: 2026-05-26
type: concept
tags: ["agent-evaluation", "scoring", "grader"]
sources: ["mini-agent-harness"]
---

# Agent Eval Grader
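
> The scoring module in agent evaluation: it judges the outcome of a task run using rules or test scripts.

## Definition

The Grader is the final judging module of the [[agent-harness-mini|mini harness]]. It takes the [[agent-eval-trace|trace]] and the final answer as input and outputs a structured grading result (success/fail + reason).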
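
To make the input/output contract concrete, here is a minimal sketch of what such an interface could look like. The names `TraceStep`, `GradeResult`, and `grade` are hypothetical and are not taken from the mini harness; the grading logic is a deliberately trivial placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class TraceStep:
    """One recorded tool call (hypothetical shape, for illustration only)."""
    tool: str                                  # e.g. "read_file"
    args: dict = field(default_factory=dict)   # arguments passed to the tool


@dataclass
class GradeResult:
    """Structured verdict: success/fail plus a human-readable reason."""
    success: bool
    reason: str


def grade(trace: list[TraceStep], final_answer: str) -> GradeResult:
    """Placeholder grader: fail on an empty answer, pass otherwise."""
    if not final_answer.strip():
        return GradeResult(False, "final answer is empty")
    return GradeResult(True, f"answer produced after {len(trace)} step(s)")


result = grade([TraceStep("read_file", {"path": "README.md"})], "Done.")
empty = grade([], "   ")
```

Returning a small structured object rather than a bare boolean keeps the reason attached to the verdict, so a failed case can be diagnosed without re-running it.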

## Evolution of Scoring Strategies

### Level 1: Rule Matching (recommended here)
```json
{
  "must_read": ["README.md"],
  "answer_should_include": "cannot confirm plugin system support",
  "answer_should_not_include": "supports a plugin system"
}
```
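
A rule set like the one above could be checked with a few lines of Python. This is only a sketch under stated assumptions: the function name `grade_by_rules` is hypothetical, the list of files the agent read is assumed to have already been extracted from the trace, the phrases are illustrative English values, and matching is plain substring matching.

```python
def grade_by_rules(rules: dict, files_read: list[str], answer: str) -> tuple[bool, str]:
    """Return (success, reason); the first violated rule decides the reason."""
    # Every file listed in must_read has to appear among the files the agent read.
    for path in rules.get("must_read", []):
        if path not in files_read:
            return False, f"required file was not read: {path}"

    # Required phrase: plain substring match on the final answer.
    required = rules.get("answer_should_include")
    if required and required not in answer:
        return False, f"answer is missing required phrase: {required!r}"

    # Forbidden phrase: also a plain substring match.
    forbidden = rules.get("answer_should_not_include")
    if forbidden and forbidden in answer:
        return False, f"answer contains forbidden phrase: {forbidden!r}"

    return True, "all rules satisfied"


# Illustrative rule values (assumed for this sketch).
rules = {
    "must_read": ["README.md"],
    "answer_should_include": "cannot confirm plugin system support",
    "answer_should_not_include": "supports a plugin system",
}
good = grade_by_rules(rules, ["README.md"], "From the README I cannot confirm plugin system support.")
skipped = grade_by_rules(rules, [], "From the README I cannot confirm plugin system support.")
hallucinated = grade_by_rules(
    rules, ["README.md"],
    "I cannot confirm plugin system support, but it supports a plugin system.",
)
```

Because the checks are substring matches, the forbidden phrase should not be contained in the required phrase. If it is, every answer that satisfies the include rule also trips the exclude rule, and the case can never pass.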

### Level 2: Test Scripts
```bash
# Run the tests to verify that the Agent's code changes pass
pytest tests/
```
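
A harness could turn the test run into a grading result by looking at the process exit code; pytest exits with 0 when all collected tests pass and with a non-zero code otherwise. The sketch below assumes a hypothetical helper `grade_by_command` and uses a stand-in command so that it runs without a test suite; in practice the command would be something like `["pytest", "tests/"]`.

```python
import subprocess
import sys


def grade_by_command(cmd, cwd=None, timeout=300.0):
    """Run a test command and map its exit code to (success, reason)."""
    try:
        proc = subprocess.run(
            cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False, f"command timed out after {timeout}s"
    except FileNotFoundError:
        return False, f"command not found: {cmd[0]}"

    if proc.returncode == 0:
        return True, "command exited with code 0"

    # Keep only the tail of the output so the reason stays readable.
    tail = (proc.stdout + proc.stderr).strip().splitlines()[-5:]
    return False, f"exit code {proc.returncode}: " + " | ".join(tail)


# Stand-in commands; a real harness would pass ["pytest", "tests/"] instead.
passing = grade_by_command([sys.executable, "-c", "raise SystemExit(0)"])
failing = grade_by_command([sys.executable, "-c", "raise SystemExit(3)"])
```

Keeping the last few output lines in the reason is a design choice: it gives a reader a hint about which test failed without storing the full log in the grading result.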

### Level 3: LLM-as-Judge
Use an LLM to evaluate complex outputs that rules or tests cannot capture (be mindful of evaluator bias).
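
One way to keep an LLM judge compatible with the success/fail + reason output is to ask it for a small JSON verdict and parse that strictly. The sketch below is an assumption-laden illustration: `grade_with_judge`, `JUDGE_PROMPT`, and the `judge` callable are all hypothetical, and the model call is injected as a plain function so that no particular LLM API is implied.

```python
import json
from typing import Callable

# Hypothetical prompt template; doubled braces are literal braces for str.format.
JUDGE_PROMPT = """You are grading an agent's answer.
Task: {task}
Answer: {answer}
Reply with JSON only: {{"success": true or false, "reason": "<one sentence>"}}"""


def grade_with_judge(task: str, answer: str, judge: Callable[[str], str]) -> tuple[bool, str]:
    """Ask the judge for a verdict and parse it into (success, reason)."""
    raw = judge(JUDGE_PROMPT.format(task=task, answer=answer))
    try:
        verdict = json.loads(raw)
        return bool(verdict["success"]), str(verdict["reason"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # An unparseable verdict counts as a failure; keep the raw reply for debugging.
        return False, f"judge reply could not be parsed: {raw[:80]!r}"


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the parsing path."""
    return '{"success": false, "reason": "the answer asserts an unverified feature"}'


verdict = grade_with_judge("Does the project support plugins?", "Yes.", fake_judge)
broken = grade_with_judge("task", "answer", lambda prompt: "I think it is fine")
```

Treating an unparseable reply as a failure is a conservative choice. Evaluator bias itself is not addressed by this sketch; it only ensures the judge's output fits the same structure as the lower levels.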

### Level 4: Multi-Dimensional Scoring
Task completion + tool-use efficiency + step redundancy + hallucination detection
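
The four dimensions could be combined into a single number with a weighted sum. The following is a sketch, not a prescribed formula: the weights, the field names, and the `step_economy` helper are hypothetical, and each dimension is assumed to be normalised to [0, 1] with 1 as the best value (so step redundancy and hallucination are expressed as their "good" counterparts).

```python
from dataclasses import dataclass


@dataclass
class DimensionScores:
    """Each field is normalised to [0, 1], where 1 is best."""
    task_completion: float
    tool_efficiency: float
    step_economy: float    # 1.0 means no redundant steps
    faithfulness: float    # 1.0 means no hallucination detected


# Hypothetical weights; they sum to 1 so the total stays in [0, 1].
WEIGHTS = {
    "task_completion": 0.5,
    "tool_efficiency": 0.2,
    "step_economy": 0.1,
    "faithfulness": 0.2,
}


def step_economy(total_steps: int, necessary_steps: int) -> float:
    """Share of steps that were necessary; 1.0 when nothing was redundant."""
    if total_steps <= 0:
        return 0.0
    return min(1.0, necessary_steps / total_steps)


def overall_score(scores: DimensionScores, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the per-dimension scores."""
    return sum(getattr(scores, name) * w for name, w in weights.items())


scores = DimensionScores(
    task_completion=1.0,
    tool_efficiency=0.5,
    step_economy=step_economy(total_steps=8, necessary_steps=4),
    faithfulness=1.0,
)
total = overall_score(scores)  # about 0.85 with the weights above
```

Keeping the per-dimension values alongside the total preserves explainability: a low total can be traced back to the dimension that caused it.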
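
## Design Principles

- **Checkability**: scoring rules must be explicit and executable.
- **Explainability**: every failure must come with a reason.
- **Progressive complexity**: start with rules and upgrade only as needed.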

## Related Pages

- [[agent-eval-trace]]: the Grader's input data source
- [[agent-eval-case-design]]: evaluation cases that include grader configuration
- [[agent-harness-mini]]: the complete harness that contains the grader module
- [[agent-evaluation-paradigm-shift]]: the overall shift in the evaluation paradigm