myWiki/concepts/agent-eval-grader.md
2026-06-01 10:46:01 +08:00

---
title: "Agent Eval Grader"
created: 2026-05-26
type: concept
tags: ["agent-evaluation", "scoring", "grader"]
sources: ["mini-agent-harness"]
---
# Agent Eval Grader
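> The scoring module in agent evaluation: it judges the outcome of a task using rules or test scripts.
## Definition
The grader is the final judgment module of the [[agent-harness-mini|mini harness]]. It takes the [[agent-eval-trace|trace]] and the final answer as input and outputs a structured grading result (success/fail plus a reason).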
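The input/output contract described above can be sketched in a few lines. This is a minimal illustration, not the harness's actual interface: the names `GradeResult` and `grade`, and the trace shape (one dict per tool call), are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class GradeResult:
    """Structured verdict: pass/fail plus a human-readable reason."""
    success: bool
    reason: str


def grade(trace: list[dict[str, Any]], final_answer: str) -> GradeResult:
    """Toy grader: fail on an empty answer or an empty trace, otherwise pass."""
    if not final_answer.strip():
        return GradeResult(False, "final answer is empty")
    if not trace:
        return GradeResult(False, "trace is empty: the agent recorded no steps")
    return GradeResult(True, f"non-empty answer after {len(trace)} recorded step(s)")


# Hypothetical trace format: one dict per tool call.
example_trace = [{"tool": "read_file", "path": "README.md"}]
print(grade(example_trace, "The README does not mention plugins."))
```

Returning a reason on both the pass and the fail path keeps every verdict explainable, whichever grading level produces it.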
## Evolution of Grading Strategies
### Level 1: Rule matching (recommended in this article)
```json
{
  "must_read": ["README.md"],
  "answer_should_include": "cannot confirm plugin system support",
  "answer_should_not_include": "supports a plugin system"
}
```
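A rule config like the one above could be applied with plain substring checks, roughly as follows. This is a sketch under assumptions: the function name `grade_by_rules` is hypothetical, and the trace is assumed to record file reads as steps with `"tool": "read_file"` and a `"path"` key. The example phrases are illustrative only.

```python
from typing import Any


def grade_by_rules(
    rules: dict[str, Any], trace: list[dict[str, Any]], answer: str
) -> dict[str, Any]:
    """Check the three rule keys in order; return the first failure with a reason."""
    # Assumed trace shape: file reads are steps with tool == "read_file" and a "path".
    files_read = {step.get("path") for step in trace if step.get("tool") == "read_file"}
    for required in rules.get("must_read", []):
        if required not in files_read:
            return {"success": False, "reason": f"required file not read: {required}"}

    must_have = rules.get("answer_should_include")
    if must_have and must_have not in answer:
        return {"success": False, "reason": f"answer is missing: {must_have!r}"}

    must_not = rules.get("answer_should_not_include")
    if must_not and must_not in answer:
        return {"success": False, "reason": f"answer contains forbidden text: {must_not!r}"}

    return {"success": True, "reason": "all rules satisfied"}


# Illustrative rules and trace (hypothetical values).
rules = {
    "must_read": ["README.md"],
    "answer_should_include": "cannot confirm plugin support",
    "answer_should_not_include": "supports plugins",
}
trace = [{"tool": "read_file", "path": "README.md"}]
print(grade_by_rules(rules, trace, "I cannot confirm plugin support from the README."))
```

With substring matching, the forbidden phrase must not itself be contained in the required phrase. If it is, every answer that satisfies the inclusion rule also trips the exclusion rule, and no answer can pass.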
### Level 2: Test scripts
```bash
# Run the tests to verify that the agent's code changes pass
pytest tests/
```
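To turn a test run into the same success/fail plus reason shape, a grader could wrap the command in a subprocess and read its exit code. The sketch below assumes this approach; `grade_by_command` and `PYTEST_CMD` are hypothetical names, and the five-minute timeout is an arbitrary default.

```python
import subprocess
import sys


def grade_by_command(cmd: list[str], timeout_s: int = 300) -> dict:
    """Run a test command and map its exit code to success/fail plus a reason."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"success": False, "reason": f"test command timed out after {timeout_s}s"}
    if proc.returncode == 0:
        return {"success": True, "reason": "test command exited with code 0"}
    # Keep the last few output lines so the failure reason stays readable.
    output = (proc.stdout + proc.stderr).strip().splitlines()
    tail = "\n".join(output[-5:])
    return {"success": False, "reason": f"exit code {proc.returncode}\n{tail}"}


# The pytest command from this section, run with the current interpreter.
PYTEST_CMD = [sys.executable, "-m", "pytest", "tests/"]
```

Calling `grade_by_command(PYTEST_CMD)` grades the run. pytest exits with 0 only when all collected tests pass, with 1 when some tests fail, and with 5 when no tests are collected, so an empty `tests/` directory is graded as a failure rather than a silent pass.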
### Level 3: LLM-as-Judge
Use an LLM to evaluate complex outputs (watch for evaluator bias).
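One way an LLM judge could plug into the same result shape is sketched below. Everything here is an assumption for illustration: the prompt wording, the `grade_by_llm` name, and the `call_llm` parameter, which stands in for whatever model client is in use (no real API is called). Asking for a fixed rubric and a stated reason does not remove evaluator bias, but it makes biased verdicts easier to spot on review.

```python
import json
from typing import Callable

# Hypothetical prompt template; doubled braces are literal braces for str.format.
JUDGE_PROMPT = """You are grading an agent's answer.
Task: {task}
Rubric: {rubric}
Answer: {answer}
Reply with JSON only: {{"success": true or false, "reason": "<one sentence>"}}"""


def grade_by_llm(task: str, rubric: str, answer: str, call_llm: Callable[[str], str]) -> dict:
    """Ask a judge model for a verdict; fail closed if the reply is unusable."""
    reply = call_llm(JUDGE_PROMPT.format(task=task, rubric=rubric, answer=answer))
    try:
        verdict = json.loads(reply)
    except json.JSONDecodeError:
        return {"success": False, "reason": "judge reply was not valid JSON"}
    if not isinstance(verdict, dict) or not isinstance(verdict.get("success"), bool):
        return {"success": False, "reason": "judge reply lacked a boolean 'success' field"}
    return {"success": verdict["success"], "reason": str(verdict.get("reason", "no reason given"))}


def fake_judge(prompt: str) -> str:
    """Stub standing in for a real model call, so the sketch runs offline."""
    return '{"success": false, "reason": "The answer claims plugin support the README does not state."}'


print(grade_by_llm("Does the project support plugins?", "Only claim what the README states.",
                   "Yes, it supports plugins.", fake_judge))
```

Failing closed on malformed replies means a judge that drifts from the output format shows up as explicit failures with a reason, not as silently skipped cases.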
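### Level 4: Multi-dimensional scoring
Combine several dimensions: task completion, tool-use efficiency, step redundancy, and hallucination detection.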
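The four dimensions could be combined as a weighted average of per-dimension scores normalized to [0, 1]. The sketch below is one possible scheme, not a prescribed one: the weights, the field names, the "higher is better" orientation (redundancy expressed as `step_economy`, hallucination as `groundedness`), and the duplicate-step heuristic are all assumptions.

```python
import json
from dataclasses import dataclass


@dataclass
class DimensionScores:
    """Per-dimension scores, each normalized so that 1.0 is best."""
    task_completion: float  # 1.0 = task fully completed
    tool_efficiency: float  # 1.0 = no wasted tool calls
    step_economy: float     # 1.0 = no redundant steps
    groundedness: float     # 1.0 = no hallucinated claims detected


# Assumed weights for illustration; tune them per benchmark.
ASSUMED_WEIGHTS = {
    "task_completion": 0.5,
    "tool_efficiency": 0.15,
    "step_economy": 0.15,
    "groundedness": 0.2,
}


def step_economy_from_trace(trace: list[dict]) -> float:
    """Share of steps that are not exact repeats (assumes JSON-serializable steps)."""
    if not trace:
        return 1.0
    unique = {json.dumps(step, sort_keys=True) for step in trace}
    return len(unique) / len(trace)


def overall_score(scores: DimensionScores, weights: dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Weighted average of the dimension scores, normalized by the weight total."""
    total = 0.0
    for name, weight in weights.items():
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total += weight * value
    return total / sum(weights.values())
```

Keeping the per-dimension scores alongside the aggregate preserves explainability: a low overall score can be traced back to the dimension that caused it.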
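## Design Principles
- **Checkability**: grading rules must be explicit and executable
- **Explainability**: every failure must come with a reason
- **Progressive complexity**: start with rules and upgrade only as needed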
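## Related Pages
- [[agent-eval-trace]]: the grader's input data source
- [[agent-eval-case-design]]: evaluation cases that include grader configuration
- [[agent-harness-mini]]: the complete harness that contains the grader module
- [[agent-evaluation-paradigm-shift]]: the overall shift in evaluation paradigms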