---
title: "Verification & Evaluation"
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: [agent, evaluation, verification, benchmark, regression]
sources: [raw/papers/agent-harness-engineering-survey-2026.md]
confidence: high
---
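# Verification & Evaluation (V layer)
> The V layer of ETCLOVG: it turns tasks and traces into evaluations, failure attribution, and regression feedback. It should be studied as a **measurement instrument**, not as a leaderboard generator.
## Sublayers
- **Task and benchmark standardization**: SWE-bench, WebArena, and similar suites
- **Pre-execution readiness validation**: cross-layer tool readiness checks and sandbox state verification
- **Controlled execution and trace capture**: reproducible run environments
- **Multi-level judgment and failure attribution**: goes beyond pass/fail to distinguish model reasoning errors from harness configuration errors
- **Continuous regression and deployment feedback**: embeds evaluation in CI/CD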
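
The readiness-validation and failure-attribution sublayers can be sketched together. This is a minimal illustration under stated assumptions, not the survey's implementation: `FailureCause`, `Verdict`, `readiness_check`, and `evaluate` are hypothetical names, and a check for executables on `PATH` stands in for whatever readiness probes a real harness would run.

```python
import shutil
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class FailureCause(Enum):
    """Hypothetical attribution labels; a real taxonomy would be richer."""
    MODEL_REASONING = "model_reasoning"
    HARNESS_CONFIG = "harness_config"


@dataclass
class Verdict:
    passed: bool
    cause: Optional[FailureCause] = None  # None when the run passed
    detail: str = ""


def readiness_check(required_tools: list[str]) -> list[str]:
    """Return the required executables that are missing from PATH."""
    return [tool for tool in required_tools if shutil.which(tool) is None]


def evaluate(run_task: Callable[[], bool], required_tools: list[str]) -> Verdict:
    """Validate the environment first so harness faults are not scored as model faults."""
    missing = readiness_check(required_tools)
    if missing:
        # The task is never run: the environment was not ready.
        return Verdict(False, FailureCause.HARNESS_CONFIG, f"missing tools: {missing}")
    if run_task():
        return Verdict(True)
    # Simplifying assumption: once readiness passes, a failure is attributed to the model.
    return Verdict(False, FailureCause.MODEL_REASONING, "task checks failed")
```

The point of the design is that the verdict carries a cause alongside the pass/fail bit, so a run that never had a working environment does not count against the model.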
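## Core critique: the problem with final-score-centric evaluation
Current evaluation is too centered on the final score: a run passes or fails, and the resulting number is treated as evidence of model quality. In practice:
- A failure may stem from model reasoning, a misleading tool schema, sandbox misconfiguration, stale context, flaky tests, benchmark ambiguity, judge instability, or an orchestration loop
- Anthropic's analysis shows that infrastructure setup measurably changes benchmark scores
- Single-run pass rates can hide significant variance (Bjarnason et al., 2026)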
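
The variance point can be made concrete by repeating a task and reporting the pass rate with its standard error instead of a single outcome. The sketch below is illustrative only: `pass_rate_summary` is a hypothetical helper, and the runs are simulated with an assumed 60% success probability, not taken from any benchmark.

```python
import math
import random


def pass_rate_summary(outcomes: list[bool]) -> tuple[float, float]:
    """Return (pass rate, standard error) for repeated runs of one task."""
    n = len(outcomes)
    p = sum(outcomes) / n
    # Standard error of a binomial proportion.
    se = math.sqrt(p * (1 - p) / n)
    return p, se


random.seed(0)
# Simulated agent with an assumed true success probability of 0.6.
runs = [random.random() < 0.6 for _ in range(20)]
rate, se = pass_rate_summary(runs)
print(f"single run: {'pass' if runs[0] else 'fail'}")
print(f"20 runs: pass rate {rate:.2f} +/- {se:.2f}")
```

A single run reports only pass or fail, while the repeated-run summary exposes how uncertain the estimate is.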
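## Future direction: [[trace-native-evaluation]]
## Related concepts
- [[observability]]: the O layer and the V layer need to form a closed loop
- [[harness-coupling-problem]]: evaluation is affected by the execution environment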
- [[agent-harness-engineering-survey]]