20260601
36
concepts/verification-evaluation.md
Normal file
@@ -0,0 +1,36 @@
---
title: "Verification & Evaluation"
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: [agent, evaluation, verification, benchmark, regression]
sources: [raw/papers/agent-harness-engineering-survey-2026.md]
confidence: high
---
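
# Verification & Evaluation (V Layer)

> The V layer of ETCLOVG: it turns tasks and traces into evaluations, failure attribution, and regression feedback. It should be studied as a **measurement instrument**, not as a leaderboard generator.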
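
## Sublayers

- **Tasks and benchmarking**: SWE-bench, WebArena, and similar suites
- **Pre-execution readiness validation**: cross-layer tool readiness checks and sandbox state verification
- **Controlled execution and trace capture**: reproducible run environments
- **Multi-level judgment and failure attribution**: goes beyond pass/fail and separates model reasoning errors from harness configuration errors
- **Continuous regression and deployment feedback**: embeds evaluation in CI/CD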
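
The sublayers above can be sketched as a small pipeline. This is a minimal illustration, not the survey's method: the `Trace` fields, the `Verdict` categories, and the `judge` and `regressions` helpers are all hypothetical names chosen for this example. It assumes readiness checks are recorded on the trace, so that a failure in an unready environment is attributed to the harness rather than to the model.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    MODEL_ERROR = "model_error"      # environment was ready, task still failed
    HARNESS_ERROR = "harness_error"  # environment was not ready


@dataclass
class Trace:
    """Hypothetical record of one run; field names are illustrative."""
    task_id: str
    tools_ready: bool    # readiness check: every required tool responded
    sandbox_clean: bool  # readiness check: sandbox started from a known state
    tests_passed: bool   # outcome of the task's own checks


def judge(trace: Trace) -> Verdict:
    """Check the harness first, then attribute the outcome to the model."""
    if not (trace.tools_ready and trace.sandbox_clean):
        # An unready environment makes the outcome uninformative about the model.
        return Verdict.HARNESS_ERROR
    return Verdict.PASS if trace.tests_passed else Verdict.MODEL_ERROR


def regressions(baseline: dict[str, Verdict], current: dict[str, Verdict]) -> list[str]:
    """Return task ids that passed in the baseline but do not pass now."""
    return sorted(
        task
        for task, verdict in current.items()
        if baseline.get(task) is Verdict.PASS and verdict is not Verdict.PASS
    )


if __name__ == "__main__":
    traces = [
        Trace("t1", True, True, True),
        Trace("t2", True, False, False),  # dirty sandbox
        Trace("t3", True, True, False),   # genuine task failure
    ]
    current = {t.task_id: judge(t) for t in traces}
    baseline = {task_id: Verdict.PASS for task_id in current}
    for task_id in regressions(baseline, current):
        print(task_id, current[task_id].value)
```

In a CI job, the list returned by `regressions` could gate a merge, while the per-task verdict tells the reviewer whether to look at the model or at the harness configuration.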
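
## Core critique: the problem with final-score-centric evaluation

Current evaluation is too centered on the final score: a run passes or fails, and the resulting number is treated as evidence of model quality. In practice:
- A failure may stem from model reasoning, a misleading tool schema, sandbox misconfiguration, stale context, flaky tests, benchmark ambiguity, judge instability, or an orchestration loop
- Anthropic's analysis shows that infrastructure settings measurably change benchmark scores
- Single-run pass rates can hide significant variance (Bjarnason et al., 2026)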
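
To make the variance point concrete, the sketch below summarizes pass rates over repeated runs of the same task set instead of reporting a single number. The `pass_rate_summary` helper and the sample outcomes are made up for illustration; they are not data from the cited work.

```python
import statistics


def pass_rate_summary(runs: list[list[bool]]) -> dict[str, float]:
    """Summarize per-run pass rates; runs[i][j] is True when run i solved task j."""
    rates = [sum(run) / len(run) for run in runs]
    return {
        "mean": statistics.mean(rates),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(rates) if len(rates) > 1 else 0.0,
        "min": min(rates),
        "max": max(rates),
    }


if __name__ == "__main__":
    # Illustrative outcomes: three repeated runs over the same four tasks.
    runs = [
        [True, True, False, False],   # pass rate 0.50
        [True, True, True, False],    # pass rate 0.75
        [True, False, False, False],  # pass rate 0.25
    ]
    print(pass_rate_summary(runs))
```

With these sample outcomes the mean pass rate is 0.5, yet individual runs range from 0.25 to 0.75. Reporting any one of the three runs alone would overstate or understate the result.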

## Future direction: [[trace-native-evaluation]]

## Related concepts

- [[observability]]: the O layer and the V layer need to form a closed loop
- [[harness-coupling-problem]]: evaluation results are affected by the execution environment
- [[agent-harness-engineering-survey]]