---
title: "Trace-Native Evaluation"
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: [agent, evaluation, tracing, diagnosis, regression]
sources: [raw/papers/agent-harness-engineering-survey-2026.md]
confidence: medium
---
# Trace-Native Evaluation
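> Treat the agent trace, rather than only the final pass/fail score, as the primary object of evaluation. Outcome scores, trajectory quality, failure attribution, and regression tests are all computed from the trace.
## Why Trace-Native Evaluation Is Needed
Current evaluation is **final-score-centric**: a run either passes or fails, and the score is attributed to model quality. In practice, a failure can come from any of the following:
- Model reasoning errors
- Misleading tool schemas
- Sandbox misconfiguration
- Stale context
- Flaky tests
- Benchmark ambiguity
- Judge instability
- Orchestration-loop bugs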
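
The failure sources listed above can be sketched as a small attribution pass over a trace. This is an illustrative sketch only: the `Span` record, its `kind` values, the `error_type` and `verdicts` attributes, and the matching rules are all assumptions made for this example, not a schema or method taken from the source survey. It collects harness-level evidence first and falls back to blaming the model only when none is found.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class FailureSource(Enum):
    MODEL_REASONING = "model_reasoning"
    TOOL_SCHEMA = "tool_schema"
    SANDBOX_CONFIG = "sandbox_config"
    STALE_CONTEXT = "stale_context"
    FLAKY_TEST = "flaky_test"
    BENCHMARK_AMBIGUITY = "benchmark_ambiguity"
    JUDGE_INSTABILITY = "judge_instability"
    ORCHESTRATION_BUG = "orchestration_bug"


@dataclass
class Span:
    # Hypothetical span record; real tracing schemas will differ.
    name: str
    kind: str  # assumed values: "llm", "tool", "sandbox", "judge", "orchestrator"
    status: str = "ok"  # "ok" or "error"
    attributes: dict = field(default_factory=dict)


def attribute_failure(spans: list[Span]) -> list[FailureSource]:
    """Return candidate failure sources found in the trace, in span order.

    Only a few of the eight sources are covered by these toy rules.
    """
    candidates: list[FailureSource] = []
    for span in spans:
        # A judge whose repeated verdicts disagree is treated as unstable.
        if span.kind == "judge":
            verdicts = span.attributes.get("verdicts", [])
            if len(set(verdicts)) > 1:
                candidates.append(FailureSource.JUDGE_INSTABILITY)
        if span.status != "error":
            continue
        if span.kind == "sandbox":
            candidates.append(FailureSource.SANDBOX_CONFIG)
        elif span.kind == "tool" and span.attributes.get("error_type") == "schema_validation":
            candidates.append(FailureSource.TOOL_SCHEMA)
        elif span.kind == "orchestrator":
            candidates.append(FailureSource.ORCHESTRATION_BUG)
    # Deduplicate while preserving first-seen order.
    unique = list(dict.fromkeys(candidates))
    # Fall back to the model only when no harness-level evidence was found.
    return unique or [FailureSource.MODEL_REASONING]
```

Returning a list rather than a single label reflects that one failed run can carry evidence for several sources at once; a real attribution step would likely need richer signals than these rules.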
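## Closing the Observation-Evaluation Loop
- Turn anomalous production traces into regression cases
- Compute trajectory-quality metrics directly from spans
- Feed diagnostic signals back into prompt, tool, context, and orchestration changes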
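
A minimal sketch of the first two steps above, assuming a hypothetical `Span` record with `kind`, `status`, `duration_ms`, and an `args` attribute. The metric names and the anomaly thresholds (`max_error_rate`, `max_repeats`) are illustrative choices for this example, not values from the source.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    # Hypothetical span record; real tracing schemas will differ.
    name: str
    kind: str  # assumed values: "llm", "tool", ...
    duration_ms: float
    status: str = "ok"  # "ok" or "error"
    attributes: dict = field(default_factory=dict)


def trajectory_metrics(spans: list[Span]) -> dict:
    """Compute simple trajectory-quality metrics directly from spans."""
    tool_spans = [s for s in spans if s.kind == "tool"]
    errors = [s for s in tool_spans if s.status == "error"]
    # Count tool calls that repeat an earlier (name, args) pair exactly.
    seen: set = set()
    repeated = 0
    for s in tool_spans:
        key = (s.name, repr(sorted(s.attributes.get("args", {}).items())))
        if key in seen:
            repeated += 1
        else:
            seen.add(key)
    return {
        "steps": len(spans),
        "tool_calls": len(tool_spans),
        "tool_error_rate": len(errors) / len(tool_spans) if tool_spans else 0.0,
        "repeated_tool_calls": repeated,
        "total_duration_ms": sum(s.duration_ms for s in spans),
    }


def to_regression_case(trace_id: str, task_input: str, spans: list[Span],
                       max_error_rate: float = 0.5, max_repeats: int = 1) -> dict | None:
    """Turn an anomalous trace into a regression case; return None if it looks healthy."""
    metrics = trajectory_metrics(spans)
    anomalous = (metrics["tool_error_rate"] > max_error_rate
                 or metrics["repeated_tool_calls"] > max_repeats)
    if not anomalous:
        return None
    # Store the input and the observed metrics so a later run can be compared against them.
    return {"trace_id": trace_id, "input": task_input, "observed": metrics}
```

The resulting regression cases could then be replayed after each prompt, tool, context, or orchestration change, which is the third step in the list; that replay harness is out of scope for this sketch.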
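Reflexion (Shinn et al., 2023) showed that agents can learn from their own traces in short-horizon settings; extending this to long-running, multi-session harnesses remains an open problem.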
## Related Concepts
- [[verification-evaluation]]: evaluation at the V layer
- [[observability]]: the O layer produces the traces
- [[harness-coupling-problem]]: failure attribution requires cross-layer analysis
- [[agent-harness-engineering-survey]]