20260601

2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions
--- a/concepts/trace-native-evaluation.md
+++ b/concepts/trace-native-evaluation.md
@@ -0,0 +1,41 @@
+---
+title: "Trace-Native Evaluation"
+created: 2026-05-23
+updated: 2026-05-23
+type: concept
+tags: [agent, evaluation, tracing, diagnosis, regression]
+sources: [raw/papers/agent-harness-engineering-survey-2026.md]
+confidence: medium
+---
+
+# Trace-Native Evaluation
+
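+> Treat the agent trace as the primary object of evaluation, rather than looking only at the final pass/fail score. Outcome scores, trajectory quality, failure attribution, and regression tests are all computed from the trace.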
+
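+## Why Trace-Native?
+
+Current evaluation is **final-score-centric**: a run either passes or fails, and the score is attributed to model quality. In practice, a failure can originate from any of the following: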
+
+- Model reasoning errors
+- Misleading tool schemas
+- Sandbox misconfiguration
+- Stale context
+- Flaky tests
+- Benchmark ambiguity
+- Judge instability
+- Orchestration-loop bugs
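+
+One way to make this attribution concrete is to tag each span with the layer that produced it and count failures per layer. The sketch below is only illustrative: the `Span` class, its `layer` labels, and `attribute_failures` are hypothetical names, not part of any tracing library or of the cited survey.
+
+```python
+from collections import Counter
+from dataclasses import dataclass
+
+
+@dataclass
+class Span:
+    layer: str  # hypothetical label, e.g. "model", "tool", "sandbox", "judge"
+    name: str   # operation name recorded by the harness
+    ok: bool    # whether this span completed without error
+
+
+def attribute_failures(spans):
+    """Count failed spans per layer instead of blaming the model by default."""
+    return Counter(s.layer for s in spans if not s.ok)
+
+
+# Assumed example trace: the model planned fine, but tools and sandbox failed.
+trace = [
+    Span("model", "plan", True),
+    Span("tool", "search", False),
+    Span("sandbox", "exec", False),
+    Span("tool", "read_file", False),
+]
+blame = attribute_failures(trace)
+```
+
+A final-score-centric evaluation would record this run as a single failure; the per-layer counts indicate which layer to inspect first.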
+
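+## Closing the Observation-Evaluation Loop
+
+- Turn anomalous production traces into regression cases
+- Compute trajectory-quality metrics directly from spans
+- Feed diagnostic signals back into prompt, tool, context, and orchestration changes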
+
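+Reflexion (Shinn et al., 2023) showed that agents can learn from their own traces in short-horizon settings; extending this to long-running, multi-session harnesses remains an open problem.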
+
+## Related Concepts
+
+- [[verification-evaluation]]: V-layer evaluation
+- [[observability]]: the O layer produces traces
+- [[harness-coupling-problem]]: failure attribution requires cross-layer analysis
+- [[agent-harness-engineering-survey]]