SidneyZhang/myWiki

Sidney Zhang e96b955fda

20260601

2026-06-01 10:46:01 +08:00

title: Trace-Native Evaluation
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: agent, evaluation, tracing, diagnosis, regression
sources: raw/papers/agent-harness-engineering-survey-2026.md
confidence: medium

Trace-Native Evaluation

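Treat the agent trace, rather than only the final pass/fail score, as the primary object of evaluation. Outcome scores, trajectory quality, failure attribution, and regression tests are all computed from the trace.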
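
As a rough illustration of computing more than a final score from a trace, the sketch below models a trace as an ordered list of spans and derives an outcome score plus two simple trajectory-quality metrics. The `Span` fields, the span naming, and the metric definitions are assumptions made for this example, not a schema taken from the survey.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str                      # hypothetical naming, e.g. "tool:read_file"
    kind: str                      # "llm", "tool", or "test"
    ok: bool = True                # whether this step succeeded
    attrs: dict = field(default_factory=dict)


def outcome_score(trace):
    """1.0 if at least one test span exists and all test spans passed, else 0.0."""
    tests = [s for s in trace if s.kind == "test"]
    return 1.0 if tests and all(s.ok for s in tests) else 0.0


def tool_error_rate(trace):
    """Fraction of tool spans that failed."""
    tools = [s for s in trace if s.kind == "tool"]
    if not tools:
        return 0.0
    return sum(not s.ok for s in tools) / len(tools)


def redundant_call_ratio(trace):
    """Fraction of tool spans repeating an earlier call with the same name and arguments."""
    seen, repeats, total = set(), 0, 0
    for s in trace:
        if s.kind != "tool":
            continue
        total += 1
        key = (s.name, repr(sorted(s.attrs.items())))
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / total if total else 0.0


trace = [
    Span("llm:plan", "llm"),
    Span("tool:read_file", "tool", attrs={"path": "a.py"}),
    Span("tool:read_file", "tool", attrs={"path": "a.py"}),
    Span("tool:run_tests", "tool", ok=False),
    Span("test:unit", "test", ok=True),
]

print(outcome_score(trace), tool_error_rate(trace), redundant_call_ratio(trace))
```

In this example the run passes, yet one of the three tool calls failed and one was a repeat of an earlier call, which a final score alone would not show.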

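Why Trace-Native?

Current evaluation is final-score-centric: a run either passes or fails, and the score is attributed to model quality. In practice, a failure can come from any of the following: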

Model reasoning errors
Misleading tool schemas
Sandbox misconfiguration
Stale context
Flaky tests
Benchmark ambiguity
Judge instability
Orchestration loop bugs
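
A minimal sketch of how a trace could be used to attribute a failure to one of these causes instead of defaulting to the model. It assumes each span is a dict with hypothetical `layer`, `status`, and `error` keys, and it uses a small hand-written rule table; both the keys and the rules are illustrative. Taking the earliest failing span as the cause is only a heuristic, and benchmark ambiguity is omitted from this sketch.

```python
# Hypothetical rules: (layer, keyword that must appear in the error, cause label).
# An empty keyword matches any error message from that layer.
RULES = [
    ("tool", "schema", "misleading tool schema"),
    ("sandbox", "", "sandbox misconfiguration"),
    ("context", "stale", "stale context"),
    ("test", "flaky", "flaky test"),
    ("judge", "", "judge instability"),
    ("orchestrator", "", "orchestration loop bug"),
    ("model", "", "model reasoning error"),
]


def attribute_failure(spans):
    """Return a cause label for the earliest failing span, or None if nothing failed."""
    for span in spans:
        if span.get("status") != "error":
            continue
        layer = span.get("layer", "")
        message = span.get("error", "").lower()
        for rule_layer, keyword, cause in RULES:
            if layer == rule_layer and keyword in message:
                return cause
        return "unattributed"  # a failure that no rule explains
    return None


spans = [
    {"layer": "model", "status": "ok"},
    {"layer": "tool", "status": "error", "error": "Schema mismatch: expected 'path'"},
    {"layer": "model", "status": "error", "error": "wrong final answer"},
]

print(attribute_failure(spans))
```

A final-score-centric view would record this run as a model failure, while the trace points at the tool schema as the first thing that went wrong.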

Closing the observation-evaluation loop

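Turn anomalous production traces into regression cases
Compute trajectory-quality metrics directly from spans
Feed diagnostic signals back into prompt, tool, context, and orchestration changes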
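
The first step of this loop could be sketched as follows: flag a production trace as anomalous, freeze it into a regression case, and check later runs against that case. The trace layout (`trace_id`, `input`, `spans` with `name` and `status`), the anomaly rule, and the pass criterion are all assumptions chosen for illustration.

```python
def is_anomalous(trace, max_steps=20):
    """Flag traces that contain an error span or exceed a step budget."""
    spans = trace["spans"]
    return len(spans) > max_steps or any(s["status"] == "error" for s in spans)


def to_regression_case(trace):
    """Freeze the input and the observed failures of an anomalous trace."""
    failing = [s["name"] for s in trace["spans"] if s["status"] == "error"]
    return {
        "id": f"regress-{trace['trace_id']}",
        "input": trace["input"],
        "must_not_fail": failing,          # span names that failed in production
        "max_steps": len(trace["spans"]),  # a new run should not need more steps
    }


def check(case, new_trace):
    """True if the new run avoids the old failures within the old step count."""
    failed_now = {s["name"] for s in new_trace["spans"] if s["status"] == "error"}
    within_budget = len(new_trace["spans"]) <= case["max_steps"]
    return not (failed_now & set(case["must_not_fail"])) and within_budget


prod_trace = {
    "trace_id": "t1",
    "input": "fix the failing unit test",
    "spans": [
        {"name": "llm:plan", "status": "ok"},
        {"name": "tool:apply_patch", "status": "error"},
        {"name": "llm:retry", "status": "ok"},
    ],
}
fixed_trace = {
    "trace_id": "t2",
    "input": "fix the failing unit test",
    "spans": [
        {"name": "llm:plan", "status": "ok"},
        {"name": "tool:apply_patch", "status": "ok"},
    ],
}

case = to_regression_case(prod_trace) if is_anomalous(prod_trace) else None
print(case, check(case, fixed_trace))
```

Replaying `case["input"]` after a prompt, tool, context, or orchestration change and running `check` on the resulting trace gives a trace-level regression signal rather than a single pass/fail bit.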

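Reflexion (Shinn et al., 2023) showed that an agent can learn from its own traces in short-horizon settings; extending this to long-running, multi-session harnesses remains an open problem.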

Related concepts

verification-evaluation: V-layer evaluation
observability: the O layer produces the traces
harness-coupling-problem: failure attribution requires cross-layer analysis
agent-harness-engineering-survey