SidneyZhang/myWiki

Sidney Zhang 91fac5b6fc

20260617: 914 pages so far

2026-06-17 15:02:40 +08:00

title: Agent Observability
created: 2026-05-30
updated: 2026-05-30
type: concept
tags: agent, observability, monitoring, tracing, production
sources: agent-harness-engineering-survey
confidence: high

Agent Observability

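The O layer of ETCLOVG: a standalone architectural layer for monitoring, debugging, and production-grade reliability management of agent behavior.

Core Definition

Agent observability is the fifth layer (O) of the seven-layer ETCLOVG taxonomy, split out of Lifecycle Hooks and treated as a first-class architectural concern. It covers four aspects of an agent system: structured tracing, cost attribution, reliability engineering, and unified observation.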

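Why the O Layer Needs to Be Independent

The paper promotes Observability from an adjunct of Lifecycle to a standalone layer, for the following reasons:

The O layer has its own platform ecosystem: Langfuse, Arize Phoenix, OpenLLMetry, and others
In production deployments it is owned by a different team (SRE vs. agent developers)
The engineering practices of the O layer differ fundamentally from orchestration logic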

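Four Subsystems

1. Tracing and Monitoring

Langfuse / Opik / Arize Phoenix / MLflow: interactive trace trees, latency flame graphs, token breakdowns
OpenTelemetry (OTel): has become the de facto standard for agent observability
Semantic conventions: define span attributes (model name, temperature, token counts, latency)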
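
The trace tree and span attributes listed above can be sketched with the standard library alone. This is a minimal illustration, not the API of Langfuse, OpenTelemetry, or any other platform named here: `Span`, `Tracer`, and `total_tokens` are hypothetical names, and the attribute keys are modeled on the OpenTelemetry GenAI semantic conventions, which are still evolving and may differ in the version you use.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    """One node of a trace tree (hypothetical stand-in for a real OTel span)."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)
    duration_ms: float = 0.0


class Tracer:
    """Builds a trace tree by nesting `with tracer.span(...)` blocks."""

    def __init__(self) -> None:
        self.root: Optional[Span] = None
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, attributes: Optional[Dict[str, Any]] = None) -> Iterator[Span]:
        node = Span(name, dict(attributes or {}))
        if self._stack:
            self._stack[-1].children.append(node)  # attach to the enclosing span
        else:
            self.root = node  # first top-level span becomes the root
        self._stack.append(node)
        start = time.perf_counter()
        try:
            yield node
        finally:
            node.duration_ms = (time.perf_counter() - start) * 1000.0
            self._stack.pop()


def total_tokens(span: Span) -> int:
    """Sum input and output tokens over a span and all of its descendants."""
    own = (span.attributes.get("gen_ai.usage.input_tokens", 0)
           + span.attributes.get("gen_ai.usage.output_tokens", 0))
    return own + sum(total_tokens(child) for child in span.children)


# Usage: one agent run containing an LLM call and a tool call.
tracer = Tracer()
with tracer.span("agent.run"):
    with tracer.span("llm.call", {
        "gen_ai.request.model": "example-model",  # placeholder model name
        "gen_ai.request.temperature": 0.2,
    }) as llm_span:
        # In a real system these counts would come from the provider response.
        llm_span.attributes["gen_ai.usage.input_tokens"] = 120
        llm_span.attributes["gen_ai.usage.output_tokens"] = 45
    with tracer.span("tool.call", {"tool.name": "search"}):
        pass

print(total_tokens(tracer.root))
```

Keeping latency on the span itself and token counts in attributes is what lets a UI render both a flame graph and a token breakdown from the same tree.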

2. Agent-Specific Operations Platforms

AgentOps, RagaAI Catalyst, Laminar, Watson, AgentLens
Provide agent-specific debugging and replay capabilities

3. Cost Tracking and Optimization

TensorZero, Helicone: cost attribution and gateways
FrugalGPT, GPTCache: cost-saving strategies
Dual-Pool Routing: model routing optimization
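
Cost attribution, as described above, amounts to pricing each call from its token usage and rolling the result up by owner. The sketch below assumes a hypothetical price table and treats a cache hit as free; the model names, prices, and the `LLMCall` record are illustrative only and do not reflect how TensorZero, Helicone, or GPTCache work internally.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical prices in USD per 1M tokens; real prices vary by provider and model.
PRICES: Dict[str, Dict[str, float]] = {
    "small-model": {"input": 0.50, "output": 1.50},
    "large-model": {"input": 5.00, "output": 15.00},
}


@dataclass
class LLMCall:
    agent: str            # which agent or component issued the call
    model: str
    input_tokens: int
    output_tokens: int
    cache_hit: bool = False  # served from a response cache, so no provider charge


def call_cost(call: LLMCall) -> float:
    """Cost of a single call in USD; cached responses are assumed to cost nothing."""
    if call.cache_hit:
        return 0.0
    price = PRICES[call.model]
    return (call.input_tokens * price["input"]
            + call.output_tokens * price["output"]) / 1_000_000


def attribute_costs(calls: List[LLMCall]) -> Dict[str, float]:
    """Roll up cost per agent so spend can be traced back to its source."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call.agent] += call_cost(call)
    return dict(totals)


calls = [
    LLMCall("planner", "large-model", 2000, 500),
    LLMCall("retriever", "small-model", 4000, 200),
    LLMCall("retriever", "small-model", 4000, 200, cache_hit=True),
]
costs = attribute_costs(calls)
for agent, usd in costs.items():
    print(f"{agent}: ${usd:.4f}")
```

The same record could carry a task or user identifier instead of `agent` if spend needs to be attributed along a different dimension.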

4. Reliability Engineering

Anthropic's Effective Harnesses and Harness Design
AgentErrorTaxonomy: error classification
SentinelAgent / AgentFixer: fault detection and repair
Core thesis: "infrastructure noise" measurably changes benchmark scores
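
One way to make the "infrastructure noise" claim concrete is to label each failed run by cause and compare the pass rate with and without infrastructure failures. The sketch below is an assumption-laden illustration: the three `Outcome` categories are hypothetical and are not the categories of AgentErrorTaxonomy, and the run results are made-up sample data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Outcome(Enum):
    PASSED = "passed"
    AGENT_ERROR = "agent_error"  # e.g. wrong answer or bad tool call
    INFRA_ERROR = "infra_error"  # e.g. timeout, rate limit, sandbox crash


@dataclass
class RunResult:
    task_id: str
    outcome: Outcome


def raw_pass_rate(results: List[RunResult]) -> float:
    """Pass rate as a naive harness would report it: every failure counts."""
    passed = sum(r.outcome is Outcome.PASSED for r in results)
    return passed / len(results)


def adjusted_pass_rate(results: List[RunResult]) -> float:
    """Pass rate over runs that were not disturbed by infrastructure failures."""
    valid = [r for r in results if r.outcome is not Outcome.INFRA_ERROR]
    passed = sum(r.outcome is Outcome.PASSED for r in valid)
    return passed / len(valid)


# Made-up sample: 6 passes, 2 agent errors, 2 infrastructure errors.
results = (
    [RunResult(f"task-{i}", Outcome.PASSED) for i in range(6)]
    + [RunResult(f"task-{i}", Outcome.AGENT_ERROR) for i in range(6, 8)]
    + [RunResult(f"task-{i}", Outcome.INFRA_ERROR) for i in range(8, 10)]
)

raw = raw_pass_rate(results)
adjusted = adjusted_pass_rate(results)
print(f"raw={raw:.2f} adjusted={adjusted:.2f} gap={adjusted - raw:.2f}")
```

The gap between the two numbers is the part of the score that belongs to the harness rather than the agent, which is why the failure cause has to be recorded at observation time.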

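Related Concepts

etclovg-taxonomy: the seven-layer taxonomy
lifecycle-orchestration: the orchestration layer (from which the O layer was split out)
open-telemetry: the de facto standard
logfire: an OTel observability platform in the Pydantic ecosystem; integrates in 4 lines of code and supports SQL queries over traces
drift-detection: noticing that "things started going wrong at run 32" before "the 47th error" arrives
agent-harness-engineering: the overall framework
cost-quality-speed-trilemma: the cost dimension
pydantic-three-piece-suite: from validation to observability to agent type safety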
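
The drift-detection idea above, catching the trend at run 32 rather than at the 47th error, can be sketched as a rolling failure count. This is one simple technique chosen for illustration, not the method of the drift-detection page; `first_drift_index`, the window size, the threshold, and the sample failure positions are all hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Optional


def first_drift_index(
    failures: Iterable[bool],
    window: int = 10,
    max_failures: int = 3,
) -> Optional[int]:
    """Return the 1-based run number at which the last `window` runs first
    contain at least `max_failures` failures, or None if that never happens."""
    recent: Deque[bool] = deque(maxlen=window)  # old runs fall off automatically
    for run_number, failed in enumerate(failures, start=1):
        recent.append(failed)
        if sum(recent) >= max_failures:
            return run_number
    return None


# Made-up history of 50 runs with scattered failures at runs 26, 29, 32 and 47.
failed_runs = {26, 29, 32, 47}
history = [run in failed_runs for run in range(1, 51)]

alert_at = first_drift_index(history)
print(alert_at)
```

With this sample history the alert fires at run 32, when the third failure lands inside a ten-run window, well before the failure at run 47.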