20260601
concepts/agent-safety-evaluation.md
@@ -0,0 +1,37 @@
---
title: "Agent Safety Evaluation"
created: 2026-05-23
updated: 2026-05-23
type: concept
tags: [agent, safety, evaluation, security]
sources: [raw/articles/claw-eval-2026.md]
confidence: medium
---

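# Agent Safety Evaluation

> Evaluates whether an agent complies with its constraints during execution and avoids behavior that should not occur. Checking that the result is correct is not enough; the process must also be safe.
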
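## Safety evaluation findings from Claw-Eval

- Even with access to the full conversation transcript and tool-call information, an ordinary LLM judge still **missed 44% of safety violations**
- Safety violations cannot be detected from the text transcript alone → server-side logs and environment snapshots must also be examined
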
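The second finding can be illustrated with a minimal sketch, assuming the evaluator can read a list of requests recorded by the service and a before/after snapshot of the environment. Every name here (`ServerEvent`, `find_violations`, `unexpected_changes`, the endpoint strings) is hypothetical and chosen for illustration; this is not Claw-Eval's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerEvent:
    """One request observed by the service (hypothetical log record)."""
    endpoint: str
    method: str


def find_violations(transcript_calls, server_events, allowed_endpoints):
    """List server-side requests outside the allow-list.

    Each violation records whether it was also visible in the transcript,
    so violations that a transcript-only judge could not see stand out.
    """
    reported = {(c["endpoint"], c["method"]) for c in transcript_calls}
    violations = []
    for ev in server_events:
        if ev.endpoint in allowed_endpoints:
            continue
        violations.append({
            "endpoint": ev.endpoint,
            "method": ev.method,
            "in_transcript": (ev.endpoint, ev.method) in reported,
        })
    return violations


def unexpected_changes(before, after, writable_paths):
    """Compare two environment snapshots (path -> content hash) and
    return the changed paths the task was not permitted to modify."""
    changed = {p for p in before.keys() | after.keys()
               if before.get(p) != after.get(p)}
    return sorted(changed - set(writable_paths))


# Assumed example data: the transcript shows one read, the server saw two requests.
transcript_calls = [{"endpoint": "/files/read", "method": "GET"}]
server_events = [
    ServerEvent("/files/read", "GET"),
    ServerEvent("/admin/users", "DELETE"),  # never mentioned in the transcript
]
violations = find_violations(transcript_calls, server_events, {"/files/read"})

# Assumed snapshots: "b" changed although only "c" was writable.
drift = unexpected_changes({"a": "1", "b": "2"},
                           {"a": "1", "b": "3", "c": "4"},
                           {"c"})
```

In this sketch the `DELETE` request is flagged with `in_transcript` set to `False`, which is the kind of violation a judge that reads only the transcript has no way to detect.
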
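## Challenges in safety evaluation

- **Stealth**: a safety violation may not show up in the conversation text (for example, an unauthorized API call)
- **Context dependence**: the same operation can be safe or unsafe depending on the task's constraints
- **Limited detection capability**: a pure LLM judge has limited ability to judge safety boundaries
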
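Context dependence can be made concrete with a small sketch in which each task carries its own constraints and the same action is judged against them. The `TaskConstraints` class, the `is_safe` function, and the action and path names are all hypothetical; a real evaluator would need a much richer policy model.

```python
from dataclasses import dataclass, field


@dataclass
class TaskConstraints:
    """Per-task safety constraints (hypothetical structure)."""
    allowed_actions: set = field(default_factory=set)
    forbidden_targets: set = field(default_factory=set)


def is_safe(action, target, constraints):
    """An action is safe only if the task allows it and the target is not forbidden."""
    if action not in constraints.allowed_actions:
        return False
    return target not in constraints.forbidden_targets


# Two assumed tasks: a cleanup task may delete files, an audit task is read-only.
cleanup_task = TaskConstraints(allowed_actions={"read", "delete"},
                               forbidden_targets={"/etc"})
audit_task = TaskConstraints(allowed_actions={"read"})

same_action = ("delete", "/tmp/cache")
safe_in_cleanup = is_safe(*same_action, cleanup_task)
safe_in_audit = is_safe(*same_action, audit_task)
```

The same `delete` on `/tmp/cache` passes under `cleanup_task` and fails under `audit_task`, so a verdict on an action only means something when paired with the constraints of the task it ran under.
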
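## Relationship to the Governance layer

Safety evaluation closes the feedback loop for [[governance-security]] (the G layer):
- Security gaps exposed by evaluation → harden Governance policies
- It also verifies whether the Governance layer's guardrails are actually effective
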
## Related concepts

- [[agent-process-evaluation]]
- [[agent-robustness-evaluation]]
- [[governance-security]]
- [[claw-eval]]