20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/concepts/patch-based-evaluation.md
+++ b/concepts/patch-based-evaluation.md
@@ -0,0 +1,44 @@
+---
+title: "Patch-Based Evaluation (the patch-based evaluation contract)"
+created: 2026-06-15
+updated: 2026-06-15
+type: concept
+tags: [benchmark, evaluation, coding-agent]
+sources: [raw/papers/zheng-claw-swe-bench-2026.md]
+---
+
+# Patch-Based Evaluation
+
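+## Definition
+
+Patch-Based Evaluation is the core evaluation contract of SWE-bench: given a GitHub issue's `problem_statement`, the target `repo`, and a `base_commit`, the system must submit a diff patch that applies to a checkout of the repository. The official evaluation harness does not read the interaction trajectory or the final natural-language answer; it reads only a predictions file containing the `instance_id`, `model_name_or_path`, and `model_patch` strings.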
+
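+## Differences from Traditional Agent Evaluation
+
+| | Traditional agent evaluation | Patch-Based Evaluation |
+|---|---|---|
+| Output format | Final text / JSON / natural language | Git diff patch |
+| Scoring | Parse the output content | Repository-level tests pass |
+| Agent behavior | Free-form interaction | Must edit repository files |
+| Contract strictness | Low | High (the patch must apply) |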
+
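+## Core Challenges
+
+1. **General-purpose agents do not satisfy the contract:** General-purpose agents such as OpenClaw usually emit final text or structured messages, which the evaluator cannot score directly.
+2. **Patch generation is fragile:** Generating unified diff text directly is highly error-prone: line-number offsets, wrong context lines, and mismatched hunk headers.
+3. **Contamination by non-solution artifacts:** Agents may create session files, caches, and similar artifacts; if these end up in the git diff, they contaminate the patch.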
+
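+## The Full Adapter Solution
+
+Claw-SWE-Bench's Full Adapter shifts output responsibility from "the model writes patch text" to "the model edits repository files, and the runner exports the patch from Git state":
+- The agent edits files under `/testbed` through tools
+- The runner computes the diff against `base_commit`
+- Non-solution artifacts are removed
+- A SWE-bench-compatible predictions file is written
+
+Result: the Apply Failed rate drops from 69.1% to <1.5%.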
+
+## References
+- [[claw-swe-bench|Claw-SWE-Bench paper]]
+- [[adapter-protocol|Adapter protocol]]
+- [[swe-bench|SWE-bench]]
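
Reviewer sketch (not part of the commit): the Full Adapter steps described in the note above could look roughly like the Python below. It is a minimal sketch under stated assumptions, not the Claw-SWE-Bench implementation. The artifact directory names in `ARTIFACT_DIRS`, the helper names, and the choice of one JSON object per line are all hypothetical; only the three prediction keys and the "diff against `base_commit`" idea come from the note.

```python
import json
import subprocess
from pathlib import PurePosixPath

# Hypothetical artifact directories; the note only says "session files, caches".
ARTIFACT_DIRS = {".session", ".cache", "__pycache__"}


def is_artifact(path: str) -> bool:
    """True if any component of a git-style (POSIX) path is an artifact directory."""
    return any(part in ARTIFACT_DIRS for part in PurePosixPath(path).parts)


def filter_solution_paths(paths: list[str]) -> list[str]:
    """Keep only paths that look like part of the solution."""
    return [p for p in paths if not is_artifact(p)]


def git(repo_dir: str, *args: str) -> str:
    """Run a git command inside repo_dir and return its stdout."""
    result = subprocess.run(
        ["git", *args], cwd=repo_dir, check=True, capture_output=True, text=True
    )
    return result.stdout


def export_patch(repo_dir: str, base_commit: str) -> str:
    """Derive the patch from Git state instead of asking the model to write it."""
    # Stage everything so files newly created by the agent show up in the diff.
    git(repo_dir, "add", "-A")
    changed = git(repo_dir, "diff", "--cached", "--name-only", base_commit).splitlines()
    keep = filter_solution_paths(changed)
    if not keep:
        return ""  # nothing but artifacts changed: emit an empty patch
    # Restrict the diff to the kept paths, measured against base_commit.
    return git(repo_dir, "diff", "--cached", base_commit, "--", *keep)


def prediction_line(instance_id: str, model_name: str, patch: str) -> str:
    """Serialize one prediction with the three keys the harness reads."""
    record = {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    }
    return json.dumps(record)
```

A runner built this way would call `export_patch("/testbed", base_commit)` after the agent finishes, then append `prediction_line(...)` to the predictions file. Filtering by path before diffing keeps the model out of the business of writing hunk headers, which is the failure mode the note attributes to direct patch generation. Paths with unusual characters (which git may quote in `--name-only` output) are not handled in this sketch.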