---
title: "SWE-bench"
created: 2026-06-15
updated: 2026-06-15
type: concept
tags: [benchmark, evaluation, coding-agent]
sources: [raw/papers/zheng-claw-swe-bench-2026.md]
---
# SWE-bench
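## Definition
SWE-bench is the de facto standard for evaluating repository-level coding agents. It is built on real GitHub issues: the system must submit a diff patch that can be applied to the repository, and repository-level tests decide whether the issue is resolved. The core scoring contract: given `problem_statement`, `repo`, and `base_commit`, the system submits a `model_patch` → the evaluator applies the patch → runs the tests → Resolved/Not Resolved.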
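
The scoring contract can be sketched in a few lines of Python. This is a minimal illustration, not the actual SWE-bench evaluator: the `Instance` class, the `evaluate` function, and the two injected callables (`apply_patch`, `run_tests`) are hypothetical names chosen here, and the stub implementations stand in for real patch application and test execution.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Instance:
    """Inputs named in the scoring contract (hypothetical container)."""
    problem_statement: str  # issue text shown to the system
    repo: str               # repository identifier
    base_commit: str        # commit the patch must apply to


def evaluate(
    instance: Instance,
    model_patch: str,
    apply_patch: Callable[[Instance, str], bool],
    run_tests: Callable[[Instance], bool],
) -> str:
    """Return "Resolved" or "Not Resolved" for one submitted patch."""
    # An empty patch, or one that does not apply, cannot resolve the issue.
    if not model_patch.strip() or not apply_patch(instance, model_patch):
        return "Not Resolved"
    # The repository-level tests make the final decision.
    return "Resolved" if run_tests(instance) else "Not Resolved"


# Stubs (assumptions for illustration only): a real evaluator would check out
# base_commit, apply the diff, and run the repository's test suite.
def fake_apply(instance: Instance, patch: str) -> bool:
    return patch.startswith("diff --git")


def fake_tests(instance: Instance) -> bool:
    return True


inst = Instance("Fix crash on empty input", "example/repo", "abc123")
print(evaluate(inst, "diff --git a/x.py b/x.py\n", fake_apply, fake_tests))
```

Passing the patch and test steps in as callables keeps the contract itself separate from how a particular evaluator checks out, patches, and tests the repository.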
## Key Components
- **SWE-bench:** the original issue-resolution benchmark over Python repositories
- **SWE-bench-Multilingual:** extends coverage to 7 non-Python languages (Java, Go, Rust, JS/TS, C/C++, Ruby, PHP); contributes 300 instances
- **SWE-bench-Verified-Mini:** a human-verified Python subset; contributes 50 instances
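## Positioning of Claw-SWE-Bench
Claw-SWE-Bench upgrades the SWE-bench evaluation paradigm from "single-system reporting" to "controlled experiment":
- Keeps SWE-bench's patch-based evaluation contract
- Treats the agent harness as a controlled experimental variable
- Adds cost accounting as a first-class evaluation axis
- Provides a standardized adapter protocol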
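
As a rough illustration of treating the harness as the controlled variable while reporting cost alongside accuracy, the sketch below groups run records by harness and computes a resolve rate and a cost per resolved instance. The `RunRecord` fields, the `summarize` function, the dollar cost unit, and the sample numbers are all assumptions made for this example; the source does not specify Claw-SWE-Bench's actual metrics or record schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One harness run on one instance (hypothetical schema)."""
    harness: str      # the controlled variable
    instance_id: str
    resolved: bool
    cost_usd: float   # assumed cost unit


def summarize(records):
    """Aggregate resolve rate and cost for each harness."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.harness].append(record)

    summary = {}
    for harness, runs in grouped.items():
        n_resolved = sum(r.resolved for r in runs)
        total_cost = sum(r.cost_usd for r in runs)
        summary[harness] = {
            "resolve_rate": n_resolved / len(runs),
            "total_cost": total_cost,
            # Undefined when nothing was resolved, so report None.
            "cost_per_resolved": total_cost / n_resolved if n_resolved else None,
        }
    return summary


# Made-up sample data: same instances, two harnesses.
records = [
    RunRecord("harness_a", "inst-1", True, 0.5),
    RunRecord("harness_a", "inst-2", False, 1.5),
    RunRecord("harness_b", "inst-1", True, 0.25),
    RunRecord("harness_b", "inst-2", True, 0.25),
]
for name, stats in summarize(records).items():
    print(name, stats)
```

Running every harness on the same instances is what makes the per-harness numbers comparable; the aggregation itself is deliberately simple.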
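## Related Work
Claw-SWE-Bench differs from earlier SWE-bench-derived work in three respects:
- HAL: advocates holistic accuracy-cost-latency evaluation, but releases only one harness
- SWE-Bench Pro: unifies scaffolding, but uses it to compare models rather than harnesses
- SWE-Effi: notes scaffold-model entanglement, but does not turn it into a controlled measurement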
## References
- [[claw-swe-bench|Claw-SWE-Bench paper]]
- [[patch-based-evaluation|Patch-Based Evaluation]]
- [[cost-aware-benchmarking|Cost-Aware Benchmarking]]