---
source_url: https://arxiv.org/abs/2606.12344v1
ingested: 2026-06-15
arxiv_id: 2606.12344v1
sha256: TBD
---

# Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

**Authors:** Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang

**Affiliations:** TokenRhythm Technologies, Infinigence AI, City University of Hong Kong, SEE Fund, Peking University, Shanghai Jiaotong University, Beijing Jiaotong University, Tsinghua University

**arXiv:** 2606.12344v1 | **Date:** 2026-06-10 | **Categories:** cs.LG, cs.CL

**Resources:** https://github.com/opensquilla/claw-swe-bench | https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench

## Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings, including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator.

The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite, an 80-instance subset for faster validation, selected by a cost-aware, rank-aware procedure over 17 calibration columns.

Key findings:
- OpenClaw with minimal direct-diff adapter: 19.1% Pass@1
- OpenClaw with full adapter: 73.4% Pass@1 (same GLM 5.1 backbone)
- Model choice changes Pass@1 by 29.4 pp; harness choice by 27.4 pp
- Systems with similar accuracy can differ substantially in total API cost
- Claw-SWE-Bench treats harness and cost accounting as first-class evaluation axes
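
The cost-related findings above suggest comparing systems on accuracy and cost jointly rather than on Pass@1 alone. A minimal sketch of one way to do that is a Pareto filter over (Pass@1, total API cost) pairs. The `RunSummary` type, the configuration names, and every number below are hypothetical placeholders chosen for illustration, not results from the paper, and the paper's own analysis may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """One harness/model configuration (hypothetical record layout)."""
    name: str
    pass_at_1: float       # fraction of instances resolved, in [0, 1]
    total_cost_usd: float  # total API cost of the full run


def pareto_frontier(runs):
    """Return the runs that no other run dominates, cheapest first.

    A run is dominated if some other run is at least as accurate and at
    least as cheap, and strictly better on at least one of the two axes.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o.pass_at_1 >= r.pass_at_1
            and o.total_cost_usd <= r.total_cost_usd
            and (o.pass_at_1 > r.pass_at_1 or o.total_cost_usd < r.total_cost_usd)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.total_cost_usd)


# Placeholder values only; these are NOT numbers reported by the benchmark.
runs = [
    RunSummary("config_a", 0.70, 120.0),
    RunSummary("config_b", 0.71, 300.0),
    RunSummary("config_c", 0.55, 40.0),
    RunSummary("config_d", 0.50, 90.0),
]
frontier_names = [r.name for r in pareto_frontier(runs)]
print(frontier_names)
```

In this toy data, `config_d` is dropped because `config_c` is both cheaper and more accurate, while `config_a` and `config_b` both stay on the frontier even though their accuracy is nearly identical and their cost differs by more than 2x.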
## Key Concepts

- Agent harness (claw) as controlled experimental variable
- Adapter protocol: lifecycle methods (create_agent, send_task, backup_session, delete_agent, get_docker_args)
- Full adapter vs bare adapter design
- Cost-aware benchmarking: Pass@1 + total API cost + wall-clock duration + cache hit rate
- Pareto frontier of accuracy vs cost
- Claw-SWE-Bench Lite: 80-instance cost-aware rank-aware subset
- Future-commit cleanup for fair evaluation
- Patch-based evaluation contract (git diff from /testbed)
- Harness × model interaction effects
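
To make the adapter protocol and the patch-based contract listed above more concrete, here is a rough sketch of how the five lifecycle methods might fit together around a `git diff` taken from `/testbed`. Only the method names and the `/testbed` path come from the summary above. The signatures, the call order, the `ClawAdapter`, `RecordingAdapter`, `git_diff_patch`, and `run_instance` names, and the shape of the returned record are all assumptions made for illustration; the released benchmark code may be organised differently.

```python
import subprocess
from abc import ABC, abstractmethod


class ClawAdapter(ABC):
    """Lifecycle hooks named in the summary; every signature is assumed."""

    @abstractmethod
    def get_docker_args(self) -> list: ...

    @abstractmethod
    def create_agent(self, instance_id: str) -> str: ...

    @abstractmethod
    def send_task(self, agent_id: str, prompt: str, timeout_s: int) -> None: ...

    @abstractmethod
    def backup_session(self, agent_id: str) -> dict: ...

    @abstractmethod
    def delete_agent(self, agent_id: str) -> None: ...


def git_diff_patch(workspace: str = "/testbed") -> str:
    """Collect the candidate patch as the working-tree diff of the workspace."""
    result = subprocess.run(
        ["git", "diff"], cwd=workspace, capture_output=True, text=True, check=True
    )
    return result.stdout


def run_instance(adapter, instance_id, prompt, timeout_s, extract_patch=git_diff_patch):
    """Drive one instance through the assumed lifecycle and return a record."""
    docker_args = adapter.get_docker_args()  # a real runner would pass these to Docker
    agent_id = adapter.create_agent(instance_id)
    try:
        adapter.send_task(agent_id, prompt, timeout_s)
        session = adapter.backup_session(agent_id)
        patch = extract_patch()
    finally:
        adapter.delete_agent(agent_id)  # always clean up, even on failure
    return {
        "instance_id": instance_id,
        "model_patch": patch,  # key name borrowed from SWE-bench predictions
        "docker_args": docker_args,
        "session": session,
    }


class RecordingAdapter(ClawAdapter):
    """Toy adapter that records the call order; no real agent is started."""

    def __init__(self):
        self.calls = []

    def get_docker_args(self):
        self.calls.append("get_docker_args")
        return ["--network", "none"]

    def create_agent(self, instance_id):
        self.calls.append("create_agent")
        return "agent-" + instance_id

    def send_task(self, agent_id, prompt, timeout_s):
        self.calls.append("send_task")

    def backup_session(self, agent_id):
        self.calls.append("backup_session")
        return {"agent_id": agent_id}

    def delete_agent(self, agent_id):
        self.calls.append("delete_agent")


adapter = RecordingAdapter()
# A stub replaces the real `git diff` so the demo runs without a repository.
record = run_instance(
    adapter, "demo-1", "Fix the failing test.", 60,
    extract_patch=lambda: "diff --git a/x b/x\n",
)
print(adapter.calls)
```

Keeping the prompt, timeout, and patch extraction in the shared driver rather than in each adapter is one way to hold those settings fixed while only the harness varies, which is the comparison the benchmark is built around.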