---
source_url: https://arxiv.org/abs/2606.12344v1
ingested: 2026-06-15
arxiv_id: 2606.12344v1
sha256: TBD
---
# Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
**Authors:** Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang
**Affiliations:** TokenRhythm Technologies, Infinigence AI, City University of Hong Kong, SEE Fund, Peking University, Shanghai Jiaotong University, Beijing Jiaotong University, Tsinghua University
**arXiv:** 2606.12344v1 | **Date:** 2026-06-10 | **Categories:** cs.LG, cs.CL
**Resources:** https://github.com/opensquilla/claw-swe-bench | https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench
## Abstract
General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator.
The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite, an 80-instance subset for faster validation, selected by a cost-aware, rank-aware procedure over 17 calibration columns.
Key findings:
- OpenClaw with minimal direct-diff adapter: 19.1% Pass@1
- OpenClaw with full adapter: 73.4% Pass@1 (same GLM 5.1 backbone)
- Model choice changes Pass@1 by 29.4 percentage points (pp); harness choice changes it by 27.4 pp
- Systems with similar accuracy can differ substantially in total API cost
- Claw-SWE-Bench treats harness and cost accounting as first-class evaluation axes
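
The point that similarly accurate systems can differ widely in cost is easiest to see as a Pareto frontier over (Pass@1, total API cost). The sketch below is one minimal way to compute such a frontier; it is not the paper's code. The `RunResult` type, the `pareto_frontier` helper, the system names, and every number in `RUNS` are illustrative placeholders, not results from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    name: str         # hypothetical harness + model label
    pass_at_1: float  # Pass@1 in percent (higher is better)
    cost_usd: float   # total API cost (lower is better)


def pareto_frontier(results):
    """Return the runs that no other run dominates, sorted by cost.

    Run `o` dominates run `r` when it is at least as accurate and at
    least as cheap, and strictly better on one of the two axes.
    """
    frontier = []
    for r in results:
        dominated = any(
            o.pass_at_1 >= r.pass_at_1
            and o.cost_usd <= r.cost_usd
            and (o.pass_at_1 > r.pass_at_1 or o.cost_usd < r.cost_usd)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_usd)


# Made-up numbers for illustration only; NOT taken from the paper.
RUNS = [
    RunResult("system-a", pass_at_1=70.0, cost_usd=120.0),
    RunResult("system-b", pass_at_1=69.5, cost_usd=40.0),
    RunResult("system-c", pass_at_1=50.0, cost_usd=60.0),
    RunResult("system-d", pass_at_1=30.0, cost_usd=10.0),
]

if __name__ == "__main__":
    for run in pareto_frontier(RUNS):
        print(f"{run.name}: {run.pass_at_1:.1f}% at ${run.cost_usd:.2f}")
```

In this toy data, `system-a` and `system-b` are within half a point of each other on accuracy but differ threefold in cost, and both stay on the frontier. `system-c` drops out because `system-b` is both cheaper and more accurate.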
## Key Concepts
- Agent harness (claw) as controlled experimental variable
- Adapter protocol: lifecycle methods (create_agent, send_task, backup_session, delete_agent, get_docker_args)
- Full adapter vs bare adapter design
- Cost-aware benchmarking: Pass@1 + total API cost + wall-clock duration + cache hit rate
- Pareto frontier of accuracy vs cost
- Claw-SWE-Bench Lite: 80-instance cost-aware rank-aware subset
- Future-commit cleanup for fair evaluation
- Patch-based evaluation contract (git diff from /testbed)
- Harness × model interaction effects
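
The adapter protocol in the list above names five lifecycle methods. A rough sketch of how such an interface could look in Python follows. Only the method names come from the source. The argument lists, return types, call order in `run_instance`, and the `ClawAdapter` / `RecordingAdapter` classes are assumptions made for illustration and may not match the released code.

```python
from abc import ABC, abstractmethod


class ClawAdapter(ABC):
    """Hypothetical interface; only the method names come from the paper."""

    @abstractmethod
    def get_docker_args(self) -> list[str]:
        """Extra container arguments the harness needs (assumed shape)."""

    @abstractmethod
    def create_agent(self, instance_id: str) -> str:
        """Start an agent for one benchmark instance and return its id."""

    @abstractmethod
    def send_task(self, agent_id: str, prompt: str) -> None:
        """Deliver the fixed task prompt to the agent."""

    @abstractmethod
    def backup_session(self, agent_id: str) -> dict:
        """Save the session (logs, usage) before teardown."""

    @abstractmethod
    def delete_agent(self, agent_id: str) -> None:
        """Tear the agent down so the next instance starts clean."""


class RecordingAdapter(ClawAdapter):
    """Toy adapter that only records which lifecycle calls were made."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def get_docker_args(self) -> list[str]:
        self.calls.append("get_docker_args")
        return []

    def create_agent(self, instance_id: str) -> str:
        self.calls.append("create_agent")
        return f"agent-{instance_id}"

    def send_task(self, agent_id: str, prompt: str) -> None:
        self.calls.append("send_task")

    def backup_session(self, agent_id: str) -> dict:
        self.calls.append("backup_session")
        return {"agent_id": agent_id}

    def delete_agent(self, agent_id: str) -> None:
        self.calls.append("delete_agent")


def run_instance(adapter: ClawAdapter, instance_id: str, prompt: str) -> dict:
    """Drive one instance through the lifecycle (assumed call order)."""
    adapter.get_docker_args()
    agent_id = adapter.create_agent(instance_id)
    try:
        adapter.send_task(agent_id, prompt)
        return adapter.backup_session(agent_id)
    finally:
        # Always clean up, even if the task or the backup fails.
        adapter.delete_agent(agent_id)
```

Keeping the teardown in a `finally` block is a design choice of this sketch: it keeps one failed instance from leaking state into the next, which fits the clean-workspace requirement the benchmark describes.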