20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/concepts/pareto-frontier-evaluation.md
+++ b/concepts/pareto-frontier-evaluation.md
@@ -0,0 +1,33 @@
+---
+title: "Pareto Frontier Evaluation"
+created: 2026-06-15
+updated: 2026-06-15
+type: concept
+tags: [benchmark, evaluation, cost]
+sources: [raw/papers/zheng-claw-swe-bench-2026.md]
+---
+
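+# Pareto Frontier Evaluation
+
+## Definition
+
+Pareto frontier evaluation places agent systems on a two-dimensional **accuracy-cost** plane and uses the Pareto frontier (the curve connecting all non-dominated operating points) to identify the system combinations that are optimal under multiple objectives. An operating point is dominated when another operating point is no worse on both accuracy and cost and strictly better on at least one of them.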
+
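+## Application in Claw-SWE-Bench
+
+In the 350-instance sweep across five claws × two models, each point represents one claw-model combination's Pass@1 and total API cost over the full evaluation. The Pareto frontier shows:
+
+- **Accuracy and cost are not collinear**: higher accuracy does not necessarily mean higher cost
+- OpenClaw × DeepSeek-V4 Flash (70.3%, $8.2) and OpenClaw × GLM 5.1 (73.4%, $277) both sit near the frontier, yet at very different positions: a 3.1-point accuracy gap at roughly 34 times the cost
+- Reporting only Resolved Rate hides cost information and can mislead leaderboard interpretation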
+
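+## Pareto Frontier vs. Single Metric
+
+Problems with a single metric (e.g., Pass@1 alone):
+- It cannot distinguish a "high accuracy, high cost" system from a "slightly lower accuracy, very low cost" one
+- It is unfriendly to resource-constrained researchers (small teams, academic groups)
+- It cannot answer "how much is this accuracy gain worth?"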
+
+## References
+- [[claw-swe-bench|Claw-SWE-Bench paper]]
+- [[cost-aware-benchmarking|Cost-aware benchmarking]]
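
Reviewer sketch, not part of the commit: the dominance rule in the Definition section of the hunk above could be computed roughly as follows. This is a minimal illustration under stated assumptions, not the method used by Claw-SWE-Bench. The names `OperatingPoint`, `dominates`, and `pareto_frontier` are hypothetical, and only the first two data points come from the note; the others are invented to exercise the logic.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    name: str
    accuracy: float  # Pass@1 in percent, higher is better
    cost: float      # total API cost in USD, lower is better


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(points: list[OperatingPoint]) -> list[OperatingPoint]:
    """Return the non-dominated points, sorted by ascending cost."""
    frontier = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(frontier, key=lambda p: p.cost)


points = [
    # The two operating points quoted in the note.
    OperatingPoint("OpenClaw x DeepSeek-V4 Flash", 70.3, 8.2),
    OperatingPoint("OpenClaw x GLM 5.1", 73.4, 277.0),
    # Hypothetical points, invented purely for illustration.
    OperatingPoint("hypothetical-A", 65.0, 50.0),  # dominated by the Flash point
    OperatingPoint("hypothetical-B", 60.0, 5.0),   # cheapest, so non-dominated
]

frontier = pareto_frontier(points)
for p in frontier:
    print(f"{p.name}: {p.accuracy}% at ${p.cost}")
```

With these inputs, `hypothetical-A` drops out because the Flash point is both more accurate and cheaper, while the two quoted points stay on the frontier. The pairwise check is quadratic in the number of points, which is a reasonable trade-off for a sweep of around ten combinations.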