20260617: 914 pages so far

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions
--- a/concepts/critpt.md
+++ b/concepts/critpt.md
@@ -0,0 +1,51 @@
+---
+title: "CritPt (Critical Point Benchmark)"
+created: 2026-06-14
+updated: 2026-06-14
+type: concept
+tags: [benchmark, physics, reasoning, frontier]
+sources: [raw/papers/procedural-skills-to-strategy-genes-2026.md]
+---
+
+# CritPt (Critical Point Benchmark)
+
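+A frontier physics research benchmark built by Zhu et al. (2025) to probe the "critical point" of AI reasoning. Wang et al. (2026) use it as an external validation benchmark for their gene-evolution system.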
+
+## Role in Skills to Strategy Genes
+
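+CritPt serves as the external validation for the [[evolution-probe|evolution probe]]. The question it tests: does a gene-evolved system show substantial improvement on a **challenging benchmark outside its training distribution**?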
+
+## Key Results
+
+Two gene-based evolution systems, built on two versions of Evolver:
+
+| System | Base model | Base score | After evolution | Gain |
+|------|---------|--------|--------|------|
+| Evolver 2026-02-16 | Gemini 3 Pro Preview | 9.1% | 18.57% | +9.47pp |
+| Evolver 2026-03-26 | Gemini 3.1 Pro Preview | 17.7% | 27.14% | +9.44pp |
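+
+The gains in the table above are absolute differences in percentage points (pp), that is, the evolved score minus the base score. The Python snippet below checks that arithmetic. It also reports the relative gain, which is a derived figure the source does not state.
+
+```python
+# Scores (%) from the table above: (system, base score, evolved score).
+RESULTS = [
+    ("Evolver 2026-02-16", 9.1, 18.57),
+    ("Evolver 2026-03-26", 17.7, 27.14),
+]
+
+
+def gains(base: float, evolved: float) -> tuple[float, float]:
+    """Return (absolute gain in pp, relative gain in %), rounded to 2 decimals."""
+    absolute = round(evolved - base, 2)
+    relative = round((evolved - base) / base * 100, 2)
+    return absolute, relative
+
+
+for system, base, evolved in RESULTS:
+    pp, rel = gains(base, evolved)
+    print(f"{system}: +{pp}pp ({rel}% relative)")
+```
+
+The absolute gains are nearly identical. In relative terms, however, the earlier system roughly doubles its base score, while the later one improves it by about half.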
+
+### Comparison with Other Models
+
+- GPT-5.4 Pro (xhigh): 30.0%
+- GPT-5.4 (xhigh): 27.14%
+- Evolver 2026-03-26: 27.14%
+- Gemini 3 Deep Think: 25.7%
+- Claude Opus 4.6 (max): 12.6%
+- Claude Sonnet 4.5 (max): 3.1%
+
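+The gene-evolved system (Evolver 2026-03-26) ties GPT-5.4 (xhigh) at 27.14%. It also outperforms its base model, Gemini 3.1 Pro Preview, by 9.44pp.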
+
+## Evolution Trajectories
+
+### Version A (2026-02-16): Memory-Grounded Evolution
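+Most of the gain comes from consolidating prior failures, execution traces, and correction experience into reusable control units. Representative gene: `gene_gep_repair_from_errors`, a structured repair loop triggered by error/exception/failed signals.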
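+
+The source says only that `gene_gep_repair_from_errors` is a structured repair loop triggered by error/exception/failed signals. The sketch below illustrates that kind of signal-triggered activation under stated assumptions. The `Gene` record, its `triggers` field, and `select_genes` are all hypothetical. Matching is a case-insensitive, whole-word keyword search over an execution log. This is not the Evolver or GEP interface.
+
+```python
+import re
+from dataclasses import dataclass, field
+
+
+@dataclass
+class Gene:
+    """Hypothetical gene record: an ID plus the signal keywords that activate it."""
+    gene_id: str
+    triggers: list[str] = field(default_factory=list)
+
+    def matches(self, log: str) -> bool:
+        # Fire if any trigger keyword appears as a whole word (case-insensitive).
+        return any(
+            re.search(rf"\b{re.escape(t)}\b", log, re.IGNORECASE)
+            for t in self.triggers
+        )
+
+
+# Trigger keywords taken from the signals named in the note.
+REPAIR = Gene("gene_gep_repair_from_errors", ["error", "exception", "failed"])
+
+
+def select_genes(genes: list[Gene], log: str) -> list[str]:
+    """Return the IDs of all genes whose trigger signals occur in the log."""
+    return [g.gene_id for g in genes if g.matches(log)]
+
+
+print(select_genes([REPAIR], "Step 3 failed: ValueError in solver"))
+```
+
+Whole-word matching is a conservative choice. It does not fire on identifiers such as `ValueError`, whereas plain substring matching would.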
+
+### Version B (2026-03-26): Exploration-Enhanced Evolution
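+210 gene slots, 36 unique gene IDs. arXiv-derived genes dominate, with 148 selections. The most representative is `gene_topic_hamiltonian_inverse_design` (25 selections), which preserves a compact, task-oriented solution process as a reusable asset.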
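+
+The Version B figures (slots, unique gene IDs, selections per source, most-selected gene) are the kind of summary a simple tally produces. The sketch below rests on several assumptions. Each slot is assumed to hold one `(gene_id, source)` pair. The source label `"arxiv"` is assumed to identify arXiv-derived genes. The label `"memory"` and the ID `gene_topic_example` are hypothetical. The toy data is illustrative and is not taken from the paper's logs.
+
+```python
+from collections import Counter
+
+
+def summarize(slots: list[tuple[str, str]]) -> dict:
+    """Summarize gene slots given as (gene_id, source) pairs."""
+    ids = Counter(gene_id for gene_id, _ in slots)
+    sources = Counter(source for _, source in slots)
+    top = ids.most_common(1)  # [(gene_id, count)], or [] when there are no slots
+    return {
+        "slots": len(slots),
+        "unique_ids": len(ids),
+        "by_source": dict(sources),
+        "top_gene": top[0] if top else None,
+    }
+
+
+# Toy data with hypothetical source labels and one hypothetical gene ID.
+toy = [
+    ("gene_topic_hamiltonian_inverse_design", "arxiv"),
+    ("gene_topic_hamiltonian_inverse_design", "arxiv"),
+    ("gene_gep_repair_from_errors", "memory"),
+    ("gene_topic_example", "arxiv"),
+]
+print(summarize(toy))
+```
+
+Apply this reading to the reported counts, assuming each selection fills exactly one slot. arXiv-derived genes would then occupy 148 / 210 ≈ 70.5% of the slots.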
+
+## References
+
+- [[procedural-skills-to-strategy-genes|Skills to Strategy Genes]]: uses CritPt to validate evolution
+- [[evolution-probe|evolution probe]]: the probe the CritPt experiments belong to
+- [[strategy-gene|strategy gene]]: the unit of evolution
+- [CritPt paper](https://arxiv.org/abs/2509.26574)