20260601

2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions
--- a/raw/articles/chensizhou-mini-agent-harness-2026.md
+++ b/raw/articles/chensizhou-mini-agent-harness-2026.md
@@ -0,0 +1,95 @@
+---
+title: "Building a Mini Agent Harness from Scratch"
+author: "陈思州 (Datawhale)"
+source: "WeChat official account Datawhale干货"
+date: "2026-05"
+url: "https://mp.weixin.qq.com/s/yVFQej3dFk9KHv6J2u6Lew"
+type: "article"
+---
+
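+# Building a Mini Agent Harness from Scratch
+
+**Author**: 陈思州, Datawhale member  
+**Source**: Datawhale干货 (WeChat official account)
+
+## Full Text
+
+When I wrote about Agent evaluation earlier, I noted that evaluating an Agent means looking beyond the final answer: which tools it used, what results it got back, and whether it completed the task as required. How do we record all of that reliably? That is what a harness is for.
+
+One current view is that **Agent = model + harness**.
+
+I think of a harness as a small environment that makes an agentic model runnable, recordable, and gradable. It does not have to be complex at the start; as long as it ties together the task, the tools, the execution process, and the grading result, it is already valuable.
+
+This article is organized around 4 questions:
+1. What problem does the most minimal harness solve?
+2. Which modules does it need at minimum?
+3. How can an eval case be written?
+4. What references exist in public materials?
+
+### What problem does the most minimal harness solve
+
+If you only test an Agent by hand, it is easy to see nothing but the final reply. Suppose the user asks "Please determine whether this project supports a plugin system" and the Agent answers "The current README has no mention of a plugin system, so we cannot confirm support."
+
+That sounds reasonable, but we still need to know: did it actually read the README? Did it read the wrong file? Did it call irrelevant tools? Did it put information into the answer that the tool results never contained?
+
+This is the problem a mini harness is meant to solve.
+
+It places the task in a fixed environment, has the Agent complete it with a specified set of tools, records the execution process, and finally uses a grader to judge the result.
+
+What we see is then no longer a single reply but a complete record: what the task was, what the environment contained, which tools the Agent called, what the tools returned, and why the run was finally judged a success or a failure.
+
+### Which modules a mini harness needs at minimum
+
+The minimal structure breaks down into 5 modules:
+
+- **Task** (task input): the task itself
+- **Environment** (operable environment): a code repository, a set of files, etc.
+- **Tools** (tool interface): read_file, list_files, run_tests, etc.
+- **Trace** (execution record): the tool call, arguments, and return value of each step
+- **Grader** (scoring component): rules or test scripts that judge the result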
+
+### How an eval case can be written
+
+```json
+{
+  "id": "case_001",
+  "task": "Determine whether the project supports a plugin system",
+  "environment": {
+    "files": {
+      "README.md": "This project supports local startup, basic login, and configuration management.",
+      "config.md": "Configuration options include port, theme, and log_level."
+    }
+  },
+  "tools": ["list_files", "read_file"],
+  "grader": {
+    "must_read": ["README.md"],
+    "answer_should_include": "cannot confirm plugin system support",
+    "answer_should_not_include": "supports a plugin system"
+  }
+}
+```
+
+After the run, record the trace:
+
+```json
+{
+  "case_id": "case_001",
+  "trace": [
+    {"tool": "list_files", "arguments": {"path": "."}, "result": ["README.md", "config.md"]},
+    {"tool": "read_file", "arguments": {"path": "README.md"}, "result": "This project supports local startup, basic login, and configuration management."}
+  ],
+  "answer": "The current README has no mention of a plugin system, so we cannot confirm plugin system support.",
+  "grade": {"success": true, "reason": "Read the README, and the answer does not go beyond the file contents."}
+}
+```
+
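+### References in public materials
+
+- **Anthropic Agent Evals**: distinguishes the eval harness from the agent harness, and stresses that what gets evaluated is the combined effect of model + harness
+- **SWE-agent**: introduces the Agent-Computer Interface (ACI), showing how external interface design affects Agent performance
+- **Terminal-Bench**: the task structure consists of an instruction, an isolated environment, and test scripts
+- **SWE-bench**: a typical evaluation flow: real issue → patch → test in the environment
+
+### Key takeaway
+
+A mini Agent harness does not need to be a full platform from the start. A first version only has to tie together the task, environment, tools, execution record, and grader, and it can already help us see where exactly the Agent goes wrong. With this structure in place, we are no longer just "trying out whether the Agent works well"; we can analyze whether the problem lies in task understanding, tool selection, argument filling, result reading, redundant steps, or grading rules that are themselves unclear.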
--- a/raw/articles/claw-eval-2026.md
+++ b/raw/articles/claw-eval-2026.md
@@ -0,0 +1,51 @@
+---
+source_url: https://mp.weixin.qq.com/s/4oY35c9SmweJ4Vi0KztVOA
+ingested: 2026-05-23
+sha256: unknown
+---
+
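+# Claw-Eval: An End-to-End Evaluation Framework for Autonomous Agents
+
+Source: ModelScope WeChat official account
+
+## Introduction
+
+As large models move from "answering questions" to "executing tasks", Agent evaluation is becoming a key direction in capability assessment. Claw-Eval looks not only at whether a task was completed but at how it was completed: whether the process is traceable, whether behavior is compliant, and whether the Agent can recover after an error. It uses 300 human-verified tasks to evaluate 14 frontier models along three dimensions: completion, safety, and robustness.
+
+## Open-source links
+
+- Dataset: https://modelscope.cn/datasets/claw-eval/Claw-Eval
+- Leaderboard: https://claw-eval.github.io/#/
+- GitHub: https://github.com/claw-eval/claw-eval
+
+## Technical framework
+
+- Lightweight runtime layer: a transparent, auditable, reproducible "greatest common divisor" runtime base
+- Setup → Execution → Judge lifecycle: fully records model behavior, tool calls, server-side logs, and environment snapshots
+- Real-world tasks: service orchestration, multimodal understanding and generation, multi-turn professional dialogue
+
+## Task design
+
+300 human-verified tasks covering 9 fine-grained types in three task groups:
+- **General service tasks**: queries, scheduling, cross-service collaboration, data retrieval, financial compliance, operations workflows
+- **Multimodal tasks**: video, documents, images, and code-generated visual artifacts
+- **Multi-turn professional dialogue tasks**: consulting, analysis, and decision-making scenarios
+
+## Scoring system (three dimensions)
+
+- **Completion**: whether the task was completed and the result meets the requirements
+- **Safety**: whether execution respected the constraints and avoided behavior that should not occur
+- **Robustness**: whether the Agent can recover and continue when facing API failures, service latency, or transient errors
+
+Both Pass@3 (at least one success in three attempts, approximating the capability ceiling) and Pass^3 (all three attempts succeed, approximating the reliability floor) are reported.
+
+## Three key findings
+
+1. **Conversation traces alone are unreliable**: the LLM Judge missed 44% of safety violations and 13% of robustness issues, so server-side logs and environment snapshots are needed
+2. **Capability is not stability**: after error injection, Pass^3 drops by up to 24 percentage points
+3. **Agent capability is multidimensional**: no single model leads across all task types; the highest multimodal Pass^3 is only 25.7%
+
+## Additional findings
+
+- Question quality (rather than quantity) explains 76% of the variance in Pass^3 performance
+- A good Agent does not just ask follow-up questions; it knows what most needs to be asked right now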
--- a/raw/articles/distributed-agent-cache-sync-2026.md
+++ b/raw/articles/distributed-agent-cache-sync-2026.md
@@ -0,0 +1,27 @@
+---
+title: "Distributed Agent Cache Synchronization"
+created: 2026-05-29
+type: article-raw
+source: "WeChat official account"
+url: "https://mp.weixin.qq.com/s/MUWV7eug14bktUMlqsxfQw"
+tags: ["distributed-systems", "prompt-caching", "quant-trading", "agent", "redis", "rdma"]
+---
+
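+# Distributed Agent Cache Synchronization
+
+**Source**: WeChat official account
+**URL**: https://mp.weixin.qq.com/s/MUWV7eug14bktUMlqsxfQw
+**Ingested**: 2026-05-29
+
+## Overview
+
+This article is an in-depth engineering-practice chapter from a series on LLM + quantitative trading, covering Prompt Caching synchronization in distributed environments. Using a high-frequency quant system as the setting, it systematically describes how to upgrade a single-machine Prompt Caching mechanism into a distributed cache synchronization system spanning physical nodes.
+
+## Key content
+
+1. **Heterogeneous multi-machine cache conflicts in a distributed architecture**: the cost of cross-machine cold starts (150k-token retransmission + second-level recomputation), cache fragmentation across model providers
+2. **Distributed token-state routing over a Redis backbone**: a global context hash tree (SHA-256 four-layer composite key), physical implementation of the Cache_Routing_Table
+3. **Cross-machine proactive warm-up and pipelined preloading**: triggering on predicted trading critical points, the three-step Shadow Calling method (prefix topology synthesis → asynchronous shadow call → status flagging)
+4. **Data consistency governance**: optimistic locking with context version numbers, a TTL eviction policy driven by the trade lifecycle
+5. **A seamless bridge between C++ IPC and the distributed network**: an RDMA bypass-network handle distribution architecture
+6. **Chaos engineering**: cache-avalanche degradation and circuit breaking, Context Pruning, an observability console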
--- a/raw/articles/lyu-model-harness-evolution-2026.md
+++ b/raw/articles/lyu-model-harness-evolution-2026.md
@@ -0,0 +1,33 @@
+---
+title: "The Evolving Relationship Between Model and Harness: From AutoHarness to Heuristic Learning"
+created: 2026-05-29
+type: article-raw
+source: "WeChat official account"
+author: "吕明"
+url: "https://mp.weixin.qq.com/s/PglkqhlSoI7LEOb3AOHl8g"
+tags: ["model", "harness", "agent", "genai", "heuristic-learning", "autoharness"]
+---
+
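+# The Evolving Relationship Between Model and Harness
+
+**Author**: 吕明
+**Source**: WeChat official account
+**URL**: https://mp.weixin.qq.com/s/PglkqhlSoI7LEOb3AOHl8g
+**Ingested**: 2026-05-29
+
+## Overview
+
+This article is 吕明's in-depth reflection on how the relationship between Model and Harness is evolving. Taking Google DeepMind's AutoHarness paper and the Heuristic Learning article by OpenAI's 翁家翌 as its starting points, it discusses:
+
+1. Three essential differences between GenAI and earlier AI waves: generative (Generative), general-purpose (General), and unified (Unification)
+2. The blurry and shifting boundary between "policy algorithm" and "engineering constraint" in the Model-Harness relationship
+3. A close reading of AutoHarness's three Harness modes (Action Filter → Action Verifier → Policy)
+4. Heuristic Learning as a new learning paradigm offered as an alternative to gradient descent
+5. The compiled-AI paradigm, and Harness Engineering as an independent field of engineering practice
+6. A cited view from a recent Demis Hassabis interview: Agents are only just getting started and still lack continual learning
+
+## Key quotes
+
+- "Perhaps the essence of the world is that it is controlled and run by a combination of generalizing policies + abstract constraints"
+- "Performance gains cannot rely solely on model parameter scale; more attention should also go to the Harness layer of the Agent Architecture"
+- "Some forms of experience or knowledge can not only be 'trained' into parameters but also more elegantly 'programmed' into a maintainable, evolvable software system"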
--- a/raw/articles/lyu-skillopt-deep-dive-2026.md
+++ b/raw/articles/lyu-skillopt-deep-dive-2026.md
@@ -0,0 +1,28 @@
+---
+title: "SkillOpt Deep Dive: Text-Space Optimization and Engineering the Continued Evolve of Self-Evolving Agents"
+created: 2026-05-29
+type: article-raw
+source: "WeChat official account"
+author: "吕明"
+url: "https://mp.weixin.qq.com/s/s__fdyXQG932SavQeeugcw"
+tags: ["skillopt", "text-space-optimization", "self-evolution", "harness", "model-harness"]
+---
+
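+# SkillOpt Deep Dive
+
+**Author**: 吕明
+**Source**: WeChat official account
+**URL**: https://mp.weixin.qq.com/s/s__fdyXQG932SavQeeugcw
+**Ingested**: 2026-05-29
+
+## Overview
+
+This article is 吕明's in-depth philosophical reading of Microsoft's SkillOpt paper (about 12,000 Chinese characters). Opening with "when a Skill file gets its own backpropagation", it systematically analyzes the deep divide between text-space optimization and gradient descent in parameter space, and sketches an engineering blueprint for self-evolving Agents.
+
+## Key content
+
+1. **Surface isomorphism and deep divergence**: continuous gradient descent (local first-order, analytic chain rule, vector-space metric) vs discrete text optimization (global causal reasoning, empirical validation, no natural metric)
+2. **Philosophical metaphor**: British empiricism (parameters passively shaped by the Loss) vs continental rationalism (the Optimizer actively performing rational deduction)
+3. **Three-layer decoupled design**: frozen Agent + independent Optimizer + controlled accept/reject
+4. **Full-stack blueprint**: Skill Registry → Validation Suite → Evolution Scheduler → Cross-Model Translator → Human-in-the-Loop
+5. **"Controlled autonomy"**: humans set the goal (validation set) and the boundaries (edit constraints), and the Agent searches for improvements autonomously within that framework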
--- a/raw/articles/tps-time-series-augmentation-survey-2026.md
+++ b/raw/articles/tps-time-series-augmentation-survey-2026.md
@@ -0,0 +1,64 @@
+---
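+title: "A Survey of Augmentation Methods for Time-Series Forecasting: From the Frequency Domain to TPS"
+author: "Sai Nitesh Palamakula (translated by 于腾凯)"
+source: "DeepHub IMBA / 数据派THU (WeChat official account)"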
+date: "2026-05"
+url: "https://mp.weixin.qq.com/s/hPvx3OflUva1olME9F8FoA"
+type: "article"
+---
+
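+# A Survey of Augmentation Methods for Time-Series Forecasting: From the Frequency Domain to Temporal Patch Shuffle
+
+**Source**: DeepHub IMBA / 数据派THU
+
+## Core problem
+
+Augmentation for time-series forecasting differs fundamentally from augmentation for classification: the prediction target is a continuous signal, not a discrete label.
+Classic classification augmentations (jittering, scaling, warping) break the continuity between the look-back window and the forecast horizon,
+leading to input-target inconsistency.
+
+**Core principle**: augmentation must be applied to the full concatenated sequence s = x ∥ y, which is then split back into input and target, to ensure data-label consistency.
+
+## Taxonomy of methods
+
+### Frequency-based
+- **RobustTAD**: DFT → amplitude/phase perturbation → IDFT
+- **FreqMask**: FFT → binary mask zeroing selected frequencies → IFFT
+- **FreqMix**: FFT → mixing the spectra of two series → IFFT
+- **WaveMask**: DWT decomposition → selective masking of wavelet coefficients at each level → inverse DWT
+- **WaveMix**: DWT decomposition → cross-mixing the wavelet coefficients of two series → inverse DWT
+- **Dominant Shuffle**: FFT → shuffle the top-k dominant frequencies → IFFT
+
+### Decomposition-based
+- **STAug**: EMD → IMF → mixup-style recombination (high memory overhead, limited on large datasets)
+
+### Other
+- **wDBA**: time-series averaging under DTW alignment
+- **MBB**: STL decomposition + residual bootstrap
+- **Upsample**: stretching a local segment via linear interpolation
+
+### Patch-based
+- **TPS (Temporal Patch Shuffle)**: overlapping patches → variance scoring → selective shuffle → reconstruction by averaging overlapping regions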
+
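+## TPS core pipeline
+
+1. **Concatenate**: x ∥ y → s (enforces data-label consistency)
+2. **Temporal Patching**: with patch length p and stride s (the stride, not the concatenated sequence s from step 1), extract overlapping patches
+3. **Variance scoring**: compute each patch's variance across channels
+4. **Selective Shuffle**: the fraction α of patches with the lowest variance is randomly permuted
+5. **Reconstruct**: average the overlapping regions to smooth the discontinuities introduced by the shuffle
+6. **Split**: s̃ → x̃, ỹ
+
+## Key ablation findings
+
+1. **Data-label consistency**: the decisive factor; removing it causes the largest performance drop of any single ablation
+2. **Overlapping patches**: switching to non-overlapping patches causes clear degradation; overlap is what preserves local temporal structure
+3. **Variance ordering**: a modest gain; it becomes meaningless at α=1.0
+4. **Time domain beats frequency domain**: patch operations applied after an FFT degrade performance
+5. **Shuffle ratio**: 0.7-1.0 is optimal
+
+## Experimental results
+
+- **Long-term forecasting**: 9 datasets, 5 backbones (TSMixer, DLinear, PatchTST, TiDE, LightTS); TPS is best on all of them
+- **Short-term traffic forecasting**: 4 PeMS datasets (PatchTST), MSE improvement of 2.34%-7.14%
+- **Extension to classification**: UCR + UEA benchmarks, accuracy gains of 0.50% and 1.10% respectively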
--- a/raw/articles/ultradata-l3-open-source-2026.md
+++ b/raw/articles/ultradata-l3-open-source-2026.md
@@ -0,0 +1,28 @@
+---
+title: "UltraData: ModelBest (面壁智能) Open-Sources L3 Datasets and the L0-L4 Tiered Data Governance Framework"
+created: 2026-05-29
+type: article-raw
+source: "WeChat official account (Datawhale)"
+author: "ModelBest (面壁智能) team"
+url: "https://mp.weixin.qq.com/s/5jV2jYuXJloKX5IWCzrSpw"
+tags: ["data-governance", "pretraining", "synthetic-data", "sft", "open-source", "minicpm"]
+---
+
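+# UltraData: ModelBest (面壁智能) Open-Sources L3 Datasets and the L0-L4 Tiered Data Governance Framework
+
+**Author**: ModelBest (面壁智能) team
+**Source**: Datawhale (WeChat official account)
+**URL**: https://mp.weixin.qq.com/s/5jV2jYuXJloKX5IWCzrSpw
+**Ingested**: 2026-05-29
+
+## Overview
+
+In May 2026, ModelBest (面壁智能), together with Tsinghua University and the OpenBMB open-source community, officially released two L3-tier datasets in the UltraData series: Ultra-FineWeb-L3 and UltraData-SFT-2605. Both are built on the L0-L4 tiered data governance framework and were validated end to end in the training of MiniCPM5-1B.
+
+## Key content
+
+1. **L0-L4 tiered governance**: a five-tier system running from raw web pages (L0) to RAG-orchestrated data (L4), matching data tiers to training stages
+2. **Ultra-FineWeb-L3**: the world's largest Chinese synthetic pre-training dataset (600B Tokens), turning "readable text" into "learnable data"
+3. **UltraData-SFT-2605**: the first open-source release in China of SFT data at the ten-million scale, with full coverage of "deep thinking / non-thinking" modes
+4. **MiniCPM5-1B**: tops the Artificial Analysis leaderboard (score 17.9); only 0.5GB at INT4
+5. **End-to-end transparency**: releases the full governance toolchain, including Query filtering, Answer verification, and evaluation decontamination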
--- a/raw/papers/agarwal-bayesian-attention-geometry-2026.md
+++ b/raw/papers/agarwal-bayesian-attention-geometry-2026.md
@@ -0,0 +1,61 @@
+---
+title: "The Bayesian Geometry of Transformer Attention"
+authors: "Naman Agarwal, Siddhartha R. Dalal, Vishal Misra"
+arxiv: "2512.22471"
+venue: "arXiv (cs.LG)"
+date: "2026-05"
+type: "paper"
+series: "Bayesian Attention Trilogy, Paper I"
+---
+
+# The Bayesian Geometry of Transformer Attention
+
+**Paper I of the Bayesian Attention Trilogy**
+
+**Authors**: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)
+
+## TL;DR
+
+Small transformers achieve exact Bayesian posteriors (10⁻³–10⁻⁴ bit accuracy) in **Bayesian wind tunnels** — controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs fail by orders of magnitude.
+
+## Core Framework: Bayesian Wind Tunnels
+
+Controlled prediction tasks where:
+1. Analytic posterior is known exactly at each step
+2. Hypothesis space is too large for memorization
+3. In-context prediction requires genuine probabilistic inference
+
+Converts "does it do Bayes?" into a quantitative test: **does the model's predictive entropy match the analytic posterior entropy?**
+
+## Three Inference Primitives
+
+| Primitive | Definition | Required for |
+|-----------|-----------|-------------|
+| Belief Accumulation | Integrating evidence into running posterior | Bijection learning, HMM |
+| Belief Transport | Propagating beliefs through stochastic dynamics | HMM filtering |
+| Random-Access Binding | Retrieving by content, not position | Associative recall |
+
+## Architectural Realizability
+
+| Architecture | Accumulation | Transport | Binding | Status |
+|-------------|:---:|:---:|:---:|--------|
+| Transformer | ✅ | ✅ | ✅ | Full primitive completeness |
+| Mamba (SSM) | ✅ | ✅ | ❌ | SOTA on HMM filtering; fails binding |
+| LSTM | ✅ | ❌ | ❌ | Only static sufficient statistics |
+| MLP | ❌ | ❌ | ❌ | Fails uniformly |
+
+## Key Geometric Findings
+
+- **Orthogonal key bases** in attention heads
+- **Low-dimensional value manifold** parameterized by posterior entropy
+- Mamba's final layer organizes into **5 clusters** — one per HMM hidden state (corner geometry of belief simplex)
+
+## Structural Theorem
+
+> The dominance of transformers in reasoning tasks arises not from scale alone, but from **primitive completeness**: they are the minimal architecture realizing the full set of inference primitives.
+
+## Trilogy Context
+
+- **Paper I** (this): Existence + internal geometry of exact Bayesian inference in transformers
+- **Paper II**: Bayesian geometry arises generically from gradient dynamics under cross-entropy
+- **Paper III**: How primitives compose in partially observed settings (closer to natural language)
--- a/raw/papers/agent-harness-engineering-survey-2026.md
+++ b/raw/papers/agent-harness-engineering-survey-2026.md
@@ -0,0 +1,27 @@
+---
+source_url: user-upload
+ingested: 2026-05-23
+sha256: unknown
+---
+
+# Agent Harness Engineering: A Survey
+
+## Metadata
+- **Authors**: Junjie Li^1,6^*, Xi Xiao^6^*, Yunbei Zhang^5^*, Chen Liu^2^*, Lin Zhao^4, Xiaoying Liao^3, Yingrui Ji^6, Janet Wang^6, Jianyang Gu^7, Yingqiang Ge^9, Weijie Xu^9, Xi Fang^9, Xiang Xu^9, Tianchen Zhao^9, Youngeun Kim^9, Tianyang Wang^6, Jihun Hamm^5, Smita Krishnaswamy^2, Jun Huan^9, Chandan K Reddy^8,9
+- **Institutions**: 1 CMU, 2 Yale, 3 JHU, 4 NEU, 5 Tulane, 6 UAB, 7 OSU, 8 Virginia Tech, 9 Amazon
+- **Venue**: Under review at TMLR (Transactions on Machine Learning Research), 2026
+- **Project Page**: Awesome-Agent-Harness
+
+## Abstract
+
+The rapid deployment of large language model (LLM) agents in production has revealed a recurring pattern: task execution reliability depends less on the underlying model than on the infrastructure layer that wraps it — the **agent execution harness**. This survey provides a practice-grounded, systematic treatment of agent harness engineering, organized around three claims:
+
+1. **Binding-Constraint Thesis**: The agent harness is an independent system layer whose engineering quality drives a large share of real-world reliability
+2. **ETCLOVG Taxonomy**: A seven-layer taxonomy (Execution environment, Tool interface, Context management, Lifecycle/Orchestration, Observability, Verification, Governance) 
+3. **Ecosystem Mapping**: 170+ open-source projects mapped onto this taxonomy
+
+## Key Contributions
+
+- Three-phase engineering evolution: Prompt → Context → Harness Engineering
+- Cross-layer synthesis: Cost-Quality-Speed Trilemma, Capability-Control Tradeoff, Harness Coupling Problem
+- Open-problem agenda spanning harden/scale execution, maintain reliable state, diagnose from traces, standardize handoffs, and adaptive simplification
--- a/raw/papers/agent-harness-engineering-survey-2026.pdf
+++ b/raw/papers/agent-harness-engineering-survey-2026.pdf
--- a/raw/papers/gram-generative-recursive-reasoning-2026.md
+++ b/raw/papers/gram-generative-recursive-reasoning-2026.md
@@ -0,0 +1,23 @@
+---
+source_url: https://arxiv.org/abs/2605.19376
+ingested: 2026-05-23
+sha256: unknown
+---
+
+# Generative Recursive Reasoning
+
+- **Authors**: Junyeob Baek^1*, Mingyu Jo^1*, Minsu Kim^1,2, Mengye Ren^3, Yoshua Bengio^2,4, Sungjin Ahn^1,3†
+- **Institutions**: 1 KAIST, 2 Mila – Québec AI Institute, 3 New York University, 4 Université de Montréal
+- **arXiv**: 2605.19376 (v2, 2026-05-19)
+- **Category**: cs.AI
+- **Project Page**: https://ahn-ml.github.io/gram-website
+
+## Abstract
+
+How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. GRAM turns recursive latent reasoning into probabilistic multi-trajectory computation, treating reasoning as a stochastic latent trajectory that enables multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting both conditional reasoning p_θ(y|x) and unconditional generation p_θ(x). Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks.
+
+## Key Contributions
+
+1. Formulates recursive reasoning as a latent-variable generative process
+2. Introduces width-based inference-time scaling (depth + parallel trajectories)
+3. Empirical evidence on Sudoku-Extreme, ARC-AGI, N-Queens, Graph Coloring, binarized MNIST
--- a/raw/papers/hu-toolcua-2026.md
+++ b/raw/papers/hu-toolcua-2026.md
@@ -0,0 +1,44 @@
+---
+title: "ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents"
+created: 2026-05-12
+type: paper
+source: https://arxiv.org/abs/2605.12481
+code: https://github.com/X-PLUG/ToolCUA
+authors:
+  - Xuhao Hu (Fudan)
+  - Xi Zhang (Alibaba)
+  - Haiyang Xu (Alibaba)
+  - Kyle Qiao (Alibaba)
+  - Jingyi Yang (Fudan)
+  - Xuanjing Huang (Fudan)
+  - Jing Shao (Shanghai AI Lab)
+  - Ming Yan (Alibaba)
+  - Jieping Ye (Alibaba)
+venue: arXiv:2605.12481, 2026
+tags: [computer-use-agents, gui-tool-orchestration, reinforcement-learning, trajectory-optimization]
+---
+
+# ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
+
+**Authors**: Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang, Jing Shao, Ming Yan, Jieping Ye
+
+**Affiliations**: Tongyi Lab, Alibaba Group; Fudan University; Shanghai Artificial Intelligence Laboratory
+
+**arXiv**: 2605.12481 | **Date**: May 12, 2026 | **Code**: https://github.com/X-PLUG/ToolCUA
+
+## Abstract
+
+Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale.
+
+## Key Concepts
+
+- [[computer-use-agents|Computer Use Agents (CUAs)]]
+- [[gui-tool-hybrid-action-space|GUI-Tool Hybrid Action Space]]
+- [[optimal-gui-tool-path-selection]]
+- [[interleaved-gui-tool-trajectory-scaling]]
+- [[tool-bootstrapped-rft]]
+- [[tool-efficient-path-reward]]
+- [[osworld-mcp]]
+- [[next-state-grounding]]
+- [[grpo]]
+- [[agent-computer-interface]]
--- a/raw/papers/kore-knowledge-injection.md
+++ b/raw/papers/kore-knowledge-injection.md
@@ -0,0 +1,39 @@
+---
+title: "KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Controls"
+authors:
+  - Kailin Jiang
+  - Hongbo Jiang
+  - Ning Jiang
+  - Zhi Gao
+  - Jinhe Bi
+  - Yuchen Ren
+  - Bin Li
+  - Yuntao Du
+  - Lei Liu
+  - Qing Li
+date: 2026
+arxiv: "2510.19316"
+venue: "ICML 2026"
+domain: "Multimodal Learning, Knowledge Injection, Continual Learning"
+type: paper
+source: "https://arxiv.org/abs/2510.19316"
+---
+
+# KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Controls
+
+**Authors**: Kailin Jiang, Hongbo Jiang, Ning Jiang, Zhi Gao, Jinhe Bi, Yuchen Ren, Bin Li, Yuntao Du, Lei Liu, Qing Li
+
+**Venue**: ICML 2026
+
+**arXiv**: 2510.19316
+
+## Abstract
+
+KORE is a synergistic method centered around Knowledge-Oriented Controls for injecting new knowledge into LMMs while preserving old knowledge. It implements a two-stage optimization: (1) KORE-AUGMENTATION converts individual knowledge items into structured multi-round dialogues and instruction tasks, building a "knowledge tree" that enables internalization; (2) KORE-CONSTRAINT stores previous knowledge in the covariance matrix of linear layer activations and initializes a LoRA adapter by projecting original weights into the matrix's null space, defining a fine-tuning direction that minimally interferes with previous knowledge.
+
+## Key Contributions
+
+1. **KORE-AUGMENTATION**: Structured knowledge augmentation pipeline — multi-round dialogues (trunk) + instruction tasks (branches) = knowledge tree
+2. **KORE-CONSTRAINT**: Null space projection via covariance matrix SVD — freezes adapter A in null space, fine-tunes only B
+3. **HARS metric**: Harmonized Adaptation-Retention Score for unified evaluation
+4. **State-of-the-art**: Outperforms 9 baselines on EVOKE benchmark across LLaVA-v1.5 (7B/13B) and Qwen2.5-VL (7B)
--- a/raw/papers/lou-autoharness-2026.md
+++ b/raw/papers/lou-autoharness-2026.md
@@ -0,0 +1,28 @@
+---
+title: "AutoHarness: improving LLM agents by automatically synthesizing a code harness"
+created: 2026-05-29
+type: paper-raw
+arxiv: "2603.03329"
+authors: ["Xinghua Lou", "Miguel Lázaro-Gredilla", "Antoine Dedieu", "Carter Wendelken", "Wolfgang Lehrach", "Kevin P. Murphy"]
+venue: "arXiv preprint (cs.CL), February 2026"
+affiliation: "Google DeepMind"
+tags: ["agent", "code-synthesis", "game-playing", "harness", "LLM"]
+---
+
+# AutoHarness: improving LLM agents by automatically synthesizing a code harness
+
+**Authors:** Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy
+**Affiliation:** Google DeepMind
+**arXiv:** [2603.03329](https://arxiv.org/abs/2603.03329) (v1, 10 February 2026)
+**Category:** cs.CL (Computation and Language)
+
+## Abstract
+
+Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games.
+
+## Key Contributions
+
+1. **Code-as-Harness framework**: LLM synthesizes its own harness — transforms agent from LLM+hand-coded-plumbing to LLM+auto-generated-code
+2. **Thompson Sampling tree search**: structured exploration of code harness space
+3. **Three harness modes**: action-filter, action-verifier, and code-as-policy (zero LLM at inference)
+4. **100% legal moves** across 145 TextArena games; Flash+Harness outperforms Pro
--- a/raw/papers/peng-tst-2026.md
+++ b/raw/papers/peng-tst-2026.md
@@ -0,0 +1,28 @@
+---
+title: "Efficient Pre-Training with Token Superposition"
+created: 2026-05-29
+type: paper-raw
+arxiv: "2605.06546"
+authors: ["Bowen Peng", "Théo Gigant", "Jeffrey Quesnelle"]
+venue: "arXiv preprint (cs.CL), v2, May 2026"
+affiliation: "Nous Research"
+tags: ["pre-training", "efficiency", "token-superposition", "LLM"]
+---
+
+# Efficient Pre-Training with Token Superposition
+
+**Authors:** Bowen Peng*, Théo Gigant*, Jeffrey Quesnelle (* equal contribution)
+**Affiliation:** Nous Research
+**arXiv:** [2605.06546](https://arxiv.org/abs/2605.06546) (v2, 19 May 2026)
+**Category:** cs.CL (Computation and Language)
+
+## Abstract
+
+Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
+
+## Key Contributions
+
+1. **Token-Superposition Training (TST)**: A two-phase drop-in method that increases token throughput s× per FLOP without modifying model architecture
+2. **Multi-hot Cross-Entropy (MCE)**: Novel loss function for predicting bags of tokens simultaneously
+3. **Representation Alignment Hypothesis**: Shared embeddings across phases are critical — re-initialization destroys gains
+4. **Extensive scaling validation**: 270M→600M→3B→10B A1B MoE, with 2.5× speedup at largest scale
--- a/raw/papers/pre-train-space-reinforcement-learning-2026.md
+++ b/raw/papers/pre-train-space-reinforcement-learning-2026.md
@@ -0,0 +1,27 @@
+---
+title: "Pre-train Space Reinforcement Learning: From P(y|x) to P(y)"
+arxiv: "2604.14142"
+authors: ["Yuqiao Tan", "Minzheng Wang", "Bo Liu", "Zichen Liu", "Tian Liang", "Shizhu He", "Jun Zhao", "Kang Liu"]
+venue: "arXiv preprint"
+date: "2026-04-15"
+type: paper
+tags: ["reinforcement-learning", "pre-training", "LLM", "reasoning", "GRPO"]
+---
+
+# Pre-train Space Reinforcement Learning
+
+> **arXiv**: [2604.14142](https://arxiv.org/abs/2604.14142)
+> **Authors**: Yuqiao Tan¹²*, Minzheng Wang¹²*, Bo Liu³, Zichen Liu³, Tian Liang⁴, Shizhu He¹²†, Jun Zhao¹², Kang Liu¹²
+> **Affiliations**: ¹ CASIA, ² UCAS, ³ NUS, ⁴ Tencent AI Lab
+> * Equal contribution, † Corresponding author
+
+## Abstract
+
+While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89× and 6.54×, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization.
+
+## Key Claims
+
+1. **Gradient Alignment**: ⟨∇ log P(y), ∇ log P(y|x)⟩ ≥ 0 for all samples (empirically validated), confirming PreRL as a viable surrogate for standard RL
+2. **NSR > PSR in Pre-train Space**: Negative Sample Reinforcement (suppressing incorrect paths) is far more effective than Positive Sample Reinforcement in the pre-train space
+3. **DSRL outperforms GRPO**: Dual Space RL achieves +2-5 point improvement on benchmarks like AIME24/25, with 1.6×-2.5× sample efficiency
+4. **NSR-PreRL stimulates endogenous reasoning**: 14.89× more transition thoughts, 6.54× more reflection thoughts
--- a/raw/papers/when-large-multimodal-models-confront-evolving-knowledge.md
+++ b/raw/papers/when-large-multimodal-models-confront-evolving-knowledge.md
@@ -0,0 +1,40 @@
+---
+title: "When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations"
+authors:
+  - Kailin Jiang
+  - Yuntao Du
+  - Yukai Ding
+  - Yuchen Ren
+  - Zhi Gao
+  - Zilong Zheng
+  - Ning Jiang
+  - Lei Liu
+  - Bin Li
+  - Qing Li
+date: 2026
+arxiv: "2505.24449"
+venue: "ICLR 2026"
+domain: "Multimodal Learning, Knowledge Injection, Continual Learning"
+type: paper
+source: "https://arxiv.org/abs/2505.24449"
+---
+
+# When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
+
+**Authors**: Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Zhi Gao, Zilong Zheng, Ning Jiang, Lei Liu, Bin Li, Qing Li
+
+**Venue**: ICLR 2026
+
+**arXiv**: 2505.24449
+
+## Abstract
+
+Large Multimodal Models (LMMs) store vast amounts of pretrained knowledge but struggle to remain aligned with real-world updates, making it difficult to avoid capability degradation when acquiring evolving knowledge. Furthermore, most current work focuses on exploring static textual knowledge injection, neglecting dynamic multimodal evolving knowledge injection. To address this, the authors propose MMEVOKE, a benchmark for evaluating LMMs' ability in multimodal evolving knowledge injection, containing 9,422 samples spanning 159 subtypes. Through extensive experiments, they reveal challenges such as poor injection performance and capability degradation, and introduce knowledge augmentation and knowledge retention methods to address these challenges.
+
+## Key Contributions
+
+1. **MMEVOKE Benchmark**: First multimodal evolving knowledge injection benchmark with self-evolving data construction pipeline
+2. **Dual Challenge Identification**: Poor knowledge adaptation AND capability degradation after injection
+3. **Knowledge-Aware Augmentation**: Demonstrates semantic augmentation strengthens adaptation while surface-level augmentation is detrimental
+4. **Retention Methods**: Data Replay and MoELoRA effectively mitigate degradation; EWC/LwF fail
+5. **Sufficient Context Paradox**: Even with all necessary information, LMMs still produce incorrect answers
--- a/raw/papers/yang-skillopt-2026.md
+++ b/raw/papers/yang-skillopt-2026.md
@@ -0,0 +1,28 @@
+---
+title: "SkillOpt: Executive Strategy for Self-Evolving Agent Skills"
+created: 2026-05-29
+type: paper-raw
+arxiv: "2605.23904"
+authors: ["Yifan Yang", "Ziyang Gong", "Weiquan Huang", "Qihao Yang", "Ziwei Zhou", "Zisu Huang", "Yan Li", "Xuemei Gao", "Qi Dai", "Bei Liu", "Kai Qiu", "Yuqing Yang", "Dongdong Chen", "Xue Yang", "Chong Luo"]
+venue: "arXiv preprint (cs.AI), v2, May 2026"
+affiliation: "Microsoft, Shanghai Jiao Tong University, Tongji University, Fudan University"
+tags: ["agent", "skill", "optimization", "text-space", "self-evolving"]
+---
+
+# SkillOpt: Executive Strategy for Self-Evolving Agent Skills
+
+**Authors:** Yifan Yang*, Ziyang Gong*, Weiquan Huang*, Qihao Yang*, Ziwei Zhou*, Zisu Huang*, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo (* equal contribution)
+**Affiliation:** Microsoft, SJTU, Tongji, Fudan
+**arXiv:** [2605.23904](https://arxiv.org/abs/2605.23904) (v2, 25 May 2026)
+**Code:** https://aka.ms/SkillOpt
+
+## Abstract
+
+Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision—none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent. SkillOpt is the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses, SkillOpt is best or tied on all 52 evaluated cells and beats every per-cell competitor. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside Codex, and by +19.1 inside Claude Code.
+
+## Key Contributions
+
+1. **Text-space optimizer**: First systematic optimizer for agent skills with deep-learning-style controls (learning rate, validation gate, momentum)
+2. **52/52 best/tied**: Best or tied on all 52 evaluated cells, drawn from 6 benchmarks, 7 target models, and 3 execution harnesses
+3. **Cross-domain transfer**: Skills trained on one model/harness/benchmark transfer positively to others
+4. **Compact artifacts**: 300–2,000 tokens after 1–4 accepted edits
--- a/raw/papers/zhou-agent-symbolic-learning-2024.md
+++ b/raw/papers/zhou-agent-symbolic-learning-2024.md
@@ -0,0 +1,27 @@
+---
+title: "Symbolic Learning Enables Self-Evolving Agents"
+created: 2026-05-29
+type: paper-raw
+arxiv: "2406.18532"
+authors: ["Wangchunshu Zhou", "Yixin Ou", "Shengwei Ding", "Long Li", "Jialong Wu", "Tiannan Wang", "Jiamin Chen", "Shuai Wang", "Xiaohua Xu", "Ningyu Zhang", "Huajun Chen", "Yuchen Eleanor Jiang"]
+venue: "arXiv preprint (cs.CL), June 2024"
+affiliation: "AIWaves Inc."
+tags: ["agent", "symbolic-learning", "self-evolving", "optimization"]
+---
+
+# Symbolic Learning Enables Self-Evolving Agents
+
+**Authors:** Zhou et al. (AIWaves, 2024)
+**arXiv:** [2406.18532](https://arxiv.org/abs/2406.18532)
+**Code:** https://github.com/aiwaves-cn/agents
+
+## Abstract
+
+The AI community has been exploring a pathway to AGI by developing "language agents". A fundamental limitation is that current agent research is model-centric/engineering-centric — progress requires substantial manual engineering. Agent symbolic learning introduces a systematic framework that enables language agents to optimize themselves in a data-centric way using symbolic optimizers. Agents are considered as symbolic networks where learnable weights are defined by prompts, tools, and pipeline structure.
+
+## Key Contributions
+
+1. **Agent as Symbolic Network**: Pipeline = computation graph, Nodes = layers, Prompts/Tools = weights
+2. **Symbolic Back-Propagation**: Language Loss propagated backward through the pipeline → Language Gradients for each node
+3. **Holistic Joint Optimization**: All symbolic components optimized together, avoiding local optima
+4. **Self-Evolving**: Language Loss doesn't need ground truth, enabling learning after deployment