20260617: currently 914 pages

2026-06-17 15:02:40 +08:00
parent e96b955fda
commit 91fac5b6fc
423 changed files with 20687 additions and 34 deletions

@@ -0,0 +1,64 @@
---
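title: "The Man Who Stole Infinity (窃取无穷的数学家)"
source: "Quanta Magazine / Huanqiu Kexue (环球科学), June 2026 issue"
author: "Joseph Howlett"
translator: "Wang Yi (王祎; PhD candidate in logic, College of Philosophy, Nankai University)"
reviewer: "Li Na (李娜; professor of logic, College of Philosophy, Nankai University)"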
date: 2026-06
type: article
tags: [history-of-mathematics, set-theory, infinity, Cantor, Dedekind, academic-ethics]
url: "https://mp.weixin.qq.com/s/xJwwHWAbBsS8NWiNeLbtNQ"
original_url: "https://www.quantamagazine.org/the-man-who-stole-infinity-2026/"
---
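# The Man Who Stole Infinity
> Originally published in Quanta Magazine as "The Man Who Stole Infinity"; the Chinese translation appeared in the June 2026 issue of Huanqiu Kexue (环球科学).
## Overview
In 1874, Georg Cantor published a paper that changed the history of mathematics: it proved that infinities come in different sizes and founded set theory. Letters newly discovered in 2025, however, reveal that this landmark paper concealed key contributions from another mathematician, Richard Dedekind.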
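## Key Historical Milestones
### 1872: Independent, parallel breakthroughs
Cantor and Dedekind each independently published a paper defining the real numbers. Both redefined the number line, showing that the reals form a complete continuum with no "gaps".
### Summer 1872: The meeting in Gersau
The two met for the first time by the lake at Gersau, Switzerland, took to each other at once, and discussed mathematics on walks along the shore. Cantor, 27, was outgoing and eager to publish; Dedekind, 40, was reserved, cautious, and in no hurry.
### 1873: Collaboration and betrayal
While exploring questions about infinity, Cantor corresponded frequently with Dedekind. In his replies, Dedekind supplied a key simplification of the proof that the algebraic numbers are countable, that is, that the set of algebraic numbers is the same size as the set of integers. Cantor incorporated it into his paper alongside his own proof that the real numbers are uncountable.
### 1874: The paper is published
Cantor submitted the paper to Crelle's Journal. To get around Leopold Kronecker, an opponent of the infinite who sat on the editorial board, Cantor chose a misleading title, placed Dedekind's proof about algebraic numbers up front as a "Trojan horse", and tucked his own proof of the uncountability of the reals behind it. He erased every trace of Dedekind's contribution.
### 1930s: Noether brings the truth to light
While editing Dedekind's posthumous papers, Emmy Noether found the key letters. In his private notes, Dedekind wrote that his two proofs had been published "almost word for word" under Cantor's name. Noether and Jean Cavaillès chose to let the letters speak for themselves and made no public accusation.
### 2025: The missing letter is found
In the archives of the University of Halle, science journalist Demian Goos found Dedekind's letter to Cantor of November 30, 1873, which had been presumed lost. The letter directly documents Dedekind's key contribution.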
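## Core Mathematical Content
- **Countability of the algebraic numbers**: Dedekind proved that there is a one-to-one correspondence between the set of algebraic numbers and the set of integers, i.e., the algebraic numbers are countable.
- **Uncountability of the real numbers**: Cantor proved that the set of real numbers has more elements than the set of integers, i.e., the reals are uncountable.
- **Hierarchy of infinities**: Together, these two results established the revolutionary claim that "infinities of different sizes exist".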
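One standard way to see why the algebraic numbers are countable is the height argument sketched here. It is offered as an illustration of the result, not as a reconstruction of the wording in Dedekind's letter; the symbol $h(p)$ and the exact definition of height are assumptions chosen for this sketch.

$$
h(p) = n + |a_0| + |a_1| + \cdots + |a_n|
\quad\text{for}\quad
p(x) = a_n x^n + \cdots + a_1 x + a_0,\qquad a_i \in \mathbb{Z},\ a_n \neq 0,\ n \geq 1 .
$$

For each value $H$ there are only finitely many integer polynomials with $h(p) = H$, because both the degree and every coefficient are bounded by $H$, and each such polynomial has at most $n$ roots. Listing the roots for $H = 2, 3, 4, \dots$ in turn and skipping repeats yields a single sequence that contains every algebraic number, which is exactly a one-to-one correspondence with the positive integers.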
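## Discussion of Academic Ethics
- Cantor's reputation has not suffered from the affair: he remains the first person to have proved that the reals are uncountable
- Dedekind long remained in history's shadow, and there is still no English-language biography of him
- José Ferreirós: "Every branch of science needs a hero, but such stories are always lies."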
## Core Concepts
- [[georg-cantor|Georg Cantor]]
- [[richard-dedekind|Richard Dedekind]]
- [[infinity-hierarchy|Hierarchy of infinities]]
- [[countable-uncountable-infinity|Countable and uncountable infinity]]
- [[algebraic-numbers-countability|Countability of the algebraic numbers]]
- [[emmy-noether|Emmy Noether]]
- [[leopold-kronecker|Leopold Kronecker]]
- [[mathematical-priority-disputes|Priority disputes in mathematics]]
- [[set-theory-history|History of set theory]]

@@ -0,0 +1,86 @@
---
title: "From LLMs to World Models: Yann LeCun's Assessment of AI Architectures (Datawhale)"
source: https://mp.weixin.qq.com/s/Zau10ioTWzhj0KOImpasNg
authors: ["徐虎", "李盛康", "蒋银河", "黎又榛"]
organization: Datawhale
date: 2026-06
type: article
tags: [LLM, JEPA, world-model, VLA, objective-driven-AI, LeCun, representation-collapse, SIGReg, Tapestry]
---
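# From LLMs to World Models: Yann LeCun's Assessment of AI Architectures
> An extension chapter of the Datawhale DIY-LLM open-source project that systematically reviews LeCun's views on where LLMs are headed.
> Project: https://github.com/datawhalechina/diy-llm
## Core Conclusions
1. **LLMs are not the endpoint, but they will not disappear**: they will persist as the "language and knowledge interface layer", the "language cortex" of an intelligent system rather than a complete brain.
2. **"Next-token prediction + scaling" is unlikely to lead to general intelligence**: the core gaps are the ability to predict the consequences of actions and search-based multi-step planning.
3. **VLA (vision-language-action) models are close to failure under the current paradigm**: LeCun says bluntly, "VLA pretty much seen as a failure". The main reasons are insufficient reliability, heavy data dependence, and brittle generalization.
4. **The point of a world model is not to "paint the world" but to "predict controllable consequences in an abstract representation space"**: the water-bottle analogy shows why pixel-level prediction is ineffective.
5. **JEPA's value lies in shifting the learning objective from reconstructing detail to predictable semantic states**: success hinges on preventing representation collapse, and the most promising current route is SIGReg.
6. **LLMs are inherently unsafe and cannot be fundamentally fixed under the current paradigm**: Objective-Driven AI is the right architecture for safe, controllable agents.
7. **The open-source ecosystem will ultimately win the platform war**: the Tapestry federated-training mechanism is LeCun's engineering answer to the sovereign-AI problem.
8. **The future is more likely a two-system division of labor**: LLMs handle language and knowledge interaction, while world models handle understanding the physical world and planning actions.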
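## Article Outline
### 1. Why are LLMs not the endpoint?
- 1.1 Meaningful, but not the right route (the car-wash problem: LLMs lack a model of physical constraints)
- 1.2 Why LLMs succeeded (discrete tokens + a computable prediction objective)
- 1.3 Scaling may have hit its ceiling (about 300 trillion tokens of high-quality text; data bottleneck expected in 2025-2030)
### 2. Two core gaps
- No ability to predict the consequences of actions
- No search-based multi-step planning
- Neither gap can be fixed by "patching" (RAG, tool use, CoT, etc.)
### 3. Why the VLA route is a dead end
- Four levels at which VLA fails: reliability, data cost, generalization, planning
- Three practical reasons industry still bets on VLA
- VLA's scope of applicability: effective in controlled settings, but unable to serve as a general-purpose robotics foundation
### 4. World models: core concepts and the JEPA architecture
- 4.1 Definition of a world model: something that lets an agent predict the consequences of its own actions
- 4.2 The water-bottle analogy: why pixel-level prediction does not work
- 4.3 Generative world models vs. JEPA: the key fork
- 4.4 LeWorldModel: encoder (ViT-Tiny) + predictor (Transformer) + SIGReg regularization
- 4.5 Industrial applications: the near-term value of world models
### 5. Representation collapse: JEPA's hardest technical problem
- 5.1 Definition: the model finds a "cheating solution" that maps every input to the same vector
- 5.2 Three routes: contrastive learning, distillation methods (BYOL/DINO), explicit regularization (VICReg → SIGReg)
- 5.3 The core of SIGReg: the Cramér-Wold theorem → force the embedding distribution to match the isotropic Gaussian N(0, I)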
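The idea in 5.3 can be sketched in a few lines. By the Cramér-Wold theorem, a distribution on R^d is N(0, I) exactly when every one-dimensional projection is N(0, 1), so a regularizer can sample random unit directions and penalize each projected batch for deviating from a standard normal. The code is only an illustration under stated assumptions: it uses a simple moment-matching penalty (mean 0, variance 1) as a stand-in for a proper univariate goodness-of-fit test, which is weaker than what SIGReg itself uses, and all function names are hypothetical.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere in R^dim."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def projection_penalty(embeddings, direction):
    """Penalize a 1-D projection for not having mean 0 and variance 1.

    Assumption: moment matching stands in for a real normality test.
    """
    proj = [sum(e * d for e, d in zip(emb, direction)) for emb in embeddings]
    n = len(proj)
    mean = sum(proj) / n
    var = sum((p - mean) ** 2 for p in proj) / n
    return mean ** 2 + (var - 1.0) ** 2


def sketched_gaussian_penalty(embeddings, num_directions=64, seed=0):
    """Average the 1-D penalty over random directions (Cramér-Wold idea)."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_directions):
        direction = random_unit_vector(dim, rng)
        total += projection_penalty(embeddings, direction)
    return total / num_directions
```

A collapsed encoder, which maps every input to the same vector, has zero variance along every direction, so this penalty stays at 1 or above, while embeddings that are actually drawn from N(0, I) score close to 0. That gap is what makes the collapsed solution unattractive to the optimizer.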
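### 6. The unsafety of LLMs and the objective-driven AI way out
- LLMs are inherently unsafe: hallucination cannot be prevented, and they cannot predict the consequences of their actions
- Objective-driven AI: uses optimization to find the action sequence that minimizes a cost function, so constraints "cannot be violated by construction"
- Planning beforehand vs. constraining afterward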
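A toy sketch of what "planning beforehand" can look like: enumerate candidate action sequences, roll each one forward through a world model, discard any rollout that violates a constraint, and return the cheapest remaining sequence. Everything here is an assumption made for illustration (the one-dimensional dynamics, the cost, the forbidden state, and exhaustive search in place of gradient-based optimization); it is not LeCun's proposed architecture.

```python
from itertools import product


def world_model(state, action):
    """Hypothetical dynamics: a position on a line moves by the action."""
    return state + action


def violates_constraint(state):
    """Hypothetical guardrail: position 2 must never be visited."""
    return state == 2


def plan(start, goal, horizon=4, actions=(-1, 0, 1, 2)):
    """Return the feasible action sequence with the lowest cost.

    Cost = distance to the goal at the end + a small effort penalty.
    Sequences whose predicted rollout touches a forbidden state are
    never candidates, so the constraint holds by construction.
    """
    best_cost, best_seq = float("inf"), None
    for seq in product(actions, repeat=horizon):
        state, feasible = start, True
        for action in seq:
            state = world_model(state, action)
            if violates_constraint(state):
                feasible = False
                break
        if not feasible:
            continue
        cost = abs(goal - state) + 0.1 * sum(abs(a) for a in seq)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost
```

Calling `plan(0, 3)` has to reach position 3 without ever landing on 2, so the planner steps to 1 and then jumps by 2. The contrast with "constraining afterward" is that the unsafe rollouts are filtered out inside the search rather than being caught after an action has been emitted.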
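### 7. Tapestry and sovereign AI
- The information diet and the problem of cognitive sovereignty
- Tapestry federated training: share parameter vectors rather than data
- The Sun Microsystems analogy: open source will eventually win
### 8. A multi-layer division of labor
- LLM layer (language and knowledge interface) → world-model layer (prediction and planning) → objective-driven decision layer
- System 1 (LLM / fast pattern matching) vs. System 2 (world model / consequence simulation)
- Predicted paradigm shift: consensus forms by early 2027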
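## Key Quotes
- "Intelligence is not about predicting the next token; it is about predicting the consequences of actions."
- "Large language models are inherently unsafe because they cannot predict the consequences of their actions."
- "Large language models in their current form cannot be made reliable, because there is no way to stop them from hallucinating."
- "VLA is now pretty much seen as a failure."
- "Objective-driven AI cannot violate safety constraints, by construction."
## References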
- LeWorldModel Paper: https://arxiv.org/abs/2603.19312
- When Does LeJEPA Learn a World Model?: https://arxiv.org/abs/2605.26379
- LeJEPA: Provable and Scalable SSL: https://arxiv.org/pdf/2511.08544.pdf
- Project Tapestry: https://thealliance.ai/projects/tapestry
- VLATest: https://dl.acm.org/doi/10.1145/3729343
- LIBERO-Plus: https://arxiv.org/html/2510.13626v3

@@ -0,0 +1,32 @@
---
source_url: https://mp.weixin.qq.com/s/jg6lW3ObZooBsrWTGwIcRg
ingested: 2026-06-10
---
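# After Two Years of Pydantic, I Had Touched Only a Third of It
> WeChat Official Account article | 2026
> A breakdown of the three parts of the Pydantic ecosystem: pydantic-core (Rust validation engine) + Logfire (OTel observability) + Pydantic AI (type-safe agent framework)
## Key Points
Pydantic is not just a validation library; it is an ecosystem made of three layers:
1. **pydantic-core (Rust)**: validation speed / work outside the GIL / multithreaded concurrency
2. **Logfire (OTel)**: observability / cost monitoring / drift detection
3. **Pydantic AI**: constraints on agent behavior / type-safe tool calls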
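## Key Insights
- **The data source has changed**: in 2018 you validated forms filled in by people, with stable error patterns; in 2026 you validate JSON generated by LLMs, with drifting error patterns
- **From "validation" to "observability"**: looking at individual errors is not enough; watch the trends: which fields are drifting, which model's output is least stable, whether token cost is rising
- **Factory quality-control analogy**: manual spot checks (V1) → automated conveyor-belt scanning (strict=True) → IoT sensors + a live dashboard (all three parts enabled)
- **TypeAdapter**: the same data at different strictness levels: strict at the API boundary, lenient for data passed around inside the agent
- **The strict/forbid/frozen settings cost nothing**: no new package to install, only a change to model_config
- **Types go from "error reporter" to "compiler"**: Pydantic AI's type system constrains the agent's behavior space before runtime
- **Honest boundaries**: only validating an API → stay with pydantic; debugging with print → add Logfire; an agent with 5+ tools → consider Pydantic AI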
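The strict/forbid/frozen settings and the "same data, different strictness" use of TypeAdapter can be sketched as follows. This is a minimal illustration assuming Pydantic v2 is installed; the model `StrictOrder`, its fields, and the helper function names are hypothetical and do not come from the article.

```python
from typing import List

from pydantic import BaseModel, ConfigDict, TypeAdapter, ValidationError


class StrictOrder(BaseModel):
    # strict: no type coercion; forbid: reject unknown fields;
    # frozen: instances cannot be modified after creation.
    model_config = ConfigDict(strict=True, extra="forbid", frozen=True)

    sku: str
    quantity: int


# One adapter, two strictness levels chosen per call.
id_list_adapter = TypeAdapter(List[int])


def strict_ids(raw):
    """API boundary: refuse anything that is not already a list of ints."""
    return id_list_adapter.validate_python(raw, strict=True)


def lenient_ids(raw):
    """Inside the agent: allow the default lax coercion, e.g. "1" -> 1."""
    return id_list_adapter.validate_python(raw)


def rejected(action):
    """Return True if calling action() raises a pydantic ValidationError."""
    try:
        action()
    except ValidationError:
        return True
    return False
```

With these definitions, `lenient_ids(["1", "2"])` returns a list of integers, while `strict_ids(["1", "2"])` raises `ValidationError`. Constructing `StrictOrder` with a string quantity or an unexpected extra field is rejected, and so is assigning to a field of an existing instance.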
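## Incremental Roadmap
1. Today: add strict + forbid + validate_default to every BaseModel
2. This week (if you have an agent): install Logfire, 4 lines of code
3. Next new agent project: use Pydantic AI once the tool count exceeds 3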

@@ -0,0 +1,39 @@
---
source_url: https://mp.weixin.qq.com/s/UnA-OLSc0mVqe7KyBX7yJw
ingested: 2026-06-14
sha256: skip
---
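# Putting Large Models to Work in Finance: From Knowledge Engineering to Post-Training and Deployment
**Speaker:** Wang Yuan (王元), head of the DeepBank algorithm team at Qifu Technology (奇富科技)
**Event:** 2026 DA Shanghai
**Produced by:** DataFun
**Proofreader:** Han Shanshan (韩珊珊)
## Summary
Finance is the "deep end" for large-model deployment: business logic is complex, data compliance is strict, and compute budgets are limited. A general-purpose large model entering production at a bank or fintech company faces no labeled data, no operating manual, no ample supply of GPUs, and sometimes no "ground-truth answer" at all.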
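## Core Content
### The iceberg problem
- Zero-data dilemma: neither the inputs X nor the labels Y exist, so supervised fine-tuning cannot start
- Evaluation blind spot: generative output lacks a reference answer, which makes objective, quantitative evaluation hard
- Compute and compliance barriers: deployment must be on-premises and within a constrained hardware budget
### Data and knowledge engineering
- Reverse knowledge distillation based on the REER algorithm: a four-step process that works backward from QA pairs to extract the business manual
- Multi-dimensional synthetic-data strategy: build training-data diversity along three dimensions (customer, scenario, recorder)
- LLM Wiki approach: modeled on Anthropic Captain's Markdown + Git knowledge-base scheme
### Post-training and deployment
- APO automatic prompt engineering: used as a baseline generator for a high-quality base prompt
- Post-training cost trade-off: SFT < RL with trailing reasoning (后置推理) < RL with leading reasoning (前置推理)
- MoE models + LoRA toolchain conflict: not supported by VeRL
- AI-agent-assisted automation of model training
- Inference acceleration: MoE architecture + Int8 quantization + vLLM
### Evaluating emotional value
- Build the evaluator with a psychology-based approach: "first it looks right, then it proves effective in use"
- At the commercial contract-signing stage, deliver emotional value first, then pursue hard metrics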

@@ -0,0 +1,35 @@
---
source_url: https://openreview.net/forum?id=SXgGKkShhT
ingested: 2026-06-16
sha256: placeholder
---
# Advances in Temporal Point Processes: Bayesian, Neural, and LLM Approaches
**Authors:** Feng Zhou (Renmin Univ.), Quyu Kong (Independent), Jie Qiao (Guangdong Univ. of Tech.), Cheng Wan (Renmin Univ.), Yixuan Zhang (Southeast Univ.), Ruichu Cai (Guangdong Univ. of Tech.)
**Venue:** Transactions on Machine Learning Research (TMLR), June 2026
**OpenReview:** https://openreview.net/forum?id=SXgGKkShhT
## Abstract
Temporal point processes (TPPs) are stochastic process models used to characterize event sequences occurring in continuous time. Traditional statistical TPPs have a long-standing history, with numerous models proposed and successfully applied across diverse domains. In recent years, advances in deep learning have spurred the development of neural TPPs, enabling greater flexibility and expressiveness in capturing complex temporal dynamics. The emergence of large language models (LLMs) has further sparked excitement, offering new possibilities for modeling and analyzing event sequences by leveraging their rich contextual understanding. This survey presents a comprehensive review of recent research on TPPs from three perspectives: Bayesian, deep learning, and LLM approaches. We begin with a review of the fundamental concepts of TPPs, followed by an in-depth discussion of model design and parameter estimation techniques in these three frameworks. We also revisit classic application areas of TPPs to highlight their practical relevance. Finally, we outline challenges and promising directions for future research.
## Taxonomy (Figure 1)
1. **TPP Preliminaries** — Unmarked TPP, Marked TPP, conditional intensity function
2. **Bayesian TPPs** — Parametric Bayesian TPPs, Bayesian Nonparametric Poisson Process, Bayesian Nonparametric Hawkes Process
3. **Neural TPPs** — Recurrent Neural TPPs, Autoregressive (Transformer) TPPs, Diffusion-based TPPs, Parameterization choices
4. **LLM-based TPPs** — LLM-inspired TPPs (PromptTPP, LAMP), Direct LLM-TPP Integration (TPP-LLM, Language-TPP), Multimodal extensions
5. **Datasets & Benchmarks** — EasyTPP, DanmakuTPPBench
6. **Training Methods** — MLE, Wasserstein, NCE, Score Matching, Fisher Divergence
7. **Applications** — Event Prediction (social, epidemiology, finance, recommendation), Causal Discovery (neuroscience, finance, AI ops, cybersecurity)
8. **Challenges** — Data/model heterogeneity, interpretability, scalability, sampling efficiency, multimodal modeling
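The conditional intensity function (item 1) and MLE training (item 6) can be made concrete with a small example. The sketch assumes a univariate Hawkes process with an exponential kernel, λ*(t) = μ + α Σ_{t_i < t} exp(−β (t − t_i)), whose log-likelihood on [0, T] is Σ_i log λ*(t_i) − ∫_0^T λ*(s) ds. The kernel choice, parameter names, and function names are assumptions for illustration; the survey covers far more general models.

```python
import math


def hawkes_intensity(t, history, mu, alpha, beta):
    """Conditional intensity at time t given the earlier event times."""
    excitation = sum(math.exp(-beta * (t - ti)) for ti in history if ti < t)
    return mu + alpha * excitation


def hawkes_log_likelihood(events, T, mu, alpha, beta):
    """Log-likelihood of sorted event times observed on [0, T].

    First term: log-intensity evaluated at each event.
    Second term (compensator): integral of the intensity over [0, T],
    which has a closed form for the exponential kernel.
    """
    log_term = 0.0
    for i, t in enumerate(events):
        log_term += math.log(hawkes_intensity(t, events[:i], mu, alpha, beta))
    compensator = mu * T + (alpha / beta) * sum(
        1.0 - math.exp(-beta * (T - ti)) for ti in events
    )
    return log_term - compensator
```

Maximum-likelihood estimation then amounts to maximizing `hawkes_log_likelihood` over (μ, α, β). Setting α = 0 removes the self-excitation and recovers the homogeneous Poisson log-likelihood n·log μ − μ·T, which is a convenient sanity check.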
## Key Contributions
1. First survey to cover TPPs across Bayesian, neural, AND LLM paradigms in a unified framework
2. Emphasis on Bayesian nonparametric TPPs (overlooked in prior surveys)
3. Systematic review of LLM-based TPPs (nascent area, not previously surveyed)
4. Comprehensive taxonomy bridging statistical rigor, neural flexibility, and LLM capabilities

@@ -0,0 +1,43 @@
---
title: "Principled Uncertainty in Clinical AI: End-to-End Bayesian Modelling and Algorithmic Equity Auditing Across Multimodal Patient Data"
source: "arXiv:2606.09789v1"
authors: "Oladimeji Anthonio, Dimeji Abdulsobur Olawuyi, Oloruntoba Ajayi, Temiloluwa Aderemi, Joseph Odamo"
affiliation: "Centre for Algorithmic Health Equity, University of Ibadan, FUTA"
year: 2026
category: "cs.CY"
published: "2026-06-08"
---
# Principled Uncertainty in Clinical AI: End-to-End Bayesian Modelling and Algorithmic Equity Auditing Across Multimodal Patient Data
**Authors**: Oladimeji Anthonio*, Dimeji Abdulsobur Olawuyi, Oloruntoba Ajayi, Temiloluwa Aderemi, Joseph Odamo
**arXiv**: 2606.09789v1 [cs.CY]
**Published**: 2026-06-08
**Affiliation**: Centre for Algorithmic Health Equity, Ìyàwó, Ibadan; University of Ibadan; FUTA Akure
## Abstract
Clinical artificial intelligence (AI) systems routinely produce predictions without principled quantification of uncertainty, limiting their trustworthiness in high-stakes medical environments. This paper presents an integrated research programme addressing two interconnected problems: (1) the development of a fully end-to-end Bayesian uncertainty modelling framework for multimodal clinical data, and (2) the application of calibrated uncertainty estimates as a formal measure of algorithmic equity across patient subgroups.
The architecture comprises modality-specific variational encoders, a precision-weighted late fusion mechanism, and a decomposed uncertainty output head that separates aleatoric from epistemic uncertainty. The system is trained with a composite Bayesian loss incorporating binary cross-entropy, KL divergence regularisation, and an uncertainty calibration penalty.
**Key Results**:
- ECE = 0.096 (well-calibrated)
- Primary/rural facility patients: 15.3% uncertainty equity gap (p < 0.001, r = 0.698)
- Low SES patients: 6.8% gap (p < 0.001, r = 0.617)
- Elderly patients: 3.9% gap (p < 0.001)
- No significant sex-based disparity detected
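For readers unfamiliar with the calibration metric reported above, ECE partitions predictions into confidence bins and averages the gap between mean predicted probability and observed frequency, weighted by bin size. A minimal sketch for binary predictions follows; the equal-width binning, the bin count, and the function name are assumptions, and the paper may compute ECE differently.

```python
def expected_calibration_error(probs, labels, num_bins=10):
    """ECE for binary predictions.

    probs: predicted probabilities of the positive class, in [0, 1].
    labels: ground-truth labels, 0 or 1.
    """
    n = len(probs)
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins; p == 1.0 goes into the last bin.
        index = min(int(p * num_bins), num_bins - 1)
        bins[index].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - frac_pos)
    return ece
```

A model that predicts 0.8 for five patients of whom four are positive is perfectly calibrated on that bin and contributes nothing, whereas predicting 0.9 for patients who are all negative contributes a gap of 0.9.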
## Key Concepts
- [[epistemic-uncertainty]]: reducible, model-knowledge uncertainty
- [[aleatoric-uncertainty]]: irreducible, data-noise uncertainty
- [[uncertainty-quantification]]: probabilistic prediction framework
- [[bayesian-deep-learning]]: variational inference in neural networks
- [[expected-calibration-error]]: calibration metric (ECE)
- [[uncertainty-equity-gap]]: UEG equity metric
- [[uncertainty-disparity-ratio]]: UDR equity metric
- [[precision-weighted-fusion]]: multimodal late fusion
- [[mc-dropout]]: Monte Carlo Dropout for uncertainty
- [[algorithmic-equity]]: algorithmic fairness
- [[clinical-ai]]: clinical artificial intelligence
- [[variational-autoencoder]]: VAE foundation

@@ -0,0 +1,37 @@
---
title: "Minimax-Optimal Policy Regret in Partially Observable Markov Games"
source: "arXiv:2606.02363v1"
authors: "Raman Arora"
affiliation: "Johns Hopkins University"
year: 2026
category: "cs.LG, stat.ML"
published: "2026-06-01"
venue: "ICML 2026"
---
# Minimax-Optimal Policy Regret in Partially Observable Markov Games
**Author**: Raman Arora (Johns Hopkins University)
**arXiv**: 2606.02363v1 [cs.LG, stat.ML]
**Venue**: ICML 2026, Seoul
**Published**: 2026-06-01
## Abstract
We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner's strategy, making standard regret notions inadequate.
We prove that an epoch-based optimistic maximum-likelihood algorithm achieves Õ(√T) policy regret, with explicit dependence on the horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. A matching lower bound confirms minimax optimality. Extensions include horizon-adaptive guarantees and adversaries with geometric fading memory.
## Key Concepts
- [[partially-observable-markov-game|POMG]] — core model: partial observability + strategic adversary
- [[policy-regret|Policy Regret]] — counterfactual regret against adaptive opponents
- [[eluder-dimension|Eluder Dimension]] — sequential complexity measure
- [[observable-operator-model|OOM]] — operator-based representation of POMG dynamics
- [[posterior-lipschitz-adversary|Posterior-Lipschitz Adversary]] — smoothness assumption
- [[weak-revealing-condition|Weak Revealing]] — observation informativeness condition
- [[causal-decomposition-pomg|Causal Decomposition]] — separating world from adversary
- [[epoch-based-optimistic-mle|Epoch-based Optimistic MLE]] — the algorithm
- [[minimax-optimality|Minimax Optimality]] — matching upper and lower bounds
- [[pomdp|POMDP]] — single-agent precursor
- [[adaptive-adversary|Adaptive Adversary]] — strategic opponent model
- [[fading-memory|Fading Memory]] — adversary memory extension

@@ -0,0 +1,28 @@
---
title: "Bellman-Taylor Score Decoding for MDPs with State-Dependent Feasible Action Sets"
source_url: https://arxiv.org/abs/2606.10979
ingested: 2026-06-17
sha256: <computed>
---
# Bellman-Taylor Score Decoding for MDPs with State-Dependent Feasible Action Sets
**Authors:** Yi Chen, Rushuai Yang, Qiang Chen, Dongyan (Lucy) Huo — HKUST, Dept. of IEDA
**arXiv:** 2606.10979v1 [cs.AI] (2026-06-09)
## Abstract
Proposes Bellman-Taylor score decoding, a framework that moves policy learning to a Euclidean score space while enforcing feasibility through an action decoder. The approach is motivated by a Taylor expansion of the optimal action-value function. The induced latent-score MDP can then be optimized by standard DRL algorithms without differentiating through the decoder. The paper provides a performance guarantee in which the optimality gap is expressed as a structural approximation error plus an algorithmic learning error. The method is applied to queueing network control, where it learns a state-dependent index-based dispatching rule.
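One way to picture the decoding step: the policy outputs an unconstrained score vector, and a decoder turns it into a feasible action by picking, among the actions allowed in the current state, the one with the highest score. The sketch shows that idea for a toy dispatching problem in which the scores act as per-queue priority indices. The argmax rule, the feasibility rule, and all names are assumptions made for illustration; the paper's actual decoder may differ.

```python
def feasible_actions(queue_lengths):
    """Hypothetical feasibility rule: only a non-empty queue can be served."""
    return [i for i, length in enumerate(queue_lengths) if length > 0]


def decode_action(scores, queue_lengths):
    """Map an unconstrained score vector to a feasible action.

    Returns the index of the queue to serve, or None (idle) when no
    queue is eligible. The scores themselves are never constrained;
    feasibility is enforced entirely by this decoder.
    """
    candidates = feasible_actions(queue_lengths)
    if not candidates:
        return None
    return max(candidates, key=lambda i: scores[i])
```

Because the environment only ever sees the decoded action, a standard RL algorithm can treat the score vector as its action and does not need gradients through `decode_action`.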
## Key Concepts
- [[bellman-taylor-score-decoding|Bellman-Taylor score decoding]]
- [[latent-score-mdp|Latent-score MDP]]
- [[state-dependent-feasible-action-sets|State-dependent feasible action sets]]
- [[action-decoder|Action decoder]]
- [[post-action-configuration|Post-action configuration]]
- [[taylor-expansion-q-function|Taylor expansion of the Q-function]]
- [[queueing-network-control|Queueing network control]]
- [[btsd-ppo|BTSD-PPO]]
- [[continuation-value-function|Continuation value function]]

@@ -0,0 +1,25 @@
# Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
**Authors:** Yuxi Chen\*, Junming Chen\*, Chenyu He\*, Yiwei Li\*, Yicheng Ji\*, Yifan Wu\*, Dingyu Yang, Lansong Diao, Lidan Shou, Hongliang Zhang, Huan Li, Gang Chen
**Affiliations:** Zhejiang University (CS + Economics), Alibaba Cloud
**arXiv:** [2605.09104](https://arxiv.org/abs/2605.09104) (v1, May 2026)
**Venue:** cs.AI (Survey)
**GitHub:** [SuDIS-ZJU/Token-Economics](https://github.com/SuDIS-ZJU/Token-Economics)
---
## Abstract
As LLM agents evolve, tokens have emerged as the core economic primitives of Agentic AI. However, their exponential consumption introduces severe computational, collaborative, and security bottlenecks. Current surveys remain fragmented across system optimization, architecture design, and trust, lacking a unified framework to evaluate the fundamental trade-off between output quality and economic cost. To bridge this gap, this survey presents the first comprehensive survey of **Token Economics**. By unifying computer science and economics, we conceptualize tokens as **production factors, exchange mediums, and units of account**. We synthesize existing literature across a **four-dimensional taxonomy**: (1) Micro-level (Single Agent): Optimizing budget-constrained factor substitution via neoclassical firm theory. (2) Meso-level (Multi-Agent Systems): Minimizing collaboration friction using transaction cost and principal-agent theories. (3) Macro-level (Agent Ecosystems): Addressing congestion externalities and pricing via mechanism design. (4) Security: Internalizing adversarial threats as endogenous economic constraints. Finally, we outline frontier directions, including differentiable token budgets and dynamic markets.
## Key Concepts
- [[token-economics]] — the unified dual-view framework
- [[token-as-economic-primitive]] — tokens as production factors, exchange mediums, units of account
- [[micro-level-token-economics]] — single-agent budget-constrained optimization
- [[meso-level-token-economics]] — multi-agent collaboration friction
- [[macro-level-token-economics]] — ecosystem-level congestion and pricing
- [[token-security-economics]] — adversarial threats as endogenous constraints
- [[agent-token-budget-optimization]] — factor substitution and budget allocation
- [[differentiable-token-budgeting]] — frontier: learnable token budgets
- [[token-market-dynamics]] — real-time token markets and dynamic pricing

@@ -0,0 +1,39 @@
---
source_url: https://arxiv.org/abs/2606.13655
ingested: 2026-06-13
sha256: flex4dhuman-raw-v1
---
# Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction
**arXiv:** 2606.13655
**Authors:** Jen-Hao Cheng (UW), Yipeng Wang (World Labs), Hao Zhang (World Labs), Gengshan Yang (World Labs), Jenq-Neng Hwang (UW)
**Categories:** cs.CV, cs.GR
**Published:** 2026-06-11
## Abstract
We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on Wan 2.1's 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats.
## Key Contributions
1. **Multi-view video diffusion without explicit geometry priors** — Adapts Wan 2.1 using only relative camera-pose positional encoding
2. **Flexible synchronized generation** — Supports monocular and variable sparse-view inputs, arbitrary target viewpoints, and temporal rollout
3. **Monocular video to 4D Gaussian splats** — Generated multi-view videos feed into FreeTimeGS for dynamic reconstruction
## Key Concepts
- [[five-axis-positional-encoding]]: (time, view, SE(3), h, w) RoPE extension
- [[se3-relative-camera-encoding]]: Continuous SE(3) camera geometry via PRoPE
- [[clean-conditioning-mask]]: Binary mask distinguishing reference vs target tokens
- [[three-stage-curriculum-training]]: Stage 1 pose following → Stage 2 dynamic refs → Stage 3 temporal rollout
- [[temporal-rollout]]: Chunked inference with teacher-forced history overlap
- [[multi-view-captioning]]: Gemini 3 Flash generated appearance captions
## Results
- DNA-Rendering: +1.21 dB PSNR over Diffuman4D-GT-skeleton (25.44 dB)
- Zero-shot ActorsHQ: +3.35 dB PSNR over Diffuman4D-mono-skeleton (21.32 dB)
- Generalizes to animals (DFA) after fine-tuning
- Robust to reference view azimuth (<1 dB variation)
- Monotonically improves with more reference views

@@ -0,0 +1,29 @@
---
title: "On the fibers and semi-algebraicity of ReLU neuromanifolds"
source: "arXiv:2606.02826v1"
authors: "Axel Flinth, Stefano Mereta, Michele Pernice"
affiliation: "KTH Royal Institute of Technology / WASP"
year: 2026
category: "math.AG"
published: "2026-06-01"
---
# On the fibers and semi-algebraicity of ReLU neuromanifolds
**Authors**: Axel Flinth, Stefano Mereta, Michele Pernice
**arXiv**: 2606.02826v1 [math.AG]
**Published**: 2026-06-01
## Abstract
We study the semi-algebraicity of the neuromanifold M_d of a feedforward ReLU neural network and its symmetries. We prove that M_d is not a semi-algebraic quotient of the space of weights. We introduce honest open subsets where the network shows no hidden symmetries, conjecture the maximal honest open is always semi-algebraic, and prove it is Zariski open in the shallow case.
## Key Concepts
- [[neuromanifold|Neuromanifold]] — function space parametrized by network weights
- [[neuroalgebraic-geometry|Neuroalgebraic Geometry]] — algebro-geometric study of neural networks
- [[semi-algebraic-set|Semi-algebraic Set]] — sets defined by polynomial equalities/inequalities
- [[honest-open-subset|Honest Open Subset]] — region free of hidden symmetries
- [[hidden-symmetries-neural|Hidden Symmetries]] — symmetries beyond scaling and permutation
- [[parametrization-map|Parametrization Map]] — weight-to-function mapping
- [[scaling-permutation-symmetry|Scaling & Permutation Symmetries]] — trivial NN symmetries
- [[fiber-of-parametrization|Fiber of Parametrization]] — preimage of a function
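The scaling and permutation symmetries listed above are easy to check numerically for a shallow network f(x) = Σ_j v_j · ReLU(w_j x + b_j). Because ReLU(c·z) = c·ReLU(z) for c > 0, multiplying a hidden neuron's incoming weight and bias by c and dividing its outgoing weight by c leaves f unchanged, and so does reordering the hidden neurons. The sketch uses a hypothetical one-input network; the function names are for illustration only.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def shallow_relu(x, hidden):
    """Evaluate f(x) for a list of (w, b, v) hidden-neuron triples."""
    return sum(v * relu(w * x + b) for (w, b, v) in hidden)


def rescale_neuron(hidden, index, c):
    """Apply the positive scaling symmetry to one hidden neuron."""
    if c <= 0:
        raise ValueError("the scaling symmetry requires c > 0")
    rescaled = list(hidden)
    w, b, v = rescaled[index]
    rescaled[index] = (c * w, c * b, v / c)
    return rescaled


def permute_neurons(hidden, order):
    """Apply the permutation symmetry: reorder the hidden neurons."""
    return [hidden[i] for i in order]
```

Both transformations change the weight vector but not the function, so they move along a fiber of the parametrization map. The hidden symmetries discussed in the paper are those that remain after these two families are accounted for.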

@@ -0,0 +1,43 @@
---
title: "One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning"
authors: "Ritesh Goru, Shanay Mehta, Prateek Jain"
venue: "ICML 2025 Workshop: 3rd Workshop on Efficient Systems for Foundational Models"
year: 2025
arxiv: "2504.18246"
code: "https://github.com/devrev/One-Pass-to-Reason"
dataset: "https://huggingface.co/datasets/devrev-research/MathChatSync-reasoning"
type: paper
tags: [efficient-fine-tuning, multi-turn-reasoning, attention-mask, token-duplication, single-pass-training]
---
## Abstract
Fine-tuning Large Language Models (LLMs) on multi-turn reasoning datasets requires N (number of turns) separate forward passes per conversation due to reasoning token visibility constraints, as reasoning tokens for a turn are discarded in subsequent turns. We propose duplicating response tokens along with a custom attention mask to enable single-pass processing of entire conversations. We prove our method produces identical losses to the N-pass approach while reducing time complexity from O(N³) to O(N²) and maintaining the same memory complexity for a transformer based model. Our approach achieves significant training speedup while preserving accuracy. Our implementation is available online.
## Core Problem
Reasoning models (e.g., DeepSeek-R1) generate internal reasoning tokens, produce a response, and then discard the reasoning tokens from context in subsequent turns. This creates:
1. **Visibility Constraints**: Reasoning tokens must be visible during generation but hidden from subsequent turns — static attention masks cannot satisfy this
2. **Position ID Discrepancy**: Response tokens follow reasoning tokens during generation but directly follow human messages in later context
## Method
1. **Token Duplication**: Duplicate response tokens so ri_in (context copy) does not attend to reasoning, while ri_out (generation copy) does
2. **Custom Block-Sparse Attention Mask**: Single mask with visibility rules per token type
3. **Strategic Position ID Assignment**: Maintains correct relative positions equivalent to N-pass
4. **Theorem 2.1**: Proves loss equivalence L_N-Pass(c) = L_1-Pass(c)
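The visibility rules behind steps 1 and 2 can be sketched as a boolean mask over a toy conversation. Each token is tagged with its turn and one of four kinds: human (`h`), reasoning (`t`), the generation copy of the response (`r_out`), and the context copy of the response (`r_in`). The token layout within a turn, the tag names, and the helper functions are assumptions made for illustration; the sketch ignores position IDs and does not reproduce the paper's implementation.

```python
def build_sequence(turn_lengths):
    """Lay out tokens per turn as: human, reasoning, r_out, r_in.

    turn_lengths: list of dicts such as {"h": 1, "t": 2, "r": 1}.
    The response length "r" is used twice because the response
    tokens are duplicated.
    """
    seq = []
    for turn, lengths in enumerate(turn_lengths):
        seq += [(turn, "h")] * lengths["h"]
        seq += [(turn, "t")] * lengths["t"]
        seq += [(turn, "r_out")] * lengths["r"]
        seq += [(turn, "r_in")] * lengths["r"]
    return seq


def can_attend(query, key):
    """Visibility rule by token kind (causality is applied separately)."""
    (q_turn, q_kind), (k_turn, k_kind) = query, key
    if k_kind == "h":
        return True  # human messages stay in context for everyone
    if k_kind == "r_in":
        # Context copy: seen by later turns and by itself, never by
        # the same turn's reasoning or generation copy.
        return q_turn > k_turn or q_kind == "r_in"
    if k_kind == "t":
        # Reasoning: seen only while generating within the same turn.
        return q_turn == k_turn and q_kind in ("t", "r_out")
    # Generation copy: seen only by itself within the same turn.
    return q_turn == k_turn and q_kind == "r_out"


def build_mask(seq):
    """mask[i][j] is True when token i may attend to token j."""
    n = len(seq)
    return [
        [j <= i and can_attend(seq[i], seq[j]) for j in range(n)]
        for i in range(n)
    ]
```

With this mask, the generation copy of a response sees its own turn's reasoning, while every token in a later turn sees only the human messages and the context copies from earlier turns. That is the same context an N-pass setup would present to each turn separately.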
## Results
- 1.05× to 1.22× faster than FlashAttention-2 N-Pass with packing (Qwen-3 4B, 8B, 32B)
- 1.44× to 1.54× faster than FlexAttention N-Pass with packing
- ~33% more GPU memory
- Speedups grow with conversation depth (O(N²) vs O(N³) theoretical advantage)
- K-Pass variant allows a speed-memory trade-off
## Key Contributions
1. Theoretical framework for single-pass multi-turn reasoning training
2. MathChatSync Reasoning dataset (first public multi-turn reasoning dataset with explicit per-turn reasoning)
3. Comprehensive empirical validation on Qwen-3 models using QLoRA

@@ -0,0 +1,29 @@
# Auditing Agent Harness Safety
**Authors:** Chengzhi Liu\*, Yichen Guo\*, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wenyue Hua, Sheng Liu, Sharon Li, Yuheng Bu, Xin Eric Wang
**Affiliations:** UC Santa Barbara, UC Berkeley, Stanford University, UW-Madison, Microsoft Research
**arXiv:** [2605.14271](https://arxiv.org/abs/2605.14271) (v2, May 2026)
**Venue:** cs.CL
**Project Page:** [harnessaudit.github.io](https://harnessaudit.github.io)
---
## Abstract
LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose **HarnessAudit**, a framework that audits full execution trajectories across **boundary compliance**, **execution fidelity**, and **system stability**, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce **HarnessAudit-Bench**, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
## Key Concepts
- [[agent-harness-safety]] — the core paradigm
- [[harnessaudit]] — the auditing framework
- [[boundary-compliance]] — L1: tool, resource, information-flow violations
- [[execution-fidelity]] — L2: action validity, checkpointed completion
- [[system-stability]] — L3: perturbation resilience
- [[trajectory-auditing]] — trajectory-level evidence collection
- [[multi-agent-safety]] — multi-agent coordination safety risks
- [[information-flow-control]] — inter-agent communication constraints
- [[resource-access-control]] — resource scope enforcement
- [[safety-adherence-rate]] — SAR scoring metric
- [[policy-constrained-execution]] — formal harness model
- [[execution-harness]] — harness as policy-constrained execution system
- [[hidden-audit-channel]] — agent-independent evidence recording

@@ -0,0 +1,57 @@
---
title: "IntrAgent: An LLM Agent for Content-Grounded Information Retrieval through Literature Review"
type: raw-paper
arxiv: "2604.22861"
year: 2026
authors: "Fengbo Ma, Zixin Rao, Xiaoting Li, Zhetao Chen, Hongyue Sun, Yiping Zhao, Xianyan Chen, Zhen Xiang"
venue: "arXiv 2026"
code: "https://github.com/FengboMa/IntrAgent"
dataset: "https://huggingface.co/datasets/IntrAgent/IntraBench"
project: "https://intragent.github.io/"
---
# IntrAgent: An LLM Agent for Content-Grounded Information Retrieval through Literature Review
**Authors:** Fengbo Ma*, Zixin Rao*, Xiaoting Li, Zhetao Chen, Hongyue Sun, Yiping Zhao, Xianyan Chen†, Zhen Xiang†
**Affiliation:** University of Georgia, Athens, GA, USA
**arXiv:** 2604.22861
**Date:** April 23, 2026
## Abstract
Scientific research relies on accurate information retrieval from literature to support analytical decisions. In this work, we introduce a new task, INformation reTRieval through literAture reVIEW (IntraView), which aims to automate fine-grained information retrieval faithfully grounded in the provided content in response to research-driven queries, and propose IntrAgent, an LLM-based agent that addresses this challenging task. In particular, IntrAgent is designed to mimic human behaviors when reading literature for information retrieval: identifying relevant sections and then iteratively extracting key details to refine the retrieved information. It follows a two-stage pipeline: a Section Ranking stage that prioritizes relevant literature sections through structural-knowledge-enabled reasoning, and an Iterative Reading stage that continuously extracts details and synthesizes them into concise, contextually grounded answers. To support rigorous evaluation, we introduce IntraBench, a new benchmark consisting of 315 test instances built from expert-authored questions paired with literature spanning five STEM domains. Across seven backbone LLMs, IntrAgent achieves on average 13.2% higher cross-domain accuracy than state-of-the-art RAG and research-agent baselines.
## Key Contributions
1. **IntraView Task** — A novel task for accurate, automated, and content-grounded information retrieval from the provided scientific literature.
2. **IntrAgent Framework** — An LLM agent with a two-stage pipeline (Section Ranking + Iterative Reading) that mimics human reading behavior.
3. **Hierarchy Preservation** — Leverages structural knowledge of scientific documents for more effective section ranking.
4. **Sufficiency Check** — Mitigates hallucination by explicitly assessing whether accumulated information is adequate to answer the query.
5. **IntraBench** — The first benchmark for evaluating IntraView, with 315 test instances across five domains (physics, earth science, public health, engineering, material science).
## Method Overview
### Section Ranking
1. **Section Heading Parsing**: Convert literature to Markdown with minerU for layout/section detection.
2. **Hierarchy Preservation**: Construct a section tree from headings using LLM-based hierarchy inference.
3. **Reasoning-Based Ranking**: LLM ranks sections by relevance to the research question via structure-aware reasoning.
### Iterative Reading
- **Reordered Section Access**: Read sections in descending relevance order.
- **Section Detail Extraction**: Extract key scientific details (terminology, numbers, experiments, statistics, conclusions).
- **Information Sufficiency Check**: LLM evaluates whether accumulated details are sufficient; terminates or continues reading.
- **Confidence-Based Reading Styles**: Conservative, balanced (default), and aggressive modes to control operational overhead.
- **Final Answer Synthesis**: Synthesize answer from all accumulated details.
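
The Iterative Reading steps above can be sketched as a short control loop. This is a minimal illustration, not IntrAgent's implementation: `iterative_reading`, `keyword_rank`, the `Section` type, and the `max_sections` cap are hypothetical names introduced here, and the ranking, extraction, sufficiency, and synthesis callables stand in for what the paper describes as LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Section:
    heading: str
    text: str


def keyword_rank(question: str, sections: List[Section]) -> List[Section]:
    """Toy stand-in for reasoning-based ranking: order by word overlap."""
    words = set(question.lower().split())
    return sorted(sections, key=lambda s: -len(words & set(s.text.lower().split())))


def iterative_reading(
    question: str,
    sections: List[Section],
    rank: Callable[[str, List[Section]], List[Section]],
    extract: Callable[[str, Section], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    synthesize: Callable[[str, List[str]], str],
    max_sections: int = 10,  # hypothetical cap, not from the paper
) -> str:
    details: List[str] = []
    # Read sections in descending relevance order.
    for section in rank(question, sections)[:max_sections]:
        # Accumulate key details from the current section.
        details.extend(extract(question, section))
        # Stop as soon as the accumulated evidence is judged adequate.
        if is_sufficient(question, details):
            break
    # Synthesize the answer from everything gathered so far.
    return synthesize(question, details)
```

Passing a `synthesize` callable that returns "None of the above" when `details` is empty would be one way to model the case where the answer is not present in the literature.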
## Evaluation
- **IntraBench**: 315 test instances across physics, earth science, public health, engineering, material science.
- **LLM-Grounded Multiple-Choice Evaluation**: LLM maps generated free-form answers to multiple-choice candidates, addressing synonym/abbreviation challenges.
- **Baselines**: RAG systems (vanilla RAG, re-ranking, contextual retrieval) and literature agents (PaperQA2, QASA, SciMaster).
- **Results**: 13.2% average cross-domain accuracy improvement over baselines across 7 backbone LLMs.
## Key Design Insights
- Structural knowledge (section hierarchy) is critical for accurate section ranking; semantic similarity alone is insufficient.
- Sufficiency check prevents both hallucination (premature answer with insufficient evidence) and over-reading.
- The framework can handle queries where the answer is NOT present in the literature (through explicit "None of the above" handling).

View File

@@ -0,0 +1,88 @@
---
title: "LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels"
authors: ["Lucas Maes", "Quentin Le Lidec", "Damien Scieur", "Yann LeCun", "Randall Balestriero"]
arxiv: "2603.19312v3"
published: "2026-03-13 (updated 2026-06-03)"
categories: [cs.LG, cs.AI]
affiliations: ["Mila & Université de Montréal", "New York University", "Samsung SAIL", "Brown University"]
source: https://arxiv.org/abs/2603.19312
code: linked in paper
---
# LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
> Lucas Maes*, Quentin Le Lidec*, Damien Scieur, Yann LeCun, Randall Balestriero (* equal contribution)
## Abstract (original text)
Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48× faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.
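## Core Contributions
1. **The first end-to-end JEPA world model that needs no training heuristics (stop-gradient, EMA, or a pretrained encoder)**
2. Only **2 loss terms and 1 tunable hyperparameter** λ (versus 6 hyperparameters for PLDM)
3. ~15M parameters, trainable on a single GPU in a few hours
4. Plans up to **48×** faster than DINO-WM (~200× fewer tokens)
5. Push-T success rate of **96%** (an 18% improvement over PLDM)
6. The latent space encodes meaningful physical structure; physical quantities can be recovered by probing
7. Surprise evaluation confirms that the model reliably detects physically implausible events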
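## Architecture
### Encoder
- ViT-Tiny (~5M parameters): 14×14 patches, 12 layers, 3 attention heads, hidden dimension 192
- Key design choice: a **BatchNorm** projection head (not LayerNorm, because LN constrains the variance distribution and hinders SIGReg)
### Predictor
- Transformer (~10M parameters): 6 layers, 16 attention heads, 10% dropout
- Action conditioning is injected through **AdaLN** (adaptive layer normalization), initialized to zero so that its influence grows gradually
- A temporal causal mask enables autoregressive prediction of the next-frame representation
### Training Objective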
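$$\mathcal{L} = \|\hat{Z}_{t+1} - Z_{t+1}\|^2 + \lambda \cdot \mathrm{SIGReg}(Z)$$
- No stop-gradient (unlike I-JEPA/V-JEPA)
- No EMA (unlike BYOL/DINO)
- No pretrained encoder (unlike DINO-WM)
- SIGReg uses the Cramér-Wold theorem to push the embeddings toward an isotropic Gaussian distribution N(0, I)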
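
The two-term objective can be illustrated with a small NumPy sketch. The exact SIGReg statistic is not given in these notes, so `gaussianity_penalty` below is a deliberately crude stand-in: it projects the embeddings onto random unit directions (the Cramér-Wold idea that a distribution is determined by its one-dimensional projections) and only matches the first two moments of each projection to N(0, 1). The function names, the number of directions, and the default λ are assumptions made for illustration.

```python
import numpy as np


def gaussianity_penalty(z: np.ndarray, num_directions: int = 64, seed: int = 0) -> float:
    """Crude stand-in for SIGReg: moment-match random 1-D projections to N(0, 1)."""
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((z.shape[1], num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)  # unit vectors
    proj = z @ directions  # shape (batch, num_directions)
    # Penalize non-zero mean and non-unit variance along each direction.
    return float(np.mean(proj.mean(axis=0) ** 2 + (proj.var(axis=0) - 1.0) ** 2))


def lewm_style_loss(z_pred: np.ndarray, z_next: np.ndarray, z: np.ndarray, lam: float = 0.1) -> float:
    """Next-embedding prediction loss plus lam times the Gaussianity penalty."""
    prediction = float(np.mean(np.sum((z_pred - z_next) ** 2, axis=1)))
    return prediction + lam * gaussianity_penalty(z)
```

A fully collapsed batch (every embedding identical) has zero variance along every direction, so the penalty stays large even when the prediction term is zero, which is the role the regularizer plays in the objective above.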
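## Key Ablations
| Ablation | Push-T success rate |
|------|-------------|
| LeWM (full) | **96.0%** |
| No SIGReg regularization | Collapse (~30%) |
| No AdaLN (actions simply concatenated) | Decreases |
| BatchNorm → LayerNorm | Decreases (SIGReg becomes hard to optimize) |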
## Positioning Against Existing Methods
| Method | End-to-end | Task-agnostic | Pixel input | No reconstruction | No reward | Anti-collapse guarantee |
|------|--------|---------|---------|--------|--------|----------|
| PLDM | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ (6 hyperparameters) |
| DINO-WM | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ (frozen encoder) |
| Dreamer | ✅ | ❌ | ✅ | ❌ | ❌ | N/A |
| TD-MPC | ✅ | ❌ | ❌ | ✅ | ❌ | N/A |
| **LeWM** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ (1 hyperparameter) |
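## Limitations
1. Planning with current latent world models is still limited to **short horizons**; autoregressive error accumulates as the planning horizon grows
2. Depends on an offline dataset with sufficient interaction coverage
3. In simple scenes, forcing a high-dimensional Gaussian prior through SIGReg can make representation learning harder
4. Requires explicit action labels (which inverse dynamics modeling could mitigate)
5. Experiments are limited to **low-dimensional controlled tasks** such as Push-T, Reacher, TwoRoom, and OGBench-Cube
6. Slightly below the state of the art on OGBench-Cube (DINO-WM benefits from DINOv2 pretraining)
## Significance
**An important milestone for the JEPA line of work, not the final answer to the world-model problem.** It validates the engineering feasibility of end-to-end JEPA world models and is the only specific world-model paper LeCun recommended in an interview.
## Related Concepts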
- [[leworldmodel]]
- [[jepa]]
- [[sigreg]]
- [[pldm]]
- [[world-model-lecun]]
- [[representation-collapse]]

View File

@@ -0,0 +1,28 @@
---
title: "Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer"
source_url: https://arxiv.org/abs/2606.12890
ingested: 2026-06-17
sha256: <computed>
---
# Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer
**Authors:** Aryan Naveen (MIT), Haitong Ma, Haldun Balim, Na Li — Harvard SEAS
**arXiv:** 2606.12890v1 [cs.RO] (2026-06-11)
**8 pages, 4 figures, 1 table. NSF AI Institute + ONR.**
## Abstract
Proposes RepMT-SAC, a framework for multi-task RL that enables efficient knowledge sharing and robust transfer to new tasks. Uses spectral MDP decomposition to capture transferable dynamics, structuring the value function into a task-agnostic core with a minimal task-specific adjustment. Allows for strong zero-shot performance on in-distribution tasks and rapid few-shot adaptation to out-of-distribution tasks. Evaluated on quadcopter trajectory-following tasks across in-distribution and out-of-distribution contexts, outperforming baselines by up to 30%.
## Key Concepts
- [[rep-mt-sac|RepMT-SAC]]
- [[spectral-mdp-decomposition|Spectral MDP decomposition]]
- [[task-invariant-representation|Task-invariant representation]]
- [[task-conditioned-policy|Task-conditioned policy]]
- [[upstream-downstream-learning|Upstream-downstream learning]]
- [[quadrotor-trajectory-following|Quadrotor trajectory following]]
- [[soft-actor-critic|SAC]]

View File

@@ -0,0 +1,25 @@
# Stem: Rethinking Causal Information Flow in Sparse Attention
**Authors:** Lin Niu\*, Xin Luo\*, Linchuan Xie, Yifu Sun, Guanghua Yu, Jianchen Zhu, S Kevin Zhou
**Affiliations:** Tencent, University of Science and Technology of China (USTC)
**arXiv:** [2603.06274](https://arxiv.org/abs/2603.06274) (v1, March 2026)
**Venue:** cs.LG / cs.AI
**Implementation:** Triton-based Block Sparse Attention kernel (open-source)
---
## Abstract
The quadratic computational complexity of self-attention remains a fundamental bottleneck for scaling LLMs to long contexts, particularly during the **pre-filling phase**. In this paper, we rethink the causal attention mechanism from the perspective of **information flow**. Due to causal constraints, tokens at initial positions participate in the aggregation of every subsequent token. However, existing sparse methods typically apply a **uniform top-k selection** across all token positions within a layer, ignoring the cumulative dependency of token information inherent in causal architectures. To address this, we propose **Stem**, a novel, plug-and-play sparsity module aligned with information flow:
1. **Token Position-Decay (TPD)**: position-dependent top-k within each layer — larger budget for initial tokens, aggressive sparsification for later tokens
2. **Output-Aware Metric (OAM)**: prioritizes high-impact tokens based on approximate output magnitude (incorporating Value information), not just attention scores
Stem is **training-free** and can also be integrated into training-based sparse models (DeepSeek-V3.2, MiniCPM-4.1) to further compress the sparse budget. Evaluated on RULER and LongBench with Llama3.1-8B and Qwen3-8B, Stem achieves superior accuracy with reduced pre-filling latency.
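
The two components can be illustrated with a token-level NumPy sketch. It is one reading of the abstract, not the paper's algorithm: the linear decay schedule, the parameter names (`start_ratio`, `end_ratio`, `min_keep`), and the choice to score each key by attention probability times the norm of its Value vector are assumptions made here, and the real implementation is described as a block-sparse Triton kernel rather than a dense mask.

```python
import numpy as np


def stem_like_mask(q, k, v, start_ratio=1.0, end_ratio=0.25, min_keep=1):
    """Return a boolean (n, n) mask of the keys each query position keeps."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)  # no attention to future tokens
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Output-aware score: attention weight scaled by the Value vector's norm.
    impact = probs * np.linalg.norm(v, axis=1)[None, :]
    # Position decay: the kept fraction shrinks from start_ratio to end_ratio.
    ratios = np.linspace(start_ratio, end_ratio, n)
    keep = np.zeros((n, n), dtype=bool)
    for i in range(n):
        budget = max(min_keep, int(np.ceil(ratios[i] * (i + 1))))
        top = np.argsort(-impact[i, : i + 1])[:budget]
        keep[i, top] = True
    return keep
```

Under this schedule early query positions keep nearly all of their causal context while late positions keep only a fraction of it, which contrasts with a uniform top-k applied at every position.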
## Key Concepts
- [[stem-sparse-attention]] — the Stem framework
- [[causal-information-flow]] — the theoretical perspective
- [[token-position-decay]] — position-dependent sparse budget allocation
- [[output-aware-metric]] — value-aware token selection

View File

@@ -0,0 +1,30 @@
---
title: "Representation Learning Enables Scalable Multitask Deep RL"
source: "arXiv:2606.05555v1"
authors: "Johan Obando-Ceron, Lu Li, Scott Fujimoto, Pierre-Luc Bacon, Aaron Courville, Pablo Samuel Castro"
affiliation: "Mila, Université de Montréal, McGill, Google DeepMind"
year: 2026
category: "cs.LG, cs.AI"
published: "2026-06-04"
---
# Representation Learning Enables Scalable Multitask Deep RL
**Authors**: Johan Obando-Ceron, Lu Li, Scott Fujimoto, Pierre-Luc Bacon, Aaron Courville, Pablo Samuel Castro
**arXiv**: 2606.05555v1 [cs.LG, cs.AI]
**Affiliations**: Mila / UdeM / McGill / CIFAR / Google DeepMind
**Published**: 2026-06-04
## Abstract
Scaling RL to diverse multitask settings is a central challenge. We argue the primary driver is not model-based control but **representation learning**. Combining predictive model-based representations with high-capacity value function approximation is sufficient — even without planning. MR.Q, a model-free algorithm with auxiliary predictive objectives, outperforms world-model-based methods (Newt) while reducing computational overhead and improving wall-clock efficiency.
## Key Concepts
- [[predictive-representation-learning|Predictive Representation Learning]] — core thesis
- [[mrq-algorithm|MR.Q]] — the model-free agent with predictive objectives
- [[multitask-rl|Multitask RL]] — training across diverse task distributions
- [[representation-learning-rl|Representation Learning in RL]] — beyond reward-only supervision
- [[auxiliary-predictive-objectives|Auxiliary Predictive Objectives]] — dynamics/reward/termination prediction
- [[world-models-rl|World Models in RL]] — model-based comparison point
- [[model-free-rl|Model-Free RL]] — the advocated approach
- [[deep-rl-scaling|Scaling Deep RL]] — the broader goal

View File

@@ -0,0 +1,31 @@
---
source_url: https://arxiv.org/abs/2606.06260
ingested: 2026-06-10
sha256: <pending>
---
# OneReason Technical Report
- **Authors**: OneRec Team (Kuaishou) — Biao Yang, Boyang Ding, Chenglong Chu, Dunju Zang, Fei Pan, Han Li, Hao Jiang, Honghui Bao, Huanjie Wang, Jian Liang, Jiangxia Cao, Jiao Ou, Jiaxin Deng, Jinghao Zhang, Kun Gai, Lu Ren, Peiru Du, Pengfei Zheng, Rongzhou Zhang, Ruiming Tang, Shiyao Wang, Siyang Mao, Siyuan Lou, Teng Shi, Wei Yuan, Wenlong Xu, Xingchen Liu, Xingmei Wang, Xinqi Jin, Yan Sun, Yan Wang, Yifei Hu, Yingzhi He, Yufei Ye, Yuhao Wang, Yunhao Zhou, Yuqin Dai, Zhao Liu, Zhipeng Wei, Zhixin Ling, Ziming Li, Zixing Zhang, Ziyuan Liu, An Zhang, Changxin Lao, Chaoyi Ma, Chengru Song, Defu Lian, Fan Yang, Guowang Zhang, Hao Peng, Jiayao Shen, Jie Chen, Jun Xu, Junmin Chen, Kun Zhang, Kuo Cai, Mingxing Wen, Minmao Wang, Minxuan Lv, Qi Zhang, Qiang Luo, Sheng Yu, Shijie Li, Shijie Yi, Shuang Yang, Shugui Liu, Shuni Chen, Tinghai Zhang, Tingting Gao, Xiang Wang, Xiangyu Wu, Xiangyu Zhao, Xiao Lv, Xiaoyou Zhou, Xuming Wang, Yong Du, Zejian Zhang, Zhaojie Liu, Zhiyang Zhang, Zhuang Zhuang, Ziqi Wang, Ziyi Zhao
- **arXiv ID**: 2606.06260
- **Categories**: cs.IR, cs.AI, cs.CL
- **Published**: 2026-06-04
- **Affiliation**: Kuaishou
- **Status**: Work in progress
## Abstract
Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style "think before answer" paradigm in the LLM field, we conduct preliminary studies (i.e., OneRec-Think, OpenOneRec) to explore reasoning capability in generative recommendation. Nevertheless, we notice an unexpected phenomenon: the thinking mode does not show advantages over the non-thinking mode. Drawing insights from recent findings on CoT robustness in multi-modal language models, we argue that effective reasoning in recommendation rests on two factors: perception, the ability to ground itemic tokens in their underlying language semantics, and cognition, the ability to reorganize a user's behavior sequence into coherent latent interest points. We therefore propose OneReason, which includes: (1) strong itemic token perception in pre-training, (2) a three-level cognition-enhanced CoT format for recommendation tasks in SFT, and (3) a specialize-then-unify training recipe in RL to enhance the thinking ability.
## Key Contributions
1. **Perception-Cognition framework**: Two-pillar approach — perception (grounding itemic tokens in semantics) + cognition (structured CoT for reasoning)
2. **R0-R3 Reasoning Hierarchy**: Perception → Derivation → Evolution → Recommendation
3. **Specialize-then-Unify RL**: Domain-focused RL then cross-domain balancing via rejection sampling or multi-teacher distillation
4. **Thinking Supervision Transfer**: CoT training data improves non-thinking mode performance
5. **OneReason-Bench**: Comprehensive reasoning benchmark for recommendation
6. **Open-source**: OneReason-8B and OneReason-0.8B models
## Architecture
Pre-training → SFT (R0-R3 CoT format) → RL (specialize-then-unify) → Deployment

View File

@@ -0,0 +1,30 @@
---
title: "Uncertainty Estimation and Generalization Bounds for Modern Deep Learning (PhD Thesis)"
source_url: https://arxiv.org/abs/2606.13818
ingested: 2026-06-17
sha256: <computed>
---
# Uncertainty Estimation and Generalization Bounds for Modern Deep Learning
**Author:** Luis A. Ortega Andrés — Department of Computer Science, Autonomous University of Madrid
**Supervisor:** Daniel Hernández Lobato
**arXiv:** 2606.13818v1 [cs.LG] (2026-06-11) — PhD Thesis
## Abstract
Investigates how Bayesian principles can deepen understanding of modern deep learning systems. On the methodological side, introduces DVIP (Deep Variational Implicit Process), VaLLA (Variational Linearized Laplace Approximation), and FMGP (Fixed-Mean Gaussian Process). On the theoretical side, develops a unified PAC-Bayesian/large-deviation framework connecting diversity, smoothness, and stochasticity as mechanisms for generalization. Provides quantitative distribution-dependent explanation for double-descent.
## Key Concepts
- [[deep-variational-implicit-process|DVIP]]
- [[variational-linearized-laplace-approximation|VaLLA]]
- [[fixed-mean-gaussian-process|FMGP]]
- [[pac-bayesian-bounds|PAC-Bayesian bounds]]
- [[implicit-processes|Implicit processes]]
- [[function-space-modeling|Function-space modeling]]
- [[generalization-bounds|Generalization bounds]]
- [[double-descent|Double descent]]
- [[deep-gaussian-process|Deep Gaussian process]]

View File

@@ -0,0 +1,22 @@
---
source_url: https://arxiv.org/abs/2604.15097
ingested: 2026-06-14
sha256: skip
---
# From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution
**Authors:** Junjie Wang, Yiming Ren, Haoyang Zhang (Tsinghua University, EvoMap)
**arXiv:** 2604.15097v2 [cs.SE, cs.CL]
**Published:** April 2026 (v2: June 2026)
**Type:** Technical Report
## Abstract
This beta technical report asks how reusable experience should be represented so that it can function as effective test-time control and as a substrate for iterative evolution. We study this question in 4,590 controlled trials across 45 scientific code-solving scenarios. We find that documentation-oriented Skill packages provide unstable control: their useful signal is sparse, and expanding a compact experience object into a fuller documentation package often fails to help and can degrade the overall average. We further show that representation itself is a first-order factor. A compact Gene representation yields the strongest overall average, remains competitive under substantial structural perturbations, and outperforms matched-budget Skill fragments, while reattaching documentation-oriented material usually weakens rather than improves it. Beyond one-shot control, we show that Gene is also a better carrier for iterative experience accumulation: attached failure history is more effective in Gene than in Skill or freeform text, editable structure matters beyond content alone, and failure information is most useful when distilled into compact warnings rather than naively appended. On CritPt, gene-evolved systems improve over their paired base models from 9.1% to 18.57% and from 17.7% to 27.14%.
## Repositories
- https://github.com/EvoMap/skill2gep (Skill-to-Gene transformation)
- https://github.com/EvoMap/evolver (Gene evolution engine)
- https://github.com/openclaw/openclaw (Host runtime)

View File

@@ -0,0 +1,29 @@
---
title: "Weighted Universal Approximation of Differentiable Maps on Infinite-Dimensional Manifolds"
source_url: https://arxiv.org/abs/2606.09820
ingested: 2026-06-17
sha256: <computed>
---
# Weighted Universal Approximation of Differentiable Maps on Infinite-Dimensional Manifolds
**Authors:** Philipp Schmocker, Josef Teichmann
**arXiv:** 2606.09820v1 [math.FA] (2026-06-08)
**Keywords:** Machine learning, neural operator, Universal approximation, weighted approximation, infinite-dimensional manifold, locally convex topological vector space, bounded approximation property, Stone-Weierstrass theorem, Nachbin theorem, Tauberian theorem, non-anticipative functional, rough path, signature.
## Abstract
Generalizes the universal approximation theorem for functional input neural networks (FNN) to differentiable maps by including the approximation of the derivatives. A FNN maps the input from a possibly infinite-dimensional weighted manifold to the real-valued hidden layer, on which a non-linear scalar activation function is applied, and then returns the output into a Banach space via some linear readouts. By proving a weighted Nachbin theorem, establishes a universal approximation theorem (UAT) for differentiable maps, which goes beyond the usual formulation on compact sets and also includes the approximation of the derivatives. Leads to approximation results for non-anticipative functionals including the horizontal and vertical derivatives. Also shows that linear functions of the signature are able to approximate path space functionals including their directional derivatives.
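
Read literally, the architecture described above (a real-valued hidden layer, a scalar activation, then linear readouts into a Banach space) could be written as follows. The symbols $M$, $Y$, $a_n$, $\rho$, $y_n$, and $N$ are notation chosen here for illustration and need not match the paper's.

$$\varphi(x) = \sum_{n=1}^{N} y_n \, \rho\big(a_n(x)\big), \qquad a_n : M \to \mathbb{R}, \quad y_n \in Y$$

Here $M$ stands for the (possibly infinite-dimensional) weighted manifold of inputs, each $a_n$ maps an input to one real-valued hidden unit, $\rho$ is the scalar activation, and the $y_n$ are readout vectors in the Banach space $Y$. The theorem summarized above concerns approximating a differentiable map together with its derivatives by functions of this form.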
## Key Concepts
- [[functional-input-neural-networks|Functional input neural networks (FNN)]]
- [[universal-approximation-theorem|Universal approximation theorem (UAT)]]
- [[nachbin-theorem|Nachbin theorem]] / [[weighted-spaces|Weighted spaces]]
- [[infinite-dimensional-manifolds|Infinite-dimensional manifolds]]
- [[bastiani-calculus|Bastiani calculus]]
- [[non-anticipative-functionals|Non-anticipative functionals]]
- [[signature|Signature]]
- [[rough-path-theory|Rough path theory]]

View File

@@ -0,0 +1,30 @@
---
title: "Dead Directions: Geometric Singular Learning"
source: "arXiv:2606.05957v1"
authors: "Tejas Pradeep Shirodkar"
affiliation: "IIIT Hyderabad"
year: 2026
category: "cs.LG, stat.ML"
published: "2026-06-04"
pages: 139
---
# Dead Directions: Geometric Singular Learning
**Author**: Tejas Pradeep Shirodkar (IIIT Hyderabad)
**arXiv**: 2606.05957v1 [cs.LG, stat.ML]
**Published**: 2026-06-04 | 139 pages
## Abstract
Bridges singular learning theory and information geometry through one primitive: the **dead direction** — a unit vector where the Fisher metric degenerates, with KL order recoverable from directional Fisher curvature decay rate in original coordinates (no Hironaka resolution). Lifts to deep networks via K-FAC factorization, constructs DDCAdam optimizer, and enables readout of Watanabe's triple (lambda, m, nu) from a single checkpoint.
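
A toy example may help make the notion of a direction where the Fisher metric degenerates concrete. The sketch below uses a two-parameter regression model, y = a·b·x plus unit-variance Gaussian noise, which is chosen here for illustration and is not taken from the paper; it only finds zero-eigenvalue directions of the Fisher matrix and does not implement the curvature-decay readout of the KL order. The function names are hypothetical.

```python
import numpy as np


def fisher_matrix(a: float, b: float, xs: np.ndarray) -> np.ndarray:
    """Fisher information of y = a*b*x + N(0, 1) noise with respect to (a, b)."""
    # For unit-variance Gaussian noise, Fisher = E[grad_f grad_f^T],
    # where grad_f = (d f/d a, d f/d b) = (b*x, a*x).
    grads = np.stack([b * xs, a * xs], axis=1)
    return grads.T @ grads / len(xs)


def dead_directions(fisher: np.ndarray, tol: float = 1e-10) -> list:
    """Unit eigenvectors of the Fisher matrix whose eigenvalue is (numerically) zero."""
    eigvals, eigvecs = np.linalg.eigh(fisher)
    return [eigvecs[:, i] for i in range(len(eigvals)) if eigvals[i] < tol]
```

At (a, b) = (1, 0) the model does not change to first order when a moves, so the a-axis is a degenerate direction; at (0, 0) both axes are degenerate, reflecting a more singular point.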
## Key Concepts
- [[dead-direction|Dead Direction]] — core primitive bridging SLT and info geometry
- [[singular-learning-theory|Singular Learning Theory]] — Watanabe's framework
- [[information-geometry|Information Geometry]] — Amari's framework
- [[fisher-information-metric|Fisher Information Metric]] — the geometry object
- [[real-log-canonical-threshold|RLCT (lambda)]] — Watanabe's Bayesian invariant
- [[kl-order|KL Order]] — the bridge invariant
- [[watanabe-triple|Watanabe's Triple]] — (lambda, m, nu)
- [[ddcadam|DDCAdam]] — G-equivariant Adam-family preconditioner

View File

@@ -0,0 +1,35 @@
---
title: "From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments"
source_url: https://arxiv.org/abs/2606.04275
ingested: 2026-06-17
sha256: <computed>
---
# From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments
**Authors:** Saket Tiwari, Tejas Kotwal, George Konidaris — Brown University, Dept. of Computer Science & Applied Mathematics
**Published:** ICLR 2026
**arXiv:** 2606.04275v1 [cs.LG] (2026-06-02)
## Abstract
A novel theoretical framework for deep RL in continuous environments, modeling the problem as a continuous-time stochastic process drawing on stochastic control. Introduces a viable model of actor-critic that incorporates both exploration and stochastic transitions. For single-hidden-layer neural networks, the state of the environment can be formulated as a two time-scale process (environment time + gradient time). Using stochastic differential equations, derives — for the first time in continuous RL — an equation describing the infinitesimal change in state distribution at each gradient step under vanishingly small learning rate. Empirically corroborated on a toy LQR continuous control task.
## Key Concepts
- [[continuous-time-rl|Continuous-time reinforcement learning]] / [[stochastic-differential-equation|Stochastic differential equation]]
- [[wiener-process|Wiener process]] / [[ito-calculus|Itô calculus]]
- [[two-time-scale-process|Two time-scale process]] (environment time + gradient time)
- [[exploratory-dynamics|Exploratory dynamics]]: SDE with policy + environment noise
- [[linearized-neural-network|Linearized neural network]] / [[neural-tangent-kernel|NTK]] / [[infinite-width-limit|Infinite-width limit]]
- [[martingale-clt|Martingale central limit theorem]] / [[control-affine-mdp|Control-affine MDP]]
- [[linear-quadratic-regulator|LQR]]
## Key Results
- Closed system of only 5 time-dependent variables describing one-step gradient change
- First equation for gradient-time evolution of state distribution under vanishing step size for NNs
- Nonparametric formulation bridging stochastic control and over-parameterized RL
- Exploratory dynamics outperforms additive Wiener noise in state-action coverage

View File

@@ -0,0 +1,22 @@
---
source_url: https://arxiv.org/abs/2605.22166
ingested: 2026-06-11
sha256: placeholder
---
# Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
**Authors:** Tianshi Xu†, Huifeng Wen†, Meng Li (Peking University) — †Equal contribution
**arXiv:** 2605.22166v2 [cs.AI] — May 2026
**Code:** https://github.com/Tianshi-Xu/Life-Harness
## Abstract
LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execution, feedback interpretation, and trajectory control. While existing agent adaptation methods mainly update model parameters, many failures in deterministic, rule-governed domains stem from mismatches at the model-environment interface. We propose Life-Harness, a lifecycle-aware runtime harness that improves frozen LLM agents without changing model weights or evaluation environments. Life-Harness evolves from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation, and remains fixed for evaluation on unseen tasks. On seven deterministic environments from τ-bench, τ²-bench, and AgentBench, Life-Harness improves 116 out of 126 model-environment settings across 18 model backbones, with an average relative improvement of 88.5%. Harnesses evolved only from Qwen3-4B-Instruct trajectories transfer to 17 other models, showing that Life-Harness captures reusable environment-side structure rather than model-specific behavior.
## Key Contributions
1. Formulation of harness-based runtime interface adaptation for deterministic LLM agents
2. Life-Harness: a lifecycle-aware framework with four intervention layers
3. Cross-model transfer: harnesses evolved on one model (Qwen3-4B) generalize to 17 others
4. Complementary to model training: enables Qwen2.5-32B to outperform its tool-use-trained derivative

View File

@@ -0,0 +1,33 @@
---
source_url: https://arxiv.org/abs/2602.02343
ingested: 2026-06-01
sha256: raw-from-pdf
---
# Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
**Authors:** Ziwen Xu¹², Chenyan Wu¹, Hengyu Sun¹, Haiwen Hong²*, Mengru Wang¹, Yunzhi Yao¹, Longtao Huang², Hui Xue², Shumin Deng¹, Zhixuan Chu¹, Huajun Chen¹, Ningyu Zhang¹*
**Affiliations:** ¹Zhejiang University, ²Alibaba Group
**arXiv:** 2602.02343 (v3, 12 Apr 2026)
**Code:** https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md
## Abstract
Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis. This analysis separates control effects into two components: preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation. Both components are measured on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility.
## Key Contributions
1. **Unified View** — casts local weight fine-tuning, LoRA, and activation steering as dynamic weight updates: `h_{i+1} = (W + m₁ΔW)h_i + (b + m₂Δb)`
2. **Preference-Utility Analysis** — decomposes control into preference (target concept alignment) and utility (task validity) on a shared log-odds scale
3. **Activation Manifold Hypothesis** — explains the preference-utility trade-off: steering pushes representations off the training-induced activation manifold, causing utility degradation
4. **Three-Stage Preference Dynamics** — Linear Region → Transitional Region → Convergence Region as steering factor m varies
5. **SPLIT Method** — Steering with Preference-UtiLity IntervenTion, a training objective that jointly optimizes preference and utility
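
The unified update in the first contribution can be checked numerically on a single linear layer. The sketch below is a minimal illustration; `controlled_layer` and its argument names are hypothetical, and mapping activation steering to the bias term and LoRA to a low-rank weight term is the natural reading of the formula rather than code from the paper.

```python
import numpy as np


def controlled_layer(h, W, b, dW=None, db=None, m1=0.0, m2=0.0):
    """Compute (W + m1*dW) h + (b + m2*db); missing deltas default to zero."""
    dW = np.zeros_like(W) if dW is None else dW
    db = np.zeros_like(b) if db is None else db
    return (W + m1 * dW) @ h + (b + m2 * db)


rng = np.random.default_rng(0)
W, b, h = rng.standard_normal((4, 3)), rng.standard_normal(4), rng.standard_normal(3)
base = W @ h + b

# Activation steering: add a vector v to the layer output (dW = 0, db = v).
v = rng.standard_normal(4)
steered = controlled_layer(h, W, b, db=v, m2=2.0)

# LoRA-style adaptation: a rank-2 weight update dW = B @ A (db = 0).
B, A = rng.standard_normal((4, 2)), rng.standard_normal((2, 3))
adapted = controlled_layer(h, W, b, dW=B @ A, m1=0.5)
```

In both cases the scalar multiplier (m2 for steering, m1 for LoRA) plays the role of the steering factor whose variation the preference dynamics describe.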
## Experimental Setup
- Models: Gemma-2-9B-IT, Qwen-2.5-7B-Instruct
- Tasks: Psychopathy, PowerSeeking, AxBench (top 10 concepts)
- Intervention forms: Local Weight, LoRA, Vector (DiffMean/SFT/RePS)
- Curve fitting R² > 0.95 across most settings

View File

@@ -14,7 +14,7 @@ tags: ["agent", "skill", "optimization", "text-space", "self-evolving"]
**Authors:** Yifan Yang*, Ziyang Gong*, Weiquan Huang*, Qihao Yang*, Ziwei Zhou*, Zisu Huang*, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo (* equal contribution)
**Affiliation:** Microsoft, SJTU, Tongji, Fudan
**arXiv:** [2605.23904](https://arxiv.org/abs/2605.23904) (v2, 25 May 2026)
**Code:** https://aka.ms/SkillOpt
**Code:** https://github.com/microsoft/SkillOpt (MIT License, 3.7k stars)
## Abstract

View File

@@ -0,0 +1,30 @@
---
title: "A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders"
source_url: https://arxiv.org/abs/2606.07007
ingested: 2026-06-17
sha256: <computed>
---
# A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders
**Authors:** Chenhao Zhang, Chris Lin, Su-In Lee — University of Washington, Paul G. Allen School of CSE
**arXiv:** 2606.07007v1 [cs.LG] (2026-06-05)
**Published:** Preprint, June 8, 2026
## Abstract
A unified mathematical framework for geometric understanding of concept learning and neuron interpretation in sparse autoencoders (SAEs). Formalizes concepts as sets of data points and casts concept learning as a set-alignment problem between human-defined and model-induced concepts. Distinguishes three increasingly strong notions of learning — detection, separation, and approximation — and yields geometric conditions, error bounds, and capacity constraints. Provides a set-theoretic account for SAE phenomena including feature splitting, feature absorption, feature families, and hierarchical concepts. Connects concept learning and neuron interpretation through formal concept analysis, showing that the two directions need not agree and their many-to-many structure can be organized by concept lattices.
## Key Concepts
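- [[sparse-autoencoder|Sparse autoencoder]] / [[polysemanticity|Polysemanticity]]
- [[mechanistic-interpretability|Mechanistic interpretability]]
- [[concept-learning|Concept learning (geometric)]] / [[formal-concept-analysis|Formal concept analysis]]
- [[feature-splitting|Feature splitting]] / [[feature-absorption|Feature absorption]] / [[feature-family|Feature family]]
- [[absolute-gating|Absolute gating vs. relative gating]]
- [[hyperplane-arrangements|Hyperplane arrangements]]
- [[concept-lattice|Concept lattice]]
- [[superposition|Superposition]]
- [[linear-representation-hypothesis|Linear representation hypothesis]]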

View File

@@ -0,0 +1,64 @@
---
title: "Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective"
created: 2026-06-03
updated: 2026-06-03
type: raw-paper
arxiv_id: "2605.17967"
authors:
- "Junpeng Zhang"
- "Lei Cheng"
- "Guoxi Zhang"
- "Hua Cai"
- "Qing Xu"
- "Quanshi Zhang"
affiliations:
- "Shanghai Jiao Tong University"
- "Beijing Institute for General Artificial Intelligence"
- "UniDT"
published: "2026-05-18"
venue: "arXiv preprint"
primary_category: "cs.AI"
source: "https://arxiv.org/abs/2605.17967"
code: null
tags: [SFT, interactions, LLM, fine-tuning, interpretability, overfitting, early-stopping]
---
# Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective
**Authors**: Junpeng Zhang, Lei Cheng, Guoxi Zhang, Hua Cai, Qing Xu, Quanshi Zhang (Shanghai Jiao Tong University, BIGAI, UniDT)
**arXiv**: 2605.17967 | **Published**: 2026-05-18 | **Category**: cs.AI
## Abstract
This paper explores a scientific question in supervised fine-tuning (SFT): why SFT is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. The authors find that the evolution of interactions during SFT can effectively explain the inconsistent effectiveness of SFT for LLMs. Specifically: (1) SFT primarily removes noise-like interactions, while rarely acquiring reliable new interactions. (2) This denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. These findings are validated across multiple LLMs and datasets.
## Key Concepts
- **Interaction-based explanation**: Decomposing LLM inference patterns into AND-OR interactions between input tokens
- **Three interaction types**: Removed (eliminated during SFT), Preserved (retained throughout), Newly emerged (acquired during SFT)
- **Two-stage SFT dynamics**: Brief denoising stage (~1000 steps) → prolonged overfitting stage
- **Interaction quality metrics**: Generalizability (γ) and uncancelled-effect ratio (ρ)
- **Preserved interactions as inference backbone**: A small set of low-order, generalizable interactions supports the majority of token prediction
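
To make the idea of an AND interaction concrete, the sketch below computes Harsanyi-style interaction effects, I(S) = Σ_{T⊆S} (−1)^{|S|−|T|} v(T), for a toy scoring function. It covers only the AND part; the OR part and the way the paper obtains v(T) from a masked LLM input are not shown. The function names and the toy `v` are assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable, Iterator


def subsets(items: Iterable[int]) -> Iterator[FrozenSet[int]]:
    """Yield every subset of items, from the empty set up to the full set."""
    items = tuple(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def and_interactions(n: int, v: Callable[[FrozenSet[int]], float]) -> Dict[FrozenSet[int], float]:
    """Harsanyi interaction I(S) for every subset S of n input variables."""
    result = {}
    for s in subsets(range(n)):
        # Inclusion-exclusion over all subsets T of S.
        result[s] = sum((-1) ** (len(s) - len(t)) * v(t) for t in subsets(s))
    return result


def toy_v(kept: FrozenSet[int]) -> float:
    """Toy score: tokens 0 and 1 matter only jointly; token 2 matters alone."""
    return (1.0 if {0, 1} <= kept else 0.0) + (0.5 if 2 in kept else 0.0)


effects = and_interactions(3, toy_v)
```

In this toy case only the pair {0, 1} and the singleton {2} receive non-zero effects, and all effects sum to the score of the full input, which is the property that lets a model output be read as a sum of interactions.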
## Experimental Setup
- **Models**: Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, Llama-2-7B-Chat, Llama-3-8B-Instruct, Gemma-3-4B-it
- **Datasets**: GoEmotions, Unilaw-R1-Data, Databricks-Dolly-15k
- **Method**: LoRA fine-tuning, interaction extraction via AND-OR decomposition
- **GPUs**: 8× NVIDIA Tesla V100-PCIE-32GB
## Five Core Findings
1. LLMs learn only a few newly emerged interactions in the first (denoising) stage, but many in the second (overfitting) stage
2. Early-emerged interactions are more generalizable; later-emerged interactions behave like noise
3. Interaction removal occurs primarily within the very short first stage
4. Removed interactions are predominantly noise: high-order, non-generalizable, mutually canceling
5. Preserved interactions (small set, low-order) exhibit high generalizability and weak cancellation — they form the backbone of LLM inference
## Practical Implications
- SFT is effective but its useful regime is surprisingly short
- Interactions can serve as diagnostic signals for monitoring SFT progress
- The two-stage dynamics provide a principled criterion for early stopping in end-to-end SFT
- The findings challenge the belief that fine-tuning on massive datasets is necessarily beneficial


@@ -0,0 +1,39 @@
---
title: "TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization"
source_url: https://arxiv.org/abs/2606.05859
ingested: 2026-06-17
sha256: <computed>
---
# TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization
**Authors:** Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang, Qicheng Li (TMCC, College of Computer Science, Nankai University, Tianjin)
**arXiv:** 2606.05859v1 [cs.CL] (2026-06-04)
**Code:** https://github.com/NKU-LITI/TARPO-master
## Abstract
Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics.
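
The routing step described in the abstract can be pictured with a small sketch. It assumes the action head is a single linear layer followed by a sigmoid, which is an illustrative choice and not confirmed by the paper; `route_probability`, `sample_route`, and all numeric values are hypothetical.

```python
import math
import random


def route_probability(hidden, weights, bias):
    """Probability of the latent (soft-token) mode: sigmoid of a linear score."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def sample_route(hidden, weights, bias, rng):
    """Sample a binary routing action and return it with its log-probability."""
    p_latent = route_probability(hidden, weights, bias)
    action = "latent" if rng.random() < p_latent else "explicit"
    log_prob = math.log(p_latent if action == "latent" else 1.0 - p_latent)
    return action, log_prob


rng = random.Random(0)
hidden = [0.2, -0.1, 0.4]      # hypothetical hidden state
weights = [0.0, 0.0, 0.0]      # hypothetical router weights (zero-initialized)
bias = -2.0                    # negative bias favors explicit tokens at the start

p = route_probability(hidden, weights, bias)
actions = [sample_route(hidden, weights, bias, rng)[0] for _ in range(1000)]
print(round(p, 4), actions.count("latent"))
```

Because the routing action is sampled and not chosen by argmax, it has a log-probability that a policy-gradient update can use, which is the stochasticity the abstract refers to.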
## Key Concepts
- [[latent-reasoning|latent reasoning]] vs [[chain-of-thought|chain of thought]]
- [[continuous-representation|continuous representation]]
- [[soft-token]] / [[hard-token]]
- [[action-routing-policy|action-routing policy]]
- [[action-head-router|action head router]]
- [[token-wise-routing|token-wise routing]]
- [[hybrid-reasoning|hybrid reasoning]]
- [[grpo|GRPO]]
- [[coconut|COCONUT]]
- [[hrpo|HRPO]]
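
For reference, the group-relative advantage used by GRPO-style methods can be sketched as below. This follows the commonly described formulation (each reward normalized by the mean and standard deviation of its sampling group); the population standard deviation and the `eps` value are assumptions, and TARPO's exact variant may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

A single advantage per response can then weight the log-probabilities of both the token choices and the routing choices in that response, which is one way to read "shared" in the abstract.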
## Key Findings
- TARPO achieves superior in-domain performance across Qwen2.5 (1.5B, 3B, 7B), improving over GRPO by 0.52% Pass@1 and 1.22% Pass@32 on average
- Out-of-distribution generalization: 4.76% improvement on HumanEval over GRPO, with 18% fewer generated tokens
- Cross-architecture generalization verified on Llama-3.1-8B
- Adaptive switching behavior: router learns to select soft tokens for key mathematical operations while using hard tokens for structural text
- Action head bias initialization and KL penalty are critical hyperparameters for stable training


@@ -0,0 +1,41 @@
---
source_url: https://arxiv.org/abs/2606.12344v1
ingested: 2026-06-15
arxiv_id: 2606.12344v1
sha256: TBD
---
# Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
**Authors:** Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang
**Affiliations:** TokenRhythm Technologies, Infinigence AI, City University of Hong Kong, SEE Fund, Peking University, Shanghai Jiao Tong University, Beijing Jiaotong University, Tsinghua University
**arXiv:** 2606.12344v1 | **Date:** 2026-06-10 | **Categories:** cs.LG, cs.CL
**Resources:** https://github.com/opensquilla/claw-swe-bench | https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench
## Abstract
General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator.
The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns.
Key findings:
- OpenClaw with minimal direct-diff adapter: 19.1% Pass@1
- OpenClaw with full adapter: 73.4% Pass@1 (same GLM 5.1 backbone)
- Model choice changes Pass@1 by 29.4 pp; harness choice by 27.4 pp
- Systems with similar accuracy can differ substantially in total API cost
- Claw-SWE-Bench treats harness and cost accounting as first-class evaluation axes
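
To make the accuracy-versus-cost comparison concrete, here is a minimal sketch of selecting the non-dominated systems. The system names and numbers are invented for illustration and are not results from the paper; `pareto_frontier` is a hypothetical helper.

```python
def pareto_frontier(systems):
    """Return names of systems not dominated on (higher Pass@1, lower cost)."""
    frontier = []
    for name, acc, cost in systems:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in systems
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (name, Pass@1 in percent, total API cost in arbitrary units); illustrative only.
systems = [
    ("system_a", 70.0, 120.0),
    ("system_b", 69.5, 40.0),
    ("system_c", 60.0, 60.0),
    ("system_d", 50.0, 10.0),
]
print(pareto_frontier(systems))
```

Here `system_c` drops out because `system_b` is both more accurate and cheaper, while `system_a` and `system_b` both remain despite near-identical accuracy and a threefold cost gap.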
## Key Concepts
- Agent harness (claw) as controlled experimental variable
- Adapter protocol: lifecycle methods (create_agent, send_task, backup_session, delete_agent, get_docker_args)
- Full adapter vs bare adapter design
- Cost-aware benchmarking: Pass@1 + total API cost + wall-clock duration + cache hit rate
- Pareto frontier of accuracy vs cost
- Claw-SWE-Bench Lite: 80-instance cost-aware, rank-aware subset
- Future-commit cleanup for fair evaluation
- Patch-based evaluation contract (git diff from /testbed)
- Harness × model interaction effects
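
The lifecycle methods listed above could be organized roughly as follows. Only the five method names come from these notes; the signatures, the call order, and the `RecordingAdapter` and `run_instance` helpers are assumptions for illustration, not the benchmark's actual API.

```python
from abc import ABC, abstractmethod


class ClawAdapter(ABC):
    """Hypothetical adapter interface; argument and return types are assumed."""

    @abstractmethod
    def get_docker_args(self) -> list[str]: ...

    @abstractmethod
    def create_agent(self, instance_id: str) -> str: ...

    @abstractmethod
    def send_task(self, agent_id: str, prompt: str) -> None: ...

    @abstractmethod
    def backup_session(self, agent_id: str) -> dict: ...

    @abstractmethod
    def delete_agent(self, agent_id: str) -> None: ...


class RecordingAdapter(ClawAdapter):
    """Dummy adapter that only records which lifecycle methods were called."""

    def __init__(self):
        self.calls = []

    def get_docker_args(self):
        self.calls.append("get_docker_args")
        return ["--network", "none"]

    def create_agent(self, instance_id):
        self.calls.append("create_agent")
        return f"agent-{instance_id}"

    def send_task(self, agent_id, prompt):
        self.calls.append("send_task")

    def backup_session(self, agent_id):
        self.calls.append("backup_session")
        return {"agent_id": agent_id}

    def delete_agent(self, agent_id):
        self.calls.append("delete_agent")


def run_instance(adapter, instance_id, prompt):
    """Drive one benchmark instance through the assumed lifecycle order."""
    docker_args = adapter.get_docker_args()
    agent_id = adapter.create_agent(instance_id)
    try:
        adapter.send_task(agent_id, prompt)
        session = adapter.backup_session(agent_id)
    finally:
        adapter.delete_agent(agent_id)  # always clean up, even if the task fails
    return {"docker_args": docker_args, "session": session}


adapter = RecordingAdapter()
result = run_instance(adapter, "demo-1", "Fix the failing test.")
print(adapter.calls)
```

Keeping the harness behind one narrow interface like this is what lets the prompt, budget, and evaluator stay fixed while the harness varies.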