20260625: lots of new content

2026-06-25 14:08:47 +08:00
parent 91fac5b6fc
commit 6021dea160
375 changed files with 19263 additions and 251 deletions

@@ -0,0 +1,38 @@
---
title: "Arbor: Toward Generalist Autonomous Research via Hypothesis-Tree Refinement"
author: "Jiajie Jin†‡, Yuyang Hu†, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou*"
source: "arXiv 2606.11926v1"
date: "2026-06-10"
type: paper
venue: "arXiv (cs.CL, cs.AI)"
tags: ["autonomous-research", "agent", "hypothesis-tree", "coordinator-executor", "ao"]
code: "https://github.com/RUC-NLPIR/Arbor"
---
# Arbor: Autonomous Research via Hypothesis-Tree Refinement
> Jin†‡, Hu†, Qiu, Dai, Luo, Dong, Li, Zhao, Ma, Zhang, Wu, Liu, Yang, Li, Wang, Qian, Zhu, Dou*
> Renmin University / Microsoft Research | arXiv:2606.11926v1 | Jun 2026
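## Core Problem
How can an AI agent run the explore-experiment-abstract loop in long-horizon autonomous research? Scientific progress depends on repeatedly testing directions, interpreting evidence, and carrying lessons forward, yet existing agents treat these steps as isolated local attempts rather than a cumulative process.
## Core Framework: Hypothesis Tree Refinement (HTR)
Arbor models autonomous research as **Autonomous Optimization (AO)**: the agent improves an initial research artifact through iterative experiments, without step-level human supervision. The core state is a persistent hypothesis tree:
### Tree node = research unit ⟨h, ι, µ⟩
- **h (Hypothesis)**: a verifiable, falsifiable claim about an improvement
- **ι (Insight)**: a reusable interpretation of the evidence; compact semantic memory rather than an execution log
- **µ (Metadata)**: status, score, and git branch/commit references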
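The research unit above maps naturally onto a small tree data structure. The following is a minimal sketch under stated assumptions, not Arbor's actual code: the class name `ResearchUnit`, its field names, the status strings, and the `refine`/`frontier` helpers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResearchUnit:
    """One tree node <h, iota, mu>; all names here are illustrative."""
    hypothesis: str                      # h: falsifiable improvement claim
    insight: str = ""                    # iota: compact, reusable lesson
    status: str = "open"                 # mu: e.g. open / confirmed / refuted
    score: Optional[float] = None        # mu: held-out metric, if measured
    git_ref: Optional[str] = None        # mu: branch or commit reference
    children: list["ResearchUnit"] = field(default_factory=list)

    def refine(self, hypothesis: str) -> "ResearchUnit":
        """Attach a more specific child hypothesis and return it."""
        child = ResearchUnit(hypothesis)
        self.children.append(child)
        return child

    def frontier(self) -> list["ResearchUnit"]:
        """Collect open leaves: candidates for the next experiment."""
        if not self.children:
            return [self] if self.status == "open" else []
        return [leaf for c in self.children for leaf in c.frontier()]


root = ResearchUnit("Baseline training recipe")
a = root.refine("Warmup improves stability")
b = root.refine("Larger batch improves held-out score")
a.status, a.insight = "refuted", "Warmup had no measurable effect here"
open_leaves = [u.hypothesis for u in root.frontier()]
```

Keeping the insight on the node, separate from any execution log, is what would let a long-lived coordinator reuse the lesson after the short-lived executor is gone.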
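### Coordinator ↔ Executor: two roles
- **Coordinator** (long-lived): owns the global tree, manages the search frontier, chooses directions, propagates insights, and decides what to merge or prune
- **Executor** (short-lived, isolated worktree): implements and tests a single hypothesis, then returns a structured report
## Key Results
- 6 real research tasks (model training / harness engineering / data synthesis): best held-out result on all of them
- vs Codex/Claude Code: **2.5× on average** in relative held-out gain
- MLE-Bench Lite (GPT-5.5): **86.36%** Any Medal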

@@ -0,0 +1,29 @@
---
title: "NANO Filter Raw Archive"
created: 2026-06-22
type: raw
arxiv: "2410.15832"
source: "https://arxiv.org/abs/2410.15832"
---
# Nonlinear Bayesian Filtering with Natural Gradient Gaussian Approximation
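- **Authors**: Wenhan Cao, Tianyi Zhang, Zeju Sun, Chang Liu, Stephen S.-T. Yau, Shengbo Eben Li
- **Affiliations**: School of Vehicle and Mobility and Department of Mathematical Sciences, Tsinghua University; College of Engineering, Peking University; BIMSA
- **arXiv**: 2410.15832 [eess.SY]
- **Submitted**: 2024-10-21 | latest version v4: 2026-03-15
- **DOI**: https://doi.org/10.48550/arXiv.2410.15832
## Abstract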
Practical Bayes filters often assume the state distribution of each time step to be Gaussian for computational tractability, resulting in the so-called Gaussian filters. When facing nonlinear systems, Gaussian filters such as extended Kalman filter (EKF) or unscented Kalman filter (UKF) typically rely on certain linearization techniques, which can introduce large estimation errors. To address this issue, this paper reconstructs the prediction and update steps of Gaussian filtering as solutions to two distinct optimization problems, whose optimal conditions are found to have analytical forms from Stein's lemma. It is observed that the stationary point for the prediction step requires calculating the first two moments of the prior distribution, which is equivalent to that step in existing moment-matching filters. In the update step, instead of linearizing the model to approximate the stationary points, we propose an iterative approach to directly minimize the update step's objective to avoid linearization errors. For the purpose of performing the steepest descent on the Gaussian manifold, we derive its natural gradient that leverages Fisher information matrix to adjust the gradient direction, accounting for the curvature of the parameter space. Combining this update step with moment matching in the prediction step, we introduce a new iterative filter for nonlinear systems called **N**atural Gr**a**dient Gaussia**n** Appr**o**ximation filter, or NANO filter for short. We prove that NANO filter locally converges to the optimal Gaussian approximation at each time step. Furthermore, the estimation error is proven exponentially bounded for nearly linear measurement equation and low noise levels through constructing a supermartingale-like property across consecutive time steps.
## Key Concepts
- Natural gradient descent on Gaussian manifold
- Fisher information matrix
- Moment matching (prediction step)
- Stein's lemma for optimality conditions
- Gibbs posterior for robustness
- Pseudo-Huber loss for outlier handling
- Convergence proof & exponential error bound

@@ -0,0 +1,33 @@
---
title: "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality"
source: arXiv
source_id: 2405.21060
authors:
- Tri Dao (Princeton University)
- Albert Gu (Carnegie Mellon University)
published: 2024-05-31
venue: ICML 2024
categories:
- cs.LG
---
# Transformers are SSMs
## Abstract
While Transformers dominate language modeling, state-space models (SSMs) such as Mamba have matched or outperformed them at small-to-medium scale. This paper shows these model families are closely related through **structured state space duality (SSD)**, connected via **semiseparable matrices**. The SSD framework enables Mamba-2, a refined selective SSM that is 2-8x faster than Mamba while competitive with Transformers.
## Core Contributions
1. **SSD Framework**: Equivalence between SSMs and semiseparable matrices → connects SSM recurrence with attention-like quadratic forms
2. **Structured Masked Attention (SMA)**: Generalizes linear attention with data-dependent position masks
3. **SSD Algorithm**: Block decomposition of semiseparable matrices, leveraging both linear (recurrent) and quadratic (attention-like) forms
4. **Mamba-2 Architecture**: Multi-head SSM design with tensor parallelism support
5. **Systems Optimizations**: TP, sequence parallelism, variable-length training
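The duality in contribution 1 can be checked numerically in the simplest case. The sketch below assumes a scalar state and scalar per-step parameters `a`, `b`, `c` (a simplification of the paper's setting; the function names are illustrative). It computes the same output once with the linear recurrence and once by multiplying the input with a lower-triangular semiseparable matrix.

```python
def ssm_recurrent(a, b, c, x):
    """Linear form: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t."""
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys


def semiseparable_matrix(a, b, c):
    """M[i][j] = c_i * (a_{j+1} * ... * a_i) * b_j for j <= i, else 0."""
    n = len(a)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        decay = 1.0                      # empty product when j == i
        for j in range(i, -1, -1):
            M[i][j] = c[i] * decay * b[j]
            decay *= a[j]                # product now covers a_j .. a_i
    return M


def ssm_quadratic(a, b, c, x):
    """Quadratic (attention-like) form: y = M @ x."""
    M = semiseparable_matrix(a, b, c)
    n = len(x)
    return [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]


a = [0.9, 0.5, 0.8, 0.7]
b = [1.0, -0.5, 0.3, 2.0]
c = [0.2, 1.5, -1.0, 0.4]
x = [1.0, 2.0, -3.0, 0.5]
y_linear = ssm_recurrent(a, b, c, x)
y_quadratic = ssm_quadratic(a, b, c, x)
```

The recurrent form costs O(n) per sequence and the matrix form O(n²); the SSD algorithm in contribution 3 mixes the two by applying the matrix form within blocks and the recurrence across blocks.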
## Key Concepts
- Structured State Space Duality (SSD), Semiseparable Matrices
- Structured Masked Attention (SMA), Linear Attention
- Selective SSMs, Scalar SSM, Head Structure for SSMs (MIS/MVA/GVA)
- SSD Algorithm, Block Decomposition, Tensor Contraction Duality
## URL
https://arxiv.org/abs/2405.21060

@@ -0,0 +1,32 @@
---
title: "Engram: Conditional Memory via Scalable Lookup (Raw Archive)"
created: 2026-06-25
updated: 2026-06-25
type: raw
tags: ["conditional-memory", "sparsity", "ngram", "mixture-of-experts"]
source: "https://arxiv.org/abs/2601.07372"
---
# Engram: Conditional Memory via Scalable Lookup — Raw Archive
## Metadata
- **Title**: Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
- **Authors**: Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, Wenfeng Liang
- **Affiliations**: Peking University, DeepSeek-AI
- **arXiv**: 2601.07372
- **Date**: 2026-01-12
- **Categories**: cs.CL, cs.AI
- **Code**: https://github.com/deepseek-ai/Engram
## Abstract
While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic N-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0).
## Key Contributions
1. Conditional memory as a new sparsity axis complementary to MoE
2. Engram module: modernized N-gram embedding with multi-head hashing, context-aware gating, depthwise convolution
3. Sparsity Allocation problem and U-shaped scaling law
4. Infrastructure-aware design: deterministic addressing enables host memory prefetching
5. Empirical validation at 27B-40B scale with comprehensive ablation
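To make the idea of O(1) N-gram lookup with multi-head hashing concrete, here is a minimal sketch. It is an illustration under assumptions, not Engram's implementation: the hash function (BLAKE2b), the summation across heads, the table layout, and the names `ngram_slot` and `lookup` are all hypothetical, and the context-aware gating and depthwise convolution from contribution 2 are omitted.

```python
import hashlib


def ngram_slot(tokens, head, table_size):
    """Deterministically hash an N-gram of token ids to a table row."""
    key = f"{head}:" + ",".join(map(str, tokens))
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % table_size


def lookup(tables, token_ids, pos, n=2):
    """Sum the per-head embeddings of the N-gram ending at `pos`.

    tables[head] is a list of embedding rows; each head hashes the same
    N-gram with a different salt, so collisions differ across heads.
    """
    ngram = token_ids[max(0, pos - n + 1): pos + 1]
    dim = len(tables[0][0])
    out = [0.0] * dim
    for head, table in enumerate(tables):
        row = table[ngram_slot(ngram, head, len(table))]
        out = [o + r for o, r in zip(out, row)]
    return out


# Two heads, 16 rows each, embedding dimension 4 (toy values).
tables = [[[float(h + r)] * 4 for r in range(16)] for h in range(2)]
vec = lookup(tables, [5, 7, 9], pos=2)
```

Because the row index depends only on the token ids, the addresses are known before the forward pass reaches the module, which is the property that makes the host-memory prefetching in contribution 4 possible.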

@@ -0,0 +1,56 @@
---
title: "MCP-Zero: Active Tool Discovery for Autonomous LLM Agents"
created: 2026-06-19
updated: 2026-06-19
type: paper-raw
source: https://arxiv.org/abs/2506.01056
arxiv_id: 2506.01056
version: v4
---
# MCP-Zero: Active Tool Discovery for Autonomous LLM Agents
**Authors**: Xiang Fei, Xiawu Zheng*, Hao Feng (Xiamen University, USTC)
**Published**: 2025-06-01 (v4: 2025-06-24)
**Venue**: arXiv:2506.01056 (cs.AI, cs.SE)
**Code**: https://github.com/xfey/MCP-Zero
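## Core Insight
Tool use in current LLM agents is **passive**: every tool schema is injected into the system prompt and the model picks from the list. This has two serious problems: (1) context overhead explodes (the GitHub MCP server alone needs 4,600+ tokens, and the full ecosystem 248K tokens); (2) the model loses decision autonomy, degrading from an "autonomous capability builder" to a "passive selector".
MCP-Zero flips the paradigm to **Active Tool Discovery**: the agent identifies its own capability gaps, generates structured tool requests on demand, and the system matches and returns the tools.
## Three Mechanisms
### 1. Active Tool Request
The model generates a structured request on its own: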
```
<tool_assistant>
server: File system allowing file operations
tool: Read file by filename
</tool_assistant>
```
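Key point: the request lives in the **semantic space of the tool documentation**, so it aligns better semantically than the raw user query.
### 2. Hierarchical Semantic Routing
Two-stage, coarse-to-fine retrieval:
- Stage 1: the server field is matched against server descriptions (including enhanced summaries)
- Stage 2: the tool field is used to rank tools within the selected servers
- Scoring: score = (s_server × s_tool) × max(s_server, s_tool)
- Complexity drops from O(n) to O(m+k), with m+k ≪ n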
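The two-stage routing and its scoring rule can be sketched as follows. The similarity values are assumed to be precomputed (for example, cosine similarities between embeddings); the function name `route`, the parameters `top_servers` and `top_tools`, and the server and tool names in the example are illustrative and not part of MCP-Zero's API.

```python
def route(server_scores, tool_scores, top_servers=2, top_tools=3):
    """Rank tools with score = (s_server * s_tool) * max(s_server, s_tool).

    server_scores: {server: s_server}
    tool_scores:   {server: {tool: s_tool}}
    """
    # Stage 1: keep only the best-matching servers.
    servers = sorted(server_scores, key=server_scores.get, reverse=True)
    ranked = []
    for server in servers[:top_servers]:
        s_server = server_scores[server]
        # Stage 2: rank tools only inside the selected servers.
        for tool, s_tool in tool_scores[server].items():
            score = (s_server * s_tool) * max(s_server, s_tool)
            ranked.append((score, server, tool))
    ranked.sort(reverse=True)
    return ranked[:top_tools]


server_scores = {"filesystem": 0.9, "github": 0.4, "slack": 0.1}
tool_scores = {
    "filesystem": {"read_file": 0.8, "write_file": 0.3},
    "github": {"get_file": 0.7},
    "slack": {"post_message": 0.9},
}
best = route(server_scores, tool_scores)
```

Multiplying by the larger of the two similarities rewards candidates where at least one level matches strongly, while the product still penalizes a weak match on either level.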
### 3. Iterative Capability Extension
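Supports multi-round iterative discovery: the model can build a cross-domain toolchain step by step (file → edit → execute), and when the current tools fall short it can refine the request and retrieve again.
## Key Data
- Dataset MCP-tools: 308 servers, 2,797 tools
- On APIBank, token consumption drops by **98%** while accuracy stays high
- Accurate selection within a tool-description space of 248.1K tokens
## Theoretical Analysis
- Active discovery is modeled as active learning: r* = arg max I(T*; r|s_t)
- Attention distribution: passive O(1/n) versus active O(1/k), with k ≪ n
- Semantic alignment advantage: cos(e_r, e_t) > cos(e_q, e_t)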

@@ -0,0 +1,36 @@
---
title: "A Bifurcation Theory Framework for Gradient Descent on the Edge of Stability"
created: 2026-06-23
type: paper-raw
arxiv: "2606.15551v1"
category: cs.LG
author: "Eric Gan"
date: 2026-06-14
venue: Preprint
---
# A Bifurcation Theory Framework for Gradient Descent on the Edge of Stability
- **Author**: Eric Gan (Independent Researcher, egan8@ucla.edu)
- **arXiv**: 2606.15551v1
- **Field**: cs.LG (Machine Learning)
- **Date**: 2026-06-14
- **Source**: https://arxiv.org/abs/2606.15551
## Abstract
The Edge of Stability (EoS) phenomenon, where gradient descent operates with sharpness exceeding the classical convergence threshold yet the loss decreases over long timescales, is ubiquitous in modern deep learning but remains poorly understood in realistic settings. Prior rigorous analyses have been largely confined to scalar or low-dimensional losses with specific structural forms. In this work, we develop a bifurcation theory framework for gradient descent on the edge of stability that applies directly to overparameterized neural networks. By decomposing the training dynamics into components normal and tangent to the manifold of minimizers, we show that stable EoS training arises from a flip bifurcation in the normal direction, governed by the sign of the first Lyapunov coefficient, while the tangent dynamics drift toward regions of decreasing sharpness. Under mild spectral and geometric assumptions on the loss landscape, we prove convergence to the minimizing manifold when training at the EoS threshold. As a corollary, we recover and unify prior results: we show that the product-stability condition of Gan (2026) is an instance of our framework.
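## Core Contributions
1. Develops a bifurcation-theoretic EoS framework that applies to overparameterized networks
2. Decomposes EoS dynamics into a flip bifurcation in the normal direction plus a tangential drift toward lower sharpness
3. Proves convergence to the manifold of minimizers at the EoS threshold (η = 2/λ_max) (Theorem 4.4)
4. Unifies product stability (Gan 2026) as a special case of the framework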
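For intuition about where the threshold η = 2/λ_max comes from, consider a toy case that is far simpler than the paper's setting: gradient descent on the one-dimensional quadratic f(x) = λx²/2. The update is the linear map x ← (1 − ηλ)x, and its multiplier crosses −1, the condition for a flip (period-doubling) bifurcation of a map, exactly at η = 2/λ. The function names below are illustrative.

```python
def gd_multiplier(eta, lam):
    """GD on f(x) = lam * x**2 / 2 is the map x <- (1 - eta * lam) * x."""
    return 1.0 - eta * lam


def run_gd(eta, lam, x0=1.0, steps=50):
    """Iterate gradient descent and return the final iterate."""
    x = x0
    for _ in range(steps):
        x -= eta * lam * x               # gradient of f is lam * x
    return x


lam = 4.0
threshold = 2.0 / lam                    # eta = 2 / lambda
below = abs(run_gd(0.9 * threshold, lam))   # multiplier -0.8: oscillates and decays
above = abs(run_gd(1.1 * threshold, lam))   # multiplier -1.2: oscillates and grows
```

On a pure quadratic the iterates simply diverge above the threshold; higher-order terms of the loss decide whether the oscillation instead stabilizes, which is the role the list above assigns to the first Lyapunov coefficient.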
## Key Technical Tools
- Center Manifold Theorem
- Projection Method
- First Lyapunov coefficient (c₁)
- Morse-Bott condition + spectral-gap assumption

@@ -0,0 +1,39 @@
---
title: "Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning"
source: arXiv
source_id: 2601.04805
authors:
- Siyuan Gan (Nanjing University)
- Jiaheng Liu (Nanjing University)
- Boyan Wang (Nanjing University)
- Tianpei Yang (Nanjing University)
- Runqing Miao (Jiutian Research)
- Yuyao Zhang (Jiutian Research)
- Fanyu Meng (Jiutian Research)
- Junlan Feng (Jiutian Research)
- Linjian Meng (Shanghai AI Laboratory)
- Jing Huo (Nanjing University)
- Yang Gao (Nanjing University)
published: 2026-01-08
updated: 2026-06-07
categories:
- cs.AI
venue: Preprint
---
# Thinking-Based Non-Thinking (TNT)
## Abstract
Large reasoning models (LRMs) achieve exceptional performance via long Chain-of-Thought (thinking), causing substantial computational overhead — the overthinking problem. RL-trained hybrid reasoning models that dynamically choose thinking/non-thinking modes suffer from **reward hacking**: the model generates thinking-like responses while being classified as non-thinking, receiving undeserved rewards.
Existing mitigations: (1) SFT with large datasets (high cost), or (2) uniform token limits on non-thinking (ineffective for varied query difficulties). TNT proposes **per-query dynamic token limits** derived from the thinking mode's solution length — leveraging the fact that LRMs' thinking mode ensures its solution component contains no additional thinking.
## Core Contributions
1. **TNT (Thinking-Based Non-Thinking)**: Dynamic per-query maximum token usage for non-thinking mode, derived from the solution component of thinking mode responses
2. **50% token reduction** vs DeepSeek-R1-Distill-Qwen while **improving accuracy** across 5 math benchmarks
3. **Optimal accuracy-efficiency trade-off** among all tested hybrid reasoning methods
4. **<10% reward hacking rate** across all datasets
5. Compatible with any RL algorithm (GRPO, PPO, DAPO, Dr.GRPO, GSPO)
## URL
https://arxiv.org/abs/2601.04805

@@ -0,0 +1,53 @@
---
title: "Dynamic ReAct: Scalable Tool Selection for Large-Scale MCP Environments"
created: 2026-06-19
updated: 2026-06-19
type: paper-raw
source: https://arxiv.org/abs/2509.20386
arxiv_id: 2509.20386
version: v1
---
# Dynamic ReAct: Scalable Tool Selection for Large-Scale MCP Environments
**Authors**: Nishant Gaurav, Adit Akarsh, Ankit Ranjan, Manoj Bajaj (agentr.dev)
**Published**: 2025-09-22
**Venue**: arXiv:2509.20386 (cs.SE, cs.AI, cs.IR)
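## Core Problem
Once the MCP tool ecosystem grows to hundreds or thousands of tools, the traditional ReAct agent's load-everything approach stops being feasible: the LLM context has a hard limit.
## Evolution Across Five Architectures
### 1. Baseline: Direct Semantic Search
The user query goes straight into the vector store → take top-k → bind to the LLM. Simple but very noisy (a query about an "unsubscribe link" returns Mailchimp's unsubscribe report instead of the Gmail tool).
### 2. Meta-Tool Query Construction
Vector search is exposed as a meta-tool; the LLM first constructs atomic search queries and then retrieves. More precise, but it still needs a large k.
### 3. Search and Load (★ best)
Two meta-tools: `search_tools` (two-stage search: k1=20 → deduplicate → cap of k2=5 per application) + `load_tools` (explicit loading after the LLM makes its selection). Results from multiple queries are merged, and fewer than 5 tools end up loaded.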
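The `search_tools` stage described above (retrieve k1 hits per query, merge and deduplicate, then cap at k2 per application) could look roughly like this. It is a sketch under assumptions: `search_fn` is a hypothetical retriever standing in for the vector store, and the result shape and the `fake_search` stub are invented for illustration.

```python
def search_tools(queries, search_fn, k1=20, k2=5):
    """Merge hits from several queries, dedupe, then cap hits per app.

    search_fn(query, k) is a hypothetical retriever that returns
    (score, app, tool) tuples for the top-k matches.
    """
    best = {}
    for query in queries:
        for score, app, tool in search_fn(query, k1):
            key = (app, tool)
            best[key] = max(score, best.get(key, score))   # dedupe, keep best score
    per_app = {}
    for (app, tool), _score in sorted(best.items(), key=lambda kv: -kv[1]):
        bucket = per_app.setdefault(app, [])
        if len(bucket) < k2:                                # per-application cap
            bucket.append(tool)
    return per_app


def fake_search(query, k):
    """Stub retriever used only to exercise the sketch."""
    hits = [(0.9, "gmail", "send_email"),
            (0.8, "gmail", "search_mail"),
            (0.7, "mailchimp", "unsubscribe_report")]
    return hits[:k]


candidates = search_tools(["send mail", "find unsubscribe link"], fake_search)
capped = search_tools(["send mail"], fake_search, k2=1)
```

The per-application cap keeps one verbose application from crowding out the others before the LLM picks what to pass to `load_tools`.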
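### 4. Application-Aware (Hierarchical Search)
Adds `search_apps` to locate the application first and then search its tools. Application filtering turns out to help little in semantic search: the LLM tends to include the app directly in its query.
### 5. Fixed Tool Set
Four fixed meta-tools fetch tool information and invoke tools dynamically. Cache efficiency is good, but performance degrades in long conversations.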
## Vector Retrieval Optimization
| Strategy | Top-5 | Top-10 |
|------|-------|--------|
| OpenAI text-embedding-3-large (baseline) | 40% | 64% |
| voyage-context-3 | 48% | 68% |
| **voyage-context-3 + Sonnet context enrichment** | **60%** | 68% |
| + BM25 hybrid | 56% | 72% |
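Context enrichment yields a 50% relative improvement over the baseline (Top-5: 40% → 60%).
## Key Innovations
- **Default tools**: create_table + web_search are always available, so no searches are wasted on generic tasks
- **Meta-tools as the "seven levers"**: LLM Client (1) + Meta Tools (4) + Tool Registry (1) + Vector DB (1)
- Tool loading reduced by **50%** with no drop in accuracy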

@@ -0,0 +1,94 @@
---
title: "Mamba: Linear-Time Sequence Modeling with Selective State Spaces"
authors: ["Albert Gu", "Tri Dao"]
date: 2023-12-01
arxiv_id: "2312.00752v2"
categories: ["cs.LG", "cs.AI"]
affiliations: ["Carnegie Mellon University", "Princeton University"]
paper_type: "conference"
code: "https://github.com/state-spaces/mamba"
---
# Mamba: Linear-Time Sequence Modeling with Selective State Spaces
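## Abstract
Foundation models are almost all built on the Transformer architecture, but the quadratic complexity of attention makes them inefficient on long sequences. Various subquadratic architectures (linear attention, gated convolution, structured state space models) have tried to replace attention, yet on core modalities such as language they have not matched Transformer quality. The paper identifies a fundamental weakness of these models, **the inability to perform content-based reasoning**, and addresses it with two key innovations: (1) making the SSM parameters functions of the input (the selection mechanism, S6), so the model can selectively propagate or forget information depending on the current token; (2) designing a hardware-aware parallel algorithm that computes efficiently in recurrent mode. The result is a minimal architecture, Mamba, with no attention and even no MLP blocks. Mamba has 5× the inference throughput of Transformers, scales linearly in sequence length, and reaches SOTA across several modalities including language, audio, and genomics. Mamba-3B outperforms Transformers of the same size and matches Transformers twice its size.
## Core Contributions
1. **Selection Mechanism (S6)**: makes the SSM parameters (Δ, B, C) input-dependent, moving from time-invariant (LTI) to time-varying
2. **Hardware-aware algorithm**: computes with a parallel associative scan in SRAM, avoiding the IO bottleneck between GPU HBM and SRAM
3. **Minimal Mamba architecture**: merges the SSM layer of the H3 architecture with the gated MLP into a single homogeneous block
4. **Selective Copying and Induction Heads synthetic tasks**: Mamba not only solves them easily but also extrapolates indefinitely (>1M tokens)
## Method Framework
### From S4 to S6
The key limitation of classic S4 is **linear time invariance (LTI)**: the parameters (Δ, A, B, C) are fixed across all time steps. The state update rule therefore does not change with input content, and the model cannot "selectively" attend to or ignore particular tokens.
Mamba's selection mechanism (S6) makes B, C, and Δ functions of the input x: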
```
B_t = s_B(x_t) # input-dependent input projection
C_t = s_C(x_t) # input-dependent output projection
Δ_t = τ_Δ(Δ + s_Δ(x_t)) # input-dependent step size
```
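The three equations above can be turned into a small sequential reference implementation. The sketch below is a single-channel toy in pure Python, not the fused CUDA kernel from the Mamba repository: the linear maps standing in for s_B, s_C, and s_Δ (`w_B`, `w_C`, `w_delta`, `delta_bias`) are hypothetical, τ_Δ is taken to be softplus, and B is discretized with the simplified rule B̄ = Δ·B.

```python
import math


def softplus(z):
    """tau_Delta: keeps the step size positive."""
    return math.log1p(math.exp(z))


def selective_ssm(xs, A, w_B, w_C, w_delta, delta_bias):
    """Single-channel selective SSM with a diagonal state of size len(A).

    A holds negative reals; w_B, w_C, w_delta and delta_bias are
    hypothetical stand-ins for the learned maps s_B, s_C and s_Delta.
    """
    h = [0.0] * len(A)
    ys = []
    for x in xs:
        delta = softplus(delta_bias + w_delta * x)   # Delta_t = tau(Delta + s_Delta(x_t))
        B = [w * x for w in w_B]                     # B_t = s_B(x_t)
        C = [w * x for w in w_C]                     # C_t = s_C(x_t)
        for n, a in enumerate(A):
            a_bar = math.exp(delta * a)              # zero-order hold for A
            b_bar = delta * B[n]                     # simplified discretization of B
            h[n] = a_bar * h[n] + b_bar * x          # input-dependent state update
        ys.append(sum(c * hn for c, hn in zip(C, h)))
    return ys


A = [-1.0, -0.5]
params = dict(A=A, w_B=[1.0, 0.5], w_C=[0.3, -0.2], w_delta=0.5, delta_bias=0.0)
ys = selective_ssm([1.0, 2.0, 3.0], **params)
```

Since every entry of A is negative, a large Δ_t pushes exp(Δ_t·a) toward 0 and the state forgets its history in favor of the current input, while a small Δ_t leaves the state almost unchanged; this is how the step size acts as the selection knob.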
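Core differences:
| Property | S4 (LTI) | S6 (Selective) |
|------|---------|---------------|
| Parameters | Time-invariant | Time-varying (input-dependent) |
| Computation mode | Convolution OR recurrence | Recurrence only (needs scan) |
| Selectivity | None | Yes (filter/retain) |
| Content-aware | No | Yes |
### Hardware-Aware Parallel Scan
The selection mechanism removes the convolutional equivalence: the model must be time-varying and can no longer be computed in parallel as a convolution. Mamba solves this with a **parallel associative scan (Blelloch scan)**:
1. Unroll the state update into a prefix-sum-style operation
2. Fuse kernels in GPU SRAM so the expanded state is never written to HBM
3. Inputs sit in HBM → load into SRAM → scan + discretization → write back to HBM
Result: up to 3× faster than previous methods on A100 GPUs.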
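Why a recurrence can be scanned in parallel at all: each step h ← a·h + b is an affine map, and composing affine maps is associative. The sketch below shows the combine operator and checks a divide-and-conquer scan against the sequential loop. It is a plain-Python illustration of the principle, not the work-efficient Blelloch variant and not a GPU kernel; the function names are illustrative.

```python
def combine(left, right):
    """Compose two steps h -> a*h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(steps):
    """Reference: run h_t = a_t * h_{t-1} + b_t from h_0 = 0."""
    h, out = 0.0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out


def tree_scan(steps):
    """Inclusive prefix scan by recursive halving.

    The two halves are independent, so they could run in parallel.
    """
    if len(steps) == 1:
        return list(steps)
    mid = len(steps) // 2
    left, right = tree_scan(steps[:mid]), tree_scan(steps[mid:])
    carry = left[-1]                     # summary of the whole left half
    return left + [combine(carry, r) for r in right]


steps = [(0.5, 1.0), (0.9, -2.0), (0.1, 3.0), (0.7, 0.5), (0.3, 1.5)]
h_sequential = sequential_scan(steps)
h_parallel = [b for _, b in tree_scan(steps)]   # with h_0 = 0, the state is the b part
```

In the selective SSM, a is the discretized state matrix entry and b is the discretized input term at each time step, both of which vary with the input; the scan does not require them to be constant.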
### Mamba Architecture
```
Input → Mamba Block → ... (×L) → Output
Mamba Block:
x → LayerNorm → [Linear(expand) → Conv1d → SiLU → SSM(S6)] → LayerNorm → Linear → + (residual)
```
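Key design choices:
- **No attention, no MLP**: a selective SSM replaces both
- **Expansion factor E=2**: the Linear layer expands d_model to 2× and then projects back
- **Residual connection + SiLU activation**
- **H3 simplification**: merges H3's two gated SSMs into a single selective SSM
## Experimental Results
- **Synthetic tasks**: Selective Copying and Induction Heads → Mamba generalizes to sequences of >1M tokens
- **Language modeling**: Mamba-3B beats Pythia-3B on pretraining perplexity and zero-shot evaluation and matches Pythia-7B, with 5× inference throughput
- **Audio**: cuts FID on SC09 speech generation by more than half
- **Genomics**: outperforms HyenaDNA and Transformers on DNA modeling
## Key Concepts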
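- [[selective-state-space]]: the S6 selection mechanism, an input-dependent SSM parameterization
- [[hardware-aware-algorithm]]: parallel scan optimized for the GPU memory hierarchy
- [[structured-state-space-models]]: the S4 predecessor, HiPPO matrix + diagonal structure
- [[selective-copy]]: a selective copying task that requires content awareness
- [[induction-heads]]: a mechanism proposed to explain LLM in-context learning
- [[hippo]]: the mathematical foundation of SSMs, High-order Polynomial Projection Operators
- [[content-based-reasoning]]: the core weakness Mamba identifies and addresses
## References
- Code: https://github.com/state-spaces/mamba
- S4 (Gu et al., 2022)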
- H3 (Fu et al., 2023)
- Selective Copying task (Arjovsky et al., 2016)
- Induction heads (Olsson et al., 2022)

@@ -0,0 +1,43 @@
---
title: "Dual-Channel Grounded World Modeling (DCGWM)"
source_id: "arXiv:2606.18688v1"
authors:
- "Akshay Hazare"
affiliations: "Independent Researcher"
date: 2026-06-17
categories: ["cs.LG", "cs.AI"]
note: "Position paper. Experimental validation in progress."
url: "https://arxiv.org/abs/2606.18688v1"
---
# Dual-Channel Grounded World Modeling (DCGWM)
**Authors**: Akshay Hazare (Independent)
**arXiv**: 2606.18688v1 | **Date**: 2026-06-17
**Categories**: cs.LG, cs.AI
**Position paper — experimental validation ongoing**
## Abstract
Joint Embedding Predictive Architectures (JEPAs) are a leading approach to world model representation learning. We identify a failure mode in JEPA-based world models grounded against two qualitatively distinct external signals: physical dynamics (sparse, high-magnitude, constraint-satisfying gradient corrections) and social-behavioral dynamics (diffuse, distribution-matching corrections). We term this **Objective Interference Collapse (OIC)**: joint learning in a shared latent space causes the dominant channel to systematically collapse the subordinate channel's representational subspace, in a manner not resolvable by loss weighting alone.
We propose **Dual-Channel Grounded World Modeling (DCGWM)**, designed to structurally prevent OIC through a partitioned latent space (Z_p ⊕ Z_b) with inward-only gradient flow. The Physical Grounding Channel updates only Z_p via VICReg-style alignment; the Social-Behavioral Grounding Channel updates only Z_b via alignment to emergent multi-agent simulation trajectories. An Inter-Channel Interface Module couples subspaces at the task level without cross-subspace gradients. An Asymmetric Grounding Adherence Loss penalizes rollout drift with a hard hinge for physical violations and a soft KL for behavioral divergence. A Generative Rendering Layer is architecturally isolated from the latent world model.
Three theoretical results: the partition removes the gradient-interference pathway; each grounded subspace inherits anti-collapse guarantees; generative isolation is necessary under stated assumptions.
## Key Contributions
1. **Objective Interference Collapse**: Formalization of a new collapse mode — when two grounding signals with incompatible statistical structures share a latent space
2. **DCGWM Architecture**: Partitioned latent space + inward-only gradient flow + separated grounding channels
3. **Asymmetric Grounding Adherence Loss (L_AGA)**: First loss for rollout drift under heterogeneous grounding with incompatible tolerance structures
4. **Isolation Necessity Theorem**: Under assumptions A1-A2, any α > 0 generative gradient causes world model drift
5. **LLM World Modeling Critique**: NTP-trained LLMs face inherent subspace collapse that DCGWM avoids by design
## Key Concepts
- [[objective-interference-collapse|OIC]] — The new collapse mode this paper identifies
- [[dcgwm|DCGWM]] — The architecture
- [[inward-only-gradient-flow|Inward-Only Gradient Flow]] — The key separation mechanism
- [[asymmetric-grounding-adherence-loss|L_AGA]] — Asymmetric rollout drift penalty
- [[rollout-drift|Rollout Drift]] — Multi-step prediction error accumulation
- [[isolation-necessity-theorem|Isolation Necessity]] — Formal generative isolation result

@@ -0,0 +1,71 @@
---
title: "A Collectivist, Economic Perspective on AI"
author: Michael I. Jordan
arxiv_id: "2507.06268"
categories: cs.CY, cs.AI, stat.ML
date: 2025-07-08
updated: 2025-12-15 (v3)
url: https://arxiv.org/abs/2507.06268
type: paper
tags:
- ai-economics
- collective-intelligence
- uncertainty
- mechanism-design
- foundation-models
---
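## Abstract
Information technology is in the midst of a revolution in which omnipresent data collection and machine learning are affecting the human world as never before. The word "intelligence" is used as a North Star for the technology's development, with human cognition viewed as the baseline. This view neglects the fact that humans are social animals and that much of our intelligence is social and cultural in origin. The way forward is not more data and compute, nor more attention to cognition or symbolic representations, but **a deep blending of economic and social concepts with computational and inferential concepts at the level of algorithm design**.
## Core Framework: Blending Three Styles of Thinking
Jordan proposes blending three styles of thinking into a new foundation for AI system design: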
```
Computational thinking → modularity, abstraction, scaling
Inferential thinking → data collection and prediction under uncertainty
Economic thinking → incentives, game-theoretic equilibria
```
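Pairwise blends have already become disciplines (e.g. algorithmic game theory), but the goal is the full blend of all three. The paper illustrates what this blend looks like through several cases.
## Key Cases
### 1. Inferential Thinking in Database Design (§2)
Traditional databases focus on computation (privacy protection, query optimization), but **inferential thinking** brings a different perspective: the target is not the patients already in the database but **predictions, with quantified uncertainty, for new patients from the same population**. This calls for generative models and causal inference ("what if" questions).
### 2. Statistical Contract Theory (§3)
[[statistical-contract-theory|Statistical contract theory]] (Bates et al., 2024) embeds hypothesis testing into the design of economic contracts. Core finding: in a sequential game, a contract is incentive-compatible if and only if its options can be expressed as **[[e-values|E-values]]**, functions whose expectation under the null hypothesis is ≤ 1 and which can be read as evidence accumulating over time (a nonnegative supermartingale).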
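A concrete e-value helps here. One standard construction, used below purely as an illustration and not necessarily the one in Bates et al., is a likelihood ratio: its expectation under the null is exactly 1, and the running product over independent observations is the "accumulated evidence". The Bernoulli setup and the function names are assumptions made for this sketch.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli(p) observation x in {0, 1}."""
    return p if x == 1 else 1.0 - p


def likelihood_ratio_e_value(x, p_null, p_alt):
    """E-value for one observation: alternative over null likelihood."""
    return bernoulli_pmf(x, p_alt) / bernoulli_pmf(x, p_null)


def expected_e_under_null(p_null, p_alt):
    """Exact expectation of the e-value when the null is true."""
    return sum(
        bernoulli_pmf(x, p_null) * likelihood_ratio_e_value(x, p_null, p_alt)
        for x in (0, 1)
    )


def accumulate(observations, p_null, p_alt):
    """Running product of e-values: evidence accumulated over time."""
    wealth, path = 1.0, []
    for x in observations:
        wealth *= likelihood_ratio_e_value(x, p_null, p_alt)
        path.append(wealth)
    return path


null_mean = expected_e_under_null(p_null=0.5, p_alt=0.7)
evidence = accumulate([1, 1, 0, 1, 1], p_null=0.5, p_alt=0.7)
```

Read as a bet, the running product is the wealth of someone wagering against the null: it cannot grow in expectation if the null is true, so large wealth counts as evidence against it.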
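### 3. Data Markets (§4.2)
[[data-markets|Three-layer data markets]] (Fallah et al., 2024): users → platform → third-party data buyers. Core tension: the platform must trade off service revenue (from users) against data-sales revenue (from buyers) while offering users privacy guarantees to keep them participating. This is modeled as a generalized Stackelberg game and solved for equilibrium.
### 4. Foundation Models and Prediction-Powered Inference (§4.3)
AlphaFold case: at the boundary of knowledge (quantum-fluctuating proteins) it gives high-confidence but thoroughly biased predictions. [[prediction-driven-inference|Prediction-powered inference]] (PPI) combines a small amount of local ground-truth data with global foundation-model predictions so that confidence intervals cover the true value again.
### 5. Probability Matching (Appendix C)
[[probability-matching|Probability matching]]: a mouse maze experiment in which the left arm holds twice as much food as the right. The decision-theoretically optimal mouse goes left every time; real mice match probabilities at 2:1. From a **population perspective** this is a Nash equilibrium: it avoids wasting resources and raises total social welfare. It is a micro-scale example of collectivist handling of uncertainty.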
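The population argument can be made concrete with a small calculation. The sketch assumes, as a modeling choice that the note above does not spell out, that the food in each arm is shared equally among the mice that choose it; the function names and the population size are illustrative.

```python
def payoff_per_mouse(frac_left, n_mice=300, food_left=2.0, food_right=1.0):
    """Food per mouse in each arm when an arm's food is shared equally."""
    left = food_left / (frac_left * n_mice)
    right = food_right / ((1.0 - frac_left) * n_mice)
    return left, right


def equilibrium_fraction(food_left=2.0, food_right=1.0):
    """Fraction going left at which no mouse gains by switching arms."""
    return food_left / (food_left + food_right)


p_star = equilibrium_fraction()
left_at_eq, right_at_eq = payoff_per_mouse(p_star)
left_crowded, right_crowded = payoff_per_mouse(0.99)   # nearly everyone goes left
```

Under this sharing assumption, a population that always chooses the richer arm leaves the right arm almost unused, so each mouse there eats far more; payoffs equalize only at the 2:1 split, which is the matching behavior described above.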
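## Implications for Education
Appendix B of the paper discusses UC Berkeley's **Data 8** course (which Jordan helped design in 2015), which blends "computational thinking + inferential thinking": students use Python histograms and permutation tests to answer real-world questions (water quality, deforestation, etc.). It now enrolls 1,500+ students per semester and is the fastest-growing course in Berkeley's history. Next step: adding economic thinking.
## Core Claims
- LLMs can be understood as **collectivist artifacts**: every interaction is implicitly a conversation with the billions of individuals who contributed micro-data
- "The fitting metaphor for AI is not a search engine or a chatbot but a **market**"
- A truly mature AI engineering discipline needs **modular, transparent design concepts** on the level of Maxwell's equations; the field is currently far from that
- The path is not to narrow AI down to simulating the human brain but to build **economic and inferential principles into the DNA of algorithm design**
## References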
- Bates et al. (2024). Principal-Agent Hypothesis Testing. arXiv:2205.06812
- Angelopoulos et al. (2023). Prediction-Powered Inference. Science 382, 669–674
- Fallah et al. (2024). On Three-Layer Data Markets. arXiv:2402.09697

@@ -0,0 +1,19 @@
# Structured Inference with Large Language Gibbs
- **arXiv**: 2606.19264v1
- **Published**: 2026-06-17
- **Authors**: Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer (University of Edinburgh, CIFAR)
- **Categories**: cs.LG, cs.CL
- **Code**: https://github.com/hyeok9855/large-language-gibbs
- **Source**: https://arxiv.org/abs/2606.19264
## Abstract
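Large Language Gibbs is a structured probabilistic inference scheme that uses an LLM's conditional distributions as the transition operator of a Gibbs sampler. Core idea: instead of producing a structured object in one autoregressive pass, iteratively resample individual variables conditioned on the others, using the LLM's next-token conditional. This avoids the bias that comes from a fixed generation order, and the resulting stationary distribution reflects a compromise among all the local conditionals. It is applied to sampling from synthetic distributions, consistency reasoning (GSM8K/TruthfulQA), and Bayesian structure learning.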
## Key Contributions
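1. Formalizes LLM conditionals as a Gibbs transition operator and characterizes the stationary distribution q^sym theoretically
2. Proposes three kernel variants: Basic Gibbs (direct conditional sampling), Barker Gibbs (preference comparison), Gambling Gibbs (betting decisions)
3. A random-permutation strategy removes variable-order bias
4. Validation in three application settings: correcting sampling bias, consistency reasoning, priors over causal structure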
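The mechanics of treating conditionals as a Gibbs transition operator can be shown on a toy problem with two binary variables. In this sketch the conditionals come from a known joint distribution rather than from an LLM, and the kernel is plain random-scan Gibbs (closest in spirit to the Basic variant); every name below is illustrative. The stationary distribution is computed exactly by power iteration instead of by sampling, so the result is deterministic.

```python
from itertools import product

STATES = list(product((0, 1), repeat=2))         # all values of (x0, x1)


def conditionals_from_joint(joint):
    """Return cond(i, state) = P(x_i = 1 | the other variable)."""
    def cond(i, state):
        one = tuple(1 if k == i else v for k, v in enumerate(state))
        zero = tuple(0 if k == i else v for k, v in enumerate(state))
        return joint[one] / (joint[one] + joint[zero])
    return cond


def gibbs_kernel(cond):
    """Random-scan Gibbs: pick a variable uniformly, resample it from cond."""
    K = {s: {t: 0.0 for t in STATES} for s in STATES}
    for s in STATES:
        for i in range(2):
            p1 = cond(i, s)
            for value, prob in ((1, p1), (0, 1.0 - p1)):
                t = tuple(value if k == i else v for k, v in enumerate(s))
                K[s][t] += 0.5 * prob
    return K


def stationary(K, iters=2000):
    """Power iteration on the transition kernel from a uniform start."""
    dist = {s: 1.0 / len(STATES) for s in STATES}
    for _ in range(iters):
        dist = {t: sum(dist[s] * K[s][t] for s in STATES) for t in STATES}
    return dist


joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
pi = stationary(gibbs_kernel(conditionals_from_joint(joint)))
```

Here the conditionals are mutually compatible, so the chain recovers the joint exactly. Conditionals elicited from an LLM need not be compatible with any single joint, which is why the stationary distribution is described above as a compromise among them.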

@@ -0,0 +1,21 @@
# What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis
- **arXiv**: 2606.20075v1
- **Published**: 2026-06-18
- **Authors**: Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen (Eastern Institute of Technology / Hong Kong Polytechnic University)
- **Categories**: cs.LG, cs.CL
- **Venue**: ICML 2026
- **Code**: https://github.com/EIT-NLP/Supervision-in-Latent-CoT
- **Source**: https://arxiv.org/abs/2606.20075
## Abstract
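Analyzes what makes supervision effective in Latent Chain-of-Thought from an information-theoretic perspective. Identifies a "dual collapse" under outcome supervision: gradient decay and representation drift. Decomposes process supervision into two complementary dimensions: Trajectory Supervision (injecting dense step-by-step reasoning signals) and Space Supervision (preserving the semantic structure of the latent space through generative reconstruction). Proposes the Unified Latent Probe (ULP) to quantify the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal an Information-Performance Binding: reasoning accuracy is strictly bounded by the information fidelity retained in the latent chain.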
## Key Contributions
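1. Information-theoretic framework: formalizes Latent CoT supervision as a mutual-information maximization problem
2. Dual-collapse diagnosis: gradient decay + representation drift are the root cause of outcome supervision failing
3. Two-dimensional decomposition of process supervision: Trajectory Supervision × Space Supervision
4. ULP probe: quantifies the reasoning information recoverable from latent states
5. Information-Performance Binding: reasoning ability is strictly bounded by information fidelity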

@@ -0,0 +1,31 @@
---
title: "LongMemEval: Benchmarking Long-Term Interactive Memory (Raw Archive)"
created: 2026-06-25
updated: 2026-06-25
type: raw
tags: ["memory-benchmark", "chat-assistant", "long-term-memory"]
source: "https://arxiv.org/abs/2410.10813"
---
# LongMemEval — Raw Archive
## Metadata
- **Title**: LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
- **Authors**: Di Wu (UCLA), Hongwei Wang, Wenhao Yu (Tencent AI Lab Seattle), Yuwei Zhang (UC San Diego), Kai-Wei Chang (UCLA), Dong Yu (Tencent AI Lab Seattle)
- **Venue**: ICLR 2025
- **arXiv**: 2410.10813
- **Date**: 2024-10-14 (v1), 2025-03-04 (v2)
- **Category**: cs.CL
- **Code**: https://github.com/xiaowu0162/LongMemEval
## Abstract
Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. We introduce LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into three stages: indexing, retrieval, and reading.
## Key Contributions
1. First comprehensive memory benchmark covering five core abilities, including abstention
2. Unified three-stage memory framework (indexing → retrieval → reading) with four control points
3. Empirically validated design optimizations: round granularity, fact-augmented keys, time-aware query expansion
4. Two standard settings: S (~115k tokens) and M (~1.5M tokens)

@@ -0,0 +1,73 @@
---
title: "MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model"
created: 2026-06-20
source: "arXiv:2606.17800"
authors: "Lichen Bai, Tianhao Zhang, Shitong Shao, Dingwei Tan, Qiyu Zhong, Zhengpeng Xie, Haopeng Li, Qinghao Huang, Dandan Shen, Tengjiao Ji, Wei Wang, Peicheng Wu, Yuxuan Zhao, Xiangyu Zhu, Welly Luo, Shurui Yang, Zeke Xie"
venue: "arXiv preprint (cs.CV)"
date: "2026-06-16"
project: "https://mainecoon.tech/"
type: paper
---
# MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model
**Catnip AI Team** · arXiv:2606.17800 · 32 pages, 13 figures, 3 tables
## Abstract
As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked. We define the position of **social world models** and build MaineCoon as the first step — a 22B real-time audio-visual autoregressive model capable of streaming generation and sub-second interaction at up to **47.5 FPS** on a single GPU.
Key innovations:
- **Self-resampling**: exposes model to degraded self-history during training
- **Cross-modal representation alignment**: token relation distillation with V-JEPA 2
- **Domain-aware preference optimization**: multi-domain LoRA DPO experts
- **Reinforced online-policy distillation (ROPD)**: consolidates domain experts into one deployable policy
- **Agentic streaming inference**: training-free framework with planner/observer, cache manager, buffer controller
MaineCoon supports thousand-second-scale generation while mitigating drift, and sets SOTA on the new **SocialVideo Bench** (9 evaluation metrics).
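## Core Problem
Most video worldwide is consumed on social platforms, but existing video generation models (such as DiT diffusion models) have three major limitations:
1. **Offline, non-streaming**: bidirectional temporal attention prevents real-time output
2. **Audio is ignored**: speech, lip sync, and emotional resonance are central to social video
3. **No long-horizon stability**: content drifts in minute-scale autoregressive generation
## Methodology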
### Training Pipeline (Section 3)
- **Native Streaming AR Training (3.1)**: causal chunk-by-chunk autoregressive training; [[self-resampling|Self-Resampling]] lets the model adapt to the degraded history it produces itself
- **Cross-modal Representation Alignment (3.2)**: token relation distillation from a [[jepa|V-JEPA 2]] teacher speeds up training
- **Post-training (3.3)**: [[domain-aware-preference-optimization|Domain-Aware DPO]] trains domain experts, and [[reinforced-online-policy-distillation|ROPD]] merges the experts into a single policy
- **Step Distillation**: DMD-based four-step distillation for near-lossless fast inference
### Agentic Streaming Inference (Section 4)
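A training-free inference framework in which three controllers wrap a frozen generator:
- **[[agentic-streaming-inference|Director]] (Planner & Observer)**: a Gemma 4 26B agent writes the prompt stream and observes generation quality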
- **[[agentic-cache-manager|Cache Manager]]**: 管理 KV-cache 的 keep-set + drift control
- **[[look-ahead-buffer-controller|Buffer Controller]]**: 控制生成与播放之间的 lead
### Data Pipeline (Section 2)
- Synthetic data via LTX-2.3 teacher + director-style LM scenario planning (225 scenes × 15 styles × 12 shots)
- Real social video curation: SCRFD face detection → SyncNet lip-sync verification → quality filtering
- Daily processing capacity: on the order of 100,000 videos
## Key Results
- **47.5 FPS** on single H100 GPU
- **<$0.001 per second** generation cost
- **45 minutes** continuous streaming without measurable degradation
- SOTA on SocialVideo Bench across 9 metrics vs. 7 open-source baselines
- Training efficiency: <10K GPU hours, <1M clips
## Related Concepts
- [[social-world-model|Social world model]]
- [[self-resampling|Self-resampling]]
- [[reinforced-online-policy-distillation|ROPD]]
- [[agentic-streaming-inference|Agentic streaming inference]]
- [[agentic-cache-manager|Agentic cache management]]
- [[look-ahead-buffer-controller|Look-ahead buffer control]]
- [[forward-repair-ladder|Forward repair ladder]]
- [[socialvideo-bench|SocialVideo Bench]]
- [[audio-visual-representation-alignment|Audio-visual representation alignment]]
- [[domain-aware-preference-optimization|Domain-aware preference optimization]]

@@ -0,0 +1,40 @@
---
title: "Characterizing, Evaluating, and Optimizing Complex Reasoning (ME² + TRM)"
author: "Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng"
source: "arXiv 2602.08498v2"
date: "2026-02-09 (updated 2026-06-03)"
type: paper
venue: "ICML 2026 (cs.CL)"
tags: ["reasoning", "reward-model", "dag", "grpo", "test-time-scaling", "rl"]
code: "https://github.com/Simplified-Reasoning/TRM"
---
# Characterizing, Evaluating, and Optimizing Complex Reasoning
> Zhang, Li, Wang, Wang, Zhang, Qu, Cheng | SJTU / Shanghai AI Lab / CUHK / NJU / USTC / PKU
> ICML 2026 | arXiv:2602.08498v2 | cs.CL
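## Three Core Questions
1. **Q1**: What defines high-quality reasoning?
2. **Q2**: How can long, implicitly structured reasoning traces be evaluated reliably?
3. **Q3**: How can this evaluation signal be used to optimize reasoning?
## Core Approach
### The ME² Principle
Reasoning quality is characterized along two orthogonal axes:
- **Macro vs Micro**: global structural organization vs properties of local steps
- **Effectiveness vs Efficiency**: how well the reasoning works vs how much it costs
### DAG Modeling of Reasoning
A reasoning trace is abstracted as a directed acyclic graph (DAG) that explicitly models progression, branching, and merging. The DAG is a practical middle ground between a tree and a complete graph: it captures rich structure while keeping a topological order consistent with the generation order.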
### Thinking Reward Model (TRM)
- 基于 ME² + DAG pairwise evaluation 构建 TRM-Preference 数据集103K 训练对)
- 用 Bradley-Terry 目标训练轻量 TRMLlama-3.1-8B → scalar head
- 关键TRM 仅训练于 verified-correct reasoning 偏好对,与答案正确性监督解耦
### 优化信号
- Test-timeBest-of-N selection → +19.3%AIME24, Qwen3-8B
- TrainingTRM-guided GRPO with gated reward shaping → +3.9%
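
A minimal sketch of the Bradley-Terry objective and Best-of-N selection, assuming the TRM exposes one scalar score per reasoning trace. The function names and the `score_fn` callback are hypothetical and not taken from the paper's code; the gated reward shaping used with GRPO is not shown.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen trace beats the rejected one.

    Under Bradley-Terry, P(chosen > rejected) = sigmoid(score_chosen - score_rejected).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


def best_of_n(candidates, score_fn):
    """Return the candidate with the highest reward-model score.

    `score_fn` stands in for a trained TRM (hypothetical interface).
    """
    return max(candidates, key=score_fn)
```

In practice the loss would be averaged over a batch of preference pairs and minimized with a gradient-based optimizer; the scalar form here only illustrates the objective.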


@@ -0,0 +1,41 @@
---
title: "The Topological Trouble With Transformers"
source: arXiv
source_id: 2604.17121
authors:
- Michael C. Mozer (Google DeepMind)
- Shoaib Ahmed Siddiqui (Google DeepMind)
- Rosanne Liu (Google DeepMind)
published: 2026-04-18
updated: 2026-06-03
categories:
- cs.LG
- cs.AI
venue: Preprint
---
# The Topological Trouble With Transformers
## Abstract
Transformers encode structure in sequences via an expanding contextual history. However, their purely feedforward architecture fundamentally limits dynamic state tracking. State tracking—the iterative updating of latent variables reflecting an evolving environment—involves inherently sequential dependencies that feedforward networks struggle to maintain. Consequently, feedforward models push evolving state representations deeper into their layer stack with each new input step, rendering information inaccessible in shallow layers and ultimately exhausting the model's depth.
While this depth limit can be bypassed by dynamic depth models and by explicit or latent thinking that externalizes state representations, these solutions are computationally and memory inefficient. The authors argue that temporally extended cognition requires refocusing from explicit thought traces to implicit activation dynamics via recurrent architectures.
## Core Contributions
1. **Topological analysis** of why feedforward Transformers fundamentally cannot track state indefinitely
2. **Taxonomy of recurrent Transformer architectures** along two dimensions: recurrence axis (depth vs step) and input-tokens-per-recurrence-step ratio
3. **Identification of empty cells** in the taxonomy as promising research directions
4. **Critique of Chain-of-Thought as workaround** — it externalizes what should be implicit
5. **Roadmap** for enhanced SSMs, coarse recurrence, representational alignment, and efficient recurrence training
## Key Concepts
- state tracking, belief state, depth dilemma
- recurrent transformer architectures (depth/step/both)
- recurrence taxonomy: axis × ratio
- attractor dynamics, latent thought models
- enhanced state-space models (DeltaNet, RWKV-7, PaTH attention)
- representational alignment, coarse-grained recurrence
- sequential dependency, autoregressive unrolling
## URL
https://arxiv.org/abs/2604.17121


@@ -0,0 +1,90 @@
---
title: "RWKV-7 \"Goose\" with Expressive Dynamic State Evolution"
authors: ["Bo Peng", "Ruichong Zhang", "Daniel Goldstein", "Eric Alcaide", "et al."]
date: 2025-03-18
arxiv_id: "2503.14456v2"
categories: ["cs.CL", "cs.AI", "cs.LG"]
affiliations: ["RWKV Project (Linux Foundation AI & Data)", "EleutherAI", "Tsinghua University", "et al."]
paper_type: "preprint"
code: "https://github.com/RWKV/RWKV-LM"
models: "https://huggingface.co/RWKV"
---
# RWKV-7 "Goose" with Expressive Dynamic State Evolution
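## Abstract
RWKV-7 "Goose" is a new sequence-modeling architecture with constant memory usage and constant per-token inference time. Despite being trained on far fewer tokens than comparable top models, its 2.9B-parameter language model sets a new 3B SoTA on multilingual tasks and matches the current 3B SoTA on English downstream performance. Its core innovations are (1) a generalized delta rule with **vector-valued gating** and an **in-context learning rate**, and (2) a relaxed value replacement rule (decoupling the removal key from the addition key). Theoretically, RWKV-7 can perform state tracking and recognize **all regular languages**, exceeding the TC^0 limit of Transformers. The release also includes a 3.1T-token multilingual corpus and four pretrained models (0.19B-2.9B), all under Apache 2.0.
## Core Contributions
1. **Generalized delta rule**: extends DeltaNet's scalar delta rule to vector-valued gating and in-context learning rates
2. **Relaxed value replacement rule**: decouples the removal key (k_remove) from the addition key (k_add), allowing more flexible state updates
3. **Expressivity beyond TC^0**: proves that RWKV-7 can recognize all regular languages (NC^1); a single layer suffices for S5 state tracking
4. **Model upgrade method**: continues training from RWKV-5/6 checkpoints instead of pretraining from scratch, saving compute
5. **RWKV World v3 dataset**: a 3.1T-token open multilingual corpus
## Method Framework
### From DeltaNet to the Generalized Delta Rule
The classic delta rule (DeltaNet) takes the form: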
```
S_t = S_{t-1} - α · ∇l(S_{t-1}, k_t, v_t)
```
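
The gradient form above can be expanded into an explicit state update. The derivation below is a sketch that assumes the per-step loss is the squared error between the value and the state's readout for the key, with k_t and v_t treated as row vectors to match the notation of the RWKV-7 update later in this note; the note itself does not spell out the loss.

```latex
\begin{aligned}
\ell(S, k_t, v_t) &= \tfrac{1}{2}\,\lVert S k_t^{\top} - v_t^{\top} \rVert^2 \\
\nabla_S\, \ell &= (S k_t^{\top} - v_t^{\top})\, k_t \\
S_t &= S_{t-1} - \alpha\,(S_{t-1} k_t^{\top} - v_t^{\top})\, k_t
     = S_{t-1}\,(I - \alpha\, k_t^{\top} k_t) + \alpha\, v_t^{\top} k_t
\end{aligned}
```

Read this way, the RWKV-7 rule keeps the same "transition matrix plus outer-product write" shape, with diag(w_t) in place of I and κ̂_t^T (a_t ⊙ κ̂_t) in place of α k_t^T k_t.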
RWKV-7's generalized delta rule introduces three innovations:
**1. Vector-valued gating**
```
S_t = S_{t-1} · (diag(w_t) - κ̂_t^T (a_t ⊙ κ̂_t)) + v_t^T · k_t
```
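where w_t is the dynamic decay (flexible decay), a_t is the vector-valued in-context learning rate, and κ̂_t is the normalized key.
**2. Vector-valued in-context learning rate**
a_t is upgraded from a scalar to a d-dimensional vector, letting the model selectively replace state data **per channel**.
**3. Generalized eigenvalues**
The evolution matrix may have eigenvalues outside the [0, 1] interval → expressivity beyond standard SSMs.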
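
The update formula can be sketched directly in NumPy. This is an illustration of the single-step formula as written in this note, not the reference implementation: the function names are hypothetical, vectors are treated as row vectors, and the relaxed value replacement rule (separate removal and addition keys) is not modeled.

```python
import numpy as np


def rwkv7_transition(w, a, kappa_hat):
    """Transition matrix diag(w_t) - kappa_hat_t^T (a_t * kappa_hat_t).

    w, a, kappa_hat are 1-D arrays of length d_k; kappa_hat is assumed unit-norm.
    """
    return np.diag(w) - np.outer(kappa_hat, a * kappa_hat)


def rwkv7_state_update(S, w, a, kappa_hat, k, v):
    """One step of S_t = S_{t-1} (diag(w) - kappa_hat^T (a * kappa_hat)) + v^T k.

    S has shape (d_v, d_k); k has length d_k and v has length d_v.
    """
    return S @ rwkv7_transition(w, a, kappa_hat) + np.outer(v, k)


# Example values (arbitrary): decay 0.6 on every channel, full replacement along one key.
d_k = 3
A = rwkv7_transition(
    w=np.full(d_k, 0.6),
    a=np.ones(d_k),
    kappa_hat=np.array([1.0, 0.0, 0.0]),
)
eigenvalues = np.linalg.eigvals(A).real  # one eigenvalue is negative here
```

With these example values the eigenvalue along the key direction is 0.6 - 1 = -0.4, which lies outside [0, 1] and illustrates the third point above.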
### Comparison with Other Architectures
| Architecture | Large state | Flexible decay | Dynamic dependence | Generalized eigenvalues |
|------|--------|---------|---------|----------|
| RWKV-4 | ✗ | ✗ | ✗ | ✗ |
| Mamba | ✗ | ✓ | ✓ | ✗ |
| RWKV-6 / GLA | ✗ | ✓ | ✓ | ✗ |
| Gated DeltaNet | ✓ | ✗ | ✓ | ✓ |
| **RWKV-7** | ✓ | ✓ | ✓ | ✓ |
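### Theoretical Breakthrough
RWKV-7 is the **first parallelizable, trainable RNN architecture proven to exceed TC^0** (under the conjecture TC^0 ≠ NC^1):
- A single layer solves S5 state tracking (an NC^1 problem)
- A constant number of layers recognizes any regular language
- The standard Transformer is confined to TC^0
## Experimental Results
- **2.9B multilingual**: multilingual SoTA at the 3B scale; English matches the current 3B SoTA
- **Training efficiency**: far fewer training tokens than models of the same scale
- **Long context**: constant memory; per-token inference cost does not grow with sequence length
- **Associative recall**: significantly outperforms RWKV-6 on synthetic tasks
## Key Concepts
- [[delta-rule]] → [[generalized-delta-rule]]: evolution path of the delta rule
- [[vector-valued-gating]]: RWKV-7's vector-valued gating mechanism
- [[in-context-learning-rate]]: per-channel in-context learning rate
- [[dynamic-state-evolution]]: dynamic state evolution mechanism
- [[token-shift]]: the RWKV family's time-mixing trick
- [[regular-language-recognition]]: theoretical breakthrough, recognizing all regular languages
- [[wkv-time-mixing]]: RWKV-7's WKV time-mixing mechanism
## References
- Code: https://github.com/RWKV/RWKV-LM
- Models: https://huggingface.co/RWKV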
- DeltaNet (Schlag et al., 2021)
- RWKV-6 / Finch (Peng et al., 2024)


@@ -0,0 +1,40 @@
---
title: "The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs"
author: "Xi Fang*, Weijie Xu*, Yuchong Zhang, Stephanie Eckman, Scott Nickleach, Chandan K. Reddy (Amazon)"
source: "arXiv 2510.09905v2"
date: "2025-10-10 (updated 2026-06-16)"
type: paper
venue: "arXiv (cs.AI, cs.CL)"
tags: ["personalization", "memory", "emotional-intelligence", "bias", "social-capital", "dpo"]
code: "https://github.com/personalization-trap"
dataset: "Datasets Repository"
---
# The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs
> Xi Fang*, Weijie Xu*, Yuchong Zhang, Stephanie Eckman, Scott Nickleach, Chandan K. Reddy
> Amazon | arXiv:2510.09905v2 | cs.AI / cs.CL
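## Core Question
When an AI assistant remembers that "Sarah is a single mother working two jobs," does it interpret her stress differently than when it remembers that "Sarah is a wealthy executive"? Personalized AI systems increasingly incorporate long-term user memory, yet how this affects emotional reasoning has not been studied.
## Method
1. **User profile generation**: based on Bourdieu's social capital framework, 30 base profiles each yield an advantaged and a disadvantaged version, plus 81 intersectional profiles (gender × age × religion × ethnicity)
2. **Emotional understanding evaluation**: STEU (42 emotion-recognition scenarios) + a modified STEM (44 first-person emotional-advice scenarios), validated by human experts to remove profile-sensitive items
3. **Statistical modeling**: mixed-effects models estimate demographic effects
## Key Findings
**Finding 1**: User memory systematically affects emotional understanding. 11 of 15 models deviate significantly from the no-memory baseline. Claude 3.7 Sonnet: advantaged profiles 80.10% vs disadvantaged profiles 77.37% (p<0.05).
**Finding 2**: Demographic bias is significant. Muslim, non-binary, and 65+ profiles score lower. Claude 3.7 gives significantly worse emotional advice to female/non-binary users than to male users. The direction of bias varies by model, however, with no uniform pattern.
**Finding 3**: "Thinking" models show less bias than their standard versions, but bias persists on the emotional-advice task.
**Finding 4**: DPO training on a carefully curated preference dataset (500 samples) reduces the influence of bias while preserving general capabilities (Gemma-2-2B Bias Influence drops from 5.50% to -2.30%).
## Core Insight
"A memory that remembers who you are should never decide how much it cares about you." Personalization can inadvertently encode social hierarchy into an AI's emotional reasoning.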


@@ -0,0 +1,59 @@
---
title: "Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction"
authors: ["Ziyao Tang", "Pengkun Jiao", "Xinhang Chen", "Wei Liu", "Shiyong Li", "Jingjing Chen"]
date: 2026-02-09
arxiv_id: "2602.08585v2"
categories: ["cs.LG", "cs.AI"]
venue: "ICML 2026"
affiliations: ["Fudan University", "Baidu Inc. (Baige AI Team)"]
paper_type: "conference"
---
# Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction
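## Abstract
The linear memory growth of the KV cache is a core bottleneck for long-context inference in large models. Existing KV cache eviction methods rely on instantaneous heuristic metrics and assume that attention scores are a consistent proxy for importance across all heads. However, attention heads are heterogeneous in predictive fidelity: some heads emphasize immediate contribution, while others capture long-horizon utility. This paper proposes the LU-KV framework, which formulates head-level budget allocation as a global combinatorial optimization problem, obtains a near-optimal solution through convex-hull relaxation and a marginal-utility greedy solver, and designs an offline profiling protocol to support practical deployment. On LongBench and RULER it achieves minimal performance loss at an 80% KV cache compression rate.
## Core Contributions
1. Identifies a key gap (the optimality gap) between heuristic importance metrics and long-horizon marginal utility
2. Formalizes budget allocation as a long-term utility maximization problem and proposes a convex-hull relaxation + marginal-utility greedy solver
3. Designs a data-driven offline profiling protocol that makes the theoretical optimization deployable in real inference
4. Metric-agnostic: compatible with multiple intra-head scoring methods such as SnapKV, KeyDiff, CAKE, and KVZip
## Key Concepts
- [[oracle-importance]]: oracle importance, based on a token's maximum potential contribution to the output vector within the future decoding window
- [[optimality-gap]]: the optimality gap between heuristic metrics and the oracle metric
- [[long-horizon-utility]]: long-horizon utility, as distinct from instantaneous attention scores
- [[global-combinatorial-optimization]]: combinatorial-optimization formulation of global budget allocation
- [[convex-hull-relaxation]]: convex relaxation of discrete loss sequences via isotonic-regression methods such as PAVA
- [[marginal-utility]]: marginal utility, which drives the greedy allocation strategy
- [[offline-profiling]]: three-stage offline calibration (synthetic context → oracle computation → profile aggregation)
## Experimental Results
- LongBench: at an 80% compression rate, LU-KV outperforms baselines such as Uniform, PyramidKV, and AdaKV across the board on Llama-3.1-8B, Mistral-7B, and Qwen2.5-32B
- RULER: maintains retrieval robustness across extended context windows from 4K to 128K
- Offline profiles transfer with high consistency across tasks (transferability)
- Compatible with multiple intra-head metrics such as SnapKV, KeyDiff, CAKE, and KVZip
## Method Framework
LU-KV follows a two-stage paradigm:
1. **Intra-head scoring**: score and rank tokens with any heuristic metric π
2. **Cross-head budget allocation**: determine each head's optimal budget b_{,h} via global combinatorial optimization
Core decomposition: `Eviction Loss = Oracle Metric Loss + Optimality Gap Loss`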
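
The cross-head allocation step can be sketched as a marginal-utility greedy loop. This is a minimal illustration, not the paper's solver: it assumes each head comes with a profiled loss curve that is already convex and non-increasing in its budget (the role the convex-hull relaxation plays in the note), and the names `loss_curves` and `greedy_budget_allocation` are hypothetical.

```python
import heapq


def greedy_budget_allocation(loss_curves, total_budget):
    """Distribute `total_budget` cache slots across heads by marginal utility.

    loss_curves[h][b] is the (assumed convex, non-increasing) loss of head h
    when it keeps b tokens. Returns one budget per head.
    """
    budgets = [0] * len(loss_curves)
    heap = []  # entries are (-marginal_gain, head_index); heapq is a min-heap
    for h, curve in enumerate(loss_curves):
        if len(curve) > 1:
            heapq.heappush(heap, (-(curve[0] - curve[1]), h))

    remaining = total_budget
    while remaining > 0 and heap:
        _, h = heapq.heappop(heap)  # head whose next slot reduces loss the most
        budgets[h] += 1
        remaining -= 1
        b, curve = budgets[h], loss_curves[h]
        if b + 1 < len(curve):
            heapq.heappush(heap, (-(curve[b] - curve[b + 1]), h))
    return budgets
```

When every curve is convex, marginal gains within a head shrink as its budget grows, so always taking the largest remaining gain is optimal for this separable problem; that is the reason the relaxation step comes before the greedy pass.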
## References
- SnapKV (Li et al., 2024)
- H2O (Zhang et al., 2023)
- PyramidKV (Cai et al., 2024)
- AdaKV (Feng et al., 2026b)
- KeyDiff (Park et al., 2025)
- CriticalKV (Feng et al., 2025)
- KVZip (Kim et al., 2026)
- CAKE (Qin et al., 2025)


@@ -0,0 +1,45 @@
---
title: "Unlimited OCR Works: Welcome the Era of One-shot Long-horizon Parsing"
author: "Youyang Yin, Huanhuan Liu*, YY†, et al. (Baidu Inc.)"
source: "arXiv 2606.23050"
date: "2026-06-22"
type: paper
venue: "arXiv (cs.CV, cs.CL)"
tags: ["ocr", "attention-mechanism", "long-horizon", "kv-cache", "r-swa", "end-to-end"]
code: "https://github.com/baidu/Unlimited-OCR"
---
# Unlimited OCR Works
> Youyang Yin, Huanhuan Liu*, YY†, Qunyi Xie, Chaorun Liu, Shiqi Yang, Shaohua Wang, Zhanlong Liu, Hao Zou, Jinyue Chen, Shu Wei, Jingjing Wu, Mingxin Huang, Zhen Wu, Guibin Wang, Tengyu Du, Lei Jia
> Baidu Inc. | arXiv:2606.23050 | Jun 2026
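## Core Problem
Existing end-to-end OCR models (such as DeepSeek OCR) use an LLM as the decoder and exploit language priors to improve accuracy, but at a cost: as the output sequence grows, the KV cache expands linearly and inference speed keeps falling. Humans show no such efficiency drop in long transcription tasks, which points to a fundamental architectural bottleneck.
## Core Approach: Reference Sliding Window Attention (R-SWA)
The paper proposes **R-SWA**, an attention mechanism that mimics human working memory during parsing:
1. Each generated token attends to all reference tokens (visual tokens + prompt) plus the preceding n output tokens (default n=128)
2. Reference tokens do not take part in state transitions, which keeps visual features from gradually blurring
3. The KV cache stays at a constant size of Lm + n and does not grow with decoding length
4. Inference speed (TPS) and GPU memory stay constant throughout decoding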
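
The attention pattern described in the list above can be sketched as a boolean mask. This is an illustration under stated assumptions rather than the model's implementation: `rswa_mask` is a hypothetical helper, reference tokens are placed first in the sequence, and the sliding window is taken to cover the `window` most recent output positions including the current one.

```python
import numpy as np


def rswa_mask(num_ref: int, num_out: int, window: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if output token i may attend to position j.

    Positions 0..num_ref-1 are reference tokens (visual tokens + prompt);
    positions num_ref.. are output tokens.
    """
    mask = np.zeros((num_out, num_ref + num_out), dtype=bool)
    mask[:, :num_ref] = True  # every output token sees all reference tokens
    for i in range(num_out):
        lo = max(0, i - window + 1)
        mask[i, num_ref + lo : num_ref + i + 1] = True  # recent outputs only
    return mask


mask = rswa_mask(num_ref=4, num_out=10, window=3)
visible_per_step = mask.sum(axis=1)  # never exceeds num_ref + window
```

The per-row count of visible positions is bounded by `num_ref + window`, which mirrors the constant cache size Lm + n stated in the list.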
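## Key Results
- Uses DeepSeek OCR as the baseline and replaces all decoder attention with R-SWA
- OmniDocBench v1.5: **93% Overall**, 6pp higher than the DeepSeek OCR baseline
- OmniDocBench v1.6: on par with SOTA (93.54%)
- Long-horizon parsing (books of 2-40+ pages): Distinct-n > 96%, Edit Distance < 0.11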
- Inference efficiency: at 6000 tokens, TPS is 35% higher than DeepSeek OCR
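- 3B parameters, MoE architecture with only 500M active
## Limitations
Parsing is bounded by the prefill length (currently 32K), so it is not truly unlimited. Short-term direction: train with a 128K context. Long-term direction: build a prefill pool that simulates page turning.
## Generality
R-SWA is a general parsing attention mechanism; beyond OCR it also applies to reference-based long-horizon tasks such as ASR and translation.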


@@ -0,0 +1,41 @@
---
title: "VLA-JEPA: Enhancing VLA with Latent World Model"
author: "Jingwen Sun*, Wenyao Zhang*, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin†, Zhibo Chen†"
source: "arXiv 2602.10098v2"
date: "2026-02-10 (updated 2026-02-14)"
type: paper
venue: "arXiv (cs.RO, cs.CV)"
tags: ["vla", "jepa", "world-model", "robot-learning", "pretraining", "latent-action"]
code: "https://github.com/ginwind/VLA-JEPA/"
---
# VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
> Sun*, Zhang*, Qi, Ren, Liu, Zhu, Sun, Jin†, Chen†
> USTC / SJTU / Tsinghua / EIT / UCAS / Nankai | arXiv:2602.10098v2 | cs.RO / cs.CV
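## Core Problem
Current latent-action pretraining objectives for VLAs learn the wrong thing: they anchor on pixel changes rather than action-relevant state transitions, which leads to four failure modes:
1. Pixel-level objectives favor appearance over action semantics
2. In real-world video, camera motion and background changes dominate the signal
3. Information leakage collapses the latent action into a shortcut (encoding the future rather than transition dynamics)
4. Multi-stage training pipelines are complex and fragile
## Core Approach: Leakage-free State Prediction
VLA-JEPA brings the JEPA paradigm to VLA pretraining:
- The target encoder produces latent targets from future frames, used only as supervision and never as input
- The student sees only the current observation
- Prediction happens in latent space (not pixel space), which is inherently robust to camera motion and background changes
- A simple two-stage recipe: JEPA pretraining → action-head fine-tuning
Architecture: Qwen3-VL-2B (VLM backbone) + V-JEPA2 encoder (world model) + Flow-Matching action head
## Key Results
- **LIBERO**: SOTA average success rate; best on 2 of 4 task suites
- **SimplerEnv**: highest average success rate on Google Robot; second on WidowX
- **LIBERO-Plus**: strong robustness across 7 perturbation dimensions
- **Data efficiency**: better performance with far less training data than the compared methods
- **Real-world Franka**: validated successfully on a real robot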


@@ -0,0 +1,45 @@
---
title: "Fisher Width: A Geometric Measure of Complexity on Statistical Manifolds"
source_id: "arXiv:2606.18306v1"
authors:
- "Vu Khac Ky"
affiliations: "Department of Mathematics, FPT University, Vietnam"
date: 2026-06-16
categories: ["cs.LG", "stat.ML"]
pages: 48
figures: 3
url: "https://arxiv.org/abs/2606.18306v1"
---
# Fisher Width: A Geometric Measure of Complexity on Statistical Manifolds
**Authors**: Vu Khac Ky (FPT University, Vietnam)
**arXiv**: 2606.18306v1 | **Date**: 2026-06-16
**Categories**: cs.LG (Machine Learning), stat.ML (Machine Learning)
**48 pages, 3 figures**
## Abstract
Gaussian width is a central geometric complexity measure in high-dimensional probability, compressed sensing, convex optimization, and learning theory. It quantifies the average extent of a set along random directions, thereby capturing the effective dimension of constraint sets, hypothesis classes, and descent cones. However, this notion is intrinsically Euclidean. Statistical models instead carry a natural Riemannian geometry induced by the Fisher information metric, where directions are scaled according to statistical distinguishability rather than ambient Euclidean length.
We introduce **Fisher width**, a Fisher-geometric analogue of Gaussian width for statistical manifolds. At a parameter point θ, Fisher width replaces the Euclidean identity by the local metric tensor G(θ)^{1/2}, measuring the Gaussian width of the Fisher-rescaled set. This makes the resulting quantity sensitive to local statistical curvature and invariant under smooth reparameterizations.
We develop the basic theory of Fisher width, showing that it retains key structural features of Gaussian width, including concentration, metric perturbation stability, and spectral comparison bounds with the Euclidean baseline, while also capturing anisotropic geometric effects invisible to Euclidean measures. As an application, we prove a generalization bound for Fisher-Lipschitz hypothesis classes and propose computable estimators, which we evaluate empirically on MNIST across three model classes.
Fisher width is to statistical manifolds what Gaussian width is to Euclidean convex bodies. This work lays the foundation for studying complexity and learning on curved statistical manifolds.
## Key Contributions
1. **Fisher Width Definition**: Introduces Fisher width as a local Fisher-geometric analogue of Gaussian width, with the lifting identity w_G(T;θ) = w(G(θ)^{1/2} T) and reparameterization invariance.
2. **Structural Theory**: Concentration inequalities, algebraic properties, spectral comparison bounds, and stability under metric perturbations.
3. **Generalization Bound**: For Fisher-Lipschitz hypothesis classes, uniform deviation controlled by w_G(TT;θ₀)/√n, with tightness proof for exponential-family models.
4. **Practical Estimators**: Empirical Fisher, randomized low-rank approximation, and score-based sampling, validated on MNIST (logistic/softmax/ridge regression).
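
The lifting identity w_G(T;θ) = w(G(θ)^{1/2} T) suggests a direct numerical check for a finite set T. The sketch below is a plain Monte Carlo estimate over Gaussian directions, not one of the paper's estimators (empirical Fisher, low-rank, score-based); the function names and the sample count are assumptions made for illustration.

```python
import numpy as np


def gaussian_width(points: np.ndarray, num_samples: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of w(T) = E sup_{t in T} <g, t> for a finite set T.

    `points` holds one element of T per row.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((num_samples, points.shape[1]))
    return float(np.mean(np.max(g @ points.T, axis=1)))


def psd_sqrt(G: np.ndarray) -> np.ndarray:
    """Symmetric square root of a positive semidefinite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(G)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def fisher_width(points: np.ndarray, G: np.ndarray, num_samples: int = 2000, seed: int = 0) -> float:
    """Gaussian width of the Fisher-rescaled set G^{1/2} T (lifting identity)."""
    root = psd_sqrt(G)
    # G^{1/2} is symmetric, so rescaling each row t gives t @ root.
    return gaussian_width(points @ root, num_samples, seed)
```

With G equal to the identity this reduces to the ordinary Gaussian width, and scaling G by a constant c scales the result by √c, which gives two quick sanity checks for any implementation.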
## Key Concepts
- [[gaussian-width|Gaussian Width]] — Euclidean foundational complexity measure
- [[statistical-manifold|Statistical Manifold]] — Riemannian manifold with Fisher metric
- [[fisher-information-metric|Fisher Information Metric]] — Local metric tensor G(θ)
- [[fisher-lipschitz|Fisher-Lipschitz]] — Hypothesis class with Fisher-geometric smoothness
- [[lifting-identity|Lifting Identity]] — w_G(T;θ) = w(G(θ)^{1/2} T)
- [[empirical-fisher|Empirical Fisher]] — Score-based computation of Fisher information


@@ -0,0 +1,18 @@
# Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
- **arXiv**: 2606.25041
- **Published**: 2026-06-23
- **Authors**: Lianghua Huang, Zhifan Wu, Wei Wang, Yupeng Shi, Mengyang Feng, Junjie He, Chenwei Xie, Yu Liu, Jingren Zhou, Ang Wang, Bang Zhang, Baole Ai, Chen Liang, Cheng Yu, Chongyang Zhong, Jinwei Qi, Kai Zhu, Pandeng Li, Peng Zhang, Wenyuan Zhang, Xinhua Cheng, Yitong Huang, Yun Zheng, Zoubin Bi (Wan Team, Alibaba Group)
- **Categories**: cs.CV, cs.AI, cs.GR, cs.SD
- **Website**: https://wan-streamer.com
- **Source**: https://arxiv.org/abs/2606.25041
## Abstract
Wan-Streamer is a native-streaming, end-to-end interactive foundation model for real-time, low-latency, full-duplex audio-visual interaction. It models language, audio, and video as both input and output within a single Transformer using block-causal attention for incremental streaming. Unlike cascaded systems relying on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer jointly learns perception, reasoning, generation, response timing, turn management, and cross-modal synchronization within one unified model, reducing pipeline latency and error accumulation. Streaming units are as short as 160 ms at 25 fps, with ~200 ms model-side response latency and ~550 ms total interaction latency.
## Key Contributions
1. End-to-end multimodal interactive foundation model — language, audio, video as both input and output in one Transformer
2. Fully causal multimodal architecture: causal audio/video VAEs, causal encoders/decoders, block-causal attention, full-history autoregressive streaming
3. Thinker-performer inference pipeline with KV-cache exchange, ~200 ms model-side latency, ~550 ms total
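
Block-causal attention, named in the contributions above, can be illustrated with a small mask builder. This is a generic sketch of the technique, not Wan-Streamer's implementation: the token-to-block mapping and `block_size` are assumptions. The helper `frames_per_unit` only restates the abstract's arithmetic, where a 160 ms unit at 25 fps spans 4 video frames.

```python
import numpy as np


def block_causal_mask(num_tokens: int, block_size: int) -> np.ndarray:
    """mask[i, j] is True when token i may attend to token j.

    Tokens attend bidirectionally within their own block (streaming unit)
    and causally to all earlier blocks.
    """
    block_ids = np.arange(num_tokens) // block_size
    return block_ids[:, None] >= block_ids[None, :]


def frames_per_unit(unit_ms: float = 160.0, fps: float = 25.0) -> float:
    """Number of video frames covered by one streaming unit."""
    return unit_ms * fps / 1000.0
```

Because a block never sees later blocks, keys and values of finished blocks can be cached and reused as new units stream in.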


@@ -0,0 +1,51 @@
---
title: "ACE-Router: Generalizing History-Aware Routing from MCP Tools to the Agent Web"
created: 2026-06-19
updated: 2026-06-19
type: paper-raw
source: https://arxiv.org/abs/2601.08276
arxiv_id: 2601.08276
version: v2
---
# ACE-Router: Generalizing History-Aware Routing from MCP Tools to the Agent Web
**Authors**: Zhiyuan Yao (ZJU), Zishan Xu (SJTU), Yifu Guo (SYSU), Zhiguang Han (NTU), Cheng Yang (HDU), Shuo Zhang, Weinan Zhang (SJTU), Xingshan Zeng, Weiwen Liu (Huawei)
**Published**: 2026-01-13 (v2: 2026-04-19)
**Venue**: arXiv:2601.08276 (cs.AI)
**Code**: https://github.com/euyis1019/ACE-Router
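## Core Insight
ACE-Router reframes MCP tool selection as the problem of **training a history-aware router**: instead of static matching with embeddings, the router understands multi-turn dialogue history to make precise, context-aware routing decisions.
## Three Stages
### 1. Candidate Graph + Self-Evolutionary Mutation
- Builds a candidate graph from semantic similarity (threshold τ=0.82)
- Five mutation operators: Function Enhancement, Parameter Mutation, Workflow Chaining, Helper Operation, Usage Extension
- 627 initial tools → 2005 tools (expanded through mutation)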
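
The candidate-graph step can be sketched as thresholded cosine similarity over tool embeddings. This is a minimal illustration under assumptions: `build_candidate_graph` is a hypothetical name, the similarity is taken to be cosine, embeddings are assumed non-zero, and only the threshold τ=0.82 comes from the note.

```python
import numpy as np


def build_candidate_graph(embeddings: np.ndarray, tau: float = 0.82) -> dict:
    """Connect tools whose embedding cosine similarity is at least `tau`.

    `embeddings` holds one non-zero embedding per row. Returns an adjacency
    list mapping each tool index to the indices of its neighbours.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    n = len(embeddings)
    return {i: [j for j in range(n) if j != i and sim[i, j] >= tau] for i in range(n)}
```

Trajectories for the later synthesis stage would then be sampled by walking this adjacency list, so the threshold controls how densely related tools are linked.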
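### 2. Trajectory Synthesis (multi-agent simulation)
- Samples from the candidate graph (random walk, DFS)
- Four-role simulation: Planner Agent + User Agent + Assistant Agent + Tool Agent
- Environment-agnostic design: no real APIs needed; an LLM simulates execution results
- Yields 15,092 history-aware routing training samples
### 3. Light Routing Agent (LRA)
- Only two tools: router_invoke + tool_execute
- Decouples routing decisions from task execution
- Pluggable: fits both tool routing and agent routing
## Key Results
| Method | MCP-Universe | MCP-Mark |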
|------|:---:|:---:|
| Text-Emb-3-Large (Q) | ~40.95% | ~29.89% |
| ReAct (Gemini-2.5-Pro) | ~41.80% | ~50.00% |
| GPT-4o Router | ~47.41% | ~48.00% |
| **ACE-Router (Qwen3-8B)** | **53.44%** | **60.00%** |
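- Expanded candidate pool: ReAct 41.80→36.47%; ACE-Router holds at 53.02%
- Noisy environment: GPT-4o 28% / Gemini 32%; ACE-Router holds at 56%
- Multi-agent generalization: without extra training, the router generalizes directly to agent routing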


@@ -0,0 +1,53 @@
---
title: "A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications"
created: 2026-06-19
updated: 2026-06-19
type: paper-raw
source: https://arxiv.org/abs/2605.07358
arxiv_id: 2605.07358
version: v3
---
# A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
**Authors**: Yingli Zhou, Shu Wang, Yaodong Su, Wenchuan Du, Yixiang Fang, Xuemin Lin
**Affiliation**: The Chinese University of Hong Kong, Shenzhen
**Published**: 2026-05-08 (v3: 2026-05-26)
**Venue**: arXiv:2605.07358 (cs.IR)
**Resources**: https://github.com/JayLZhou/Awesome-Agent-Skills
## Abstract
LLM-based agents that reason, plan, and act through tools, memory, and structured interaction are emerging as a promising paradigm for automating complex workflows. This survey examines the challenge through the lens of **agent skills**, defined as reusable procedural artifacts that coordinate tools, memory, and runtime context under task-specific constraints. Agents handle high-level reasoning and planning, while skills form the operational layer that enables reliable, reusable, and composable execution.
The literature is organized around four stages of the agent skill lifecycle: **representation**, **acquisition**, **retrieval**, and **evolution**. The paper also discusses open challenges in quality control, interoperability, safe updating, and long-term capability management.
## Key Contributions
1. Identifies agent skills as a foundational component of LLM agent ecosystems, characterizing their role in bridging the **procedural gap** between raw tool access and robust task execution.
2. Organizes research around four lifecycle stages with representative methods in each.
3. Summarizes agent skills platforms (SkillNet, ClawHub, SkillHub, SkillsMP, Skills.sh), application scenarios, and open challenges.
## Formal Definition
A skill is a tuple **S = (M, R, C)**:
- **M**: root instruction document
- **R**: auxiliary resources (references, templates, scripts)
- **C**: applicability conditions (metadata, descriptions, embeddings)
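
The tuple S = (M, R, C) maps naturally onto a small record type. The sketch below is one possible encoding, not a schema from the survey: the field names, the `tags` key inside the conditions, and the `is_applicable` check are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A skill S = (M, R, C)."""

    instruction: str  # M: root instruction document
    resources: dict = field(default_factory=dict)  # R: references, templates, scripts
    conditions: dict = field(default_factory=dict)  # C: metadata, descriptions, embeddings


def is_applicable(skill: Skill, task_tags: set) -> bool:
    """Toy applicability check: every tag the skill requires must be present.

    A skill with no required tags is treated as always applicable.
    """
    required = set(skill.conditions.get("tags", []))
    return required.issubset(task_tags)


pdf_skill = Skill(
    instruction="Extract tables from a PDF report.",
    resources={"template": "table_schema.json"},
    conditions={"tags": ["pdf", "tables"]},
)
```

A real retrieval stage would more likely match on descriptions or embeddings stored in C; the tag subset test only shows where applicability conditions plug in.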
## Taxonomy at a Glance
| Stage | Categories |
|-------|-----------|
| Representation | Text-Based, Code-Backed, Hybrid-Based |
| Acquisition | Human-Derived, Experience-Derived, Task-Derived, Corpus-Derived |
| Retrieval | Dense Embedding, Sparse/Keyword, Generative, Structure-Aware (Hierarchical + Dependency Graph) |
| Selection | Context-Aware, Skill Composition, Cost/Utility-Aware, Feedback-Driven |
| Evolution | Skill Revision, Skill Validation, Policy Coupling, Repository Evolution, Runtime Governance |
## Open Challenges
- **Acquisition**: Abstraction quality, weak trigger specification, resource drift, admission quality at scale
- **Retrieval**: Scalable skill libraries, constraint-aware composition, multi-objective selection, execution-centric evaluation
- **Evolution**: Coarse artifact-level evaluation, asymmetric revision (add > rewrite/retire), weakly specified repository governance, confounded gains
- **Future**: Unified skill schema, resource-aware joint optimization, lifecycle-level robustness, causality-driven skill diagnosis