20260514: Add new content

2026-05-14 13:54:52 +08:00
parent 56c4d3ef7c
commit b116710e4c
294 changed files with 10682 additions and 255 deletions
--- a/raw/papers/bartoldson-tba-2025.md
+++ b/raw/papers/bartoldson-tba-2025.md
@@ -0,0 +1,43 @@
+---
+title: "Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training"
+authors: ["Brian Bartoldson", "Siddarth Venkatraman", "James Diffenderfer", "Moksh Jain", "Tal Ben-Nun", "Seanie Lee", "Minsu Kim", "Johan Obando-Ceron", "Yoshua Bengio", "Bhavya Kailkhura"]
+year: 2025
+arxiv: "2503.18929"
+venue: "NeurIPS 2025"
+institutions: ["Lawrence Livermore National Laboratory", "Mila", "Université de Montréal", "KAIST", "CIFAR"]
+type: "paper"
+created: 2026-05-12
+tags: ["reinforcement-learning", "llm-post-training", "gflownet", "asynchronous-rl", "off-policy"]
+---
+
+# Trajectory Balance with Asynchrony
+
+**Authors**: Brian Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
+**Venue**: NeurIPS 2025
+**arXiv**: [2503.18929](https://arxiv.org/abs/2503.18929)
+**Code**: https://github.com/bbartoldson/TBA
+
+## Abstract
+
+TBA introduces an asynchronous RL framework for LLM post-training using the off-policy Trajectory Balance (TB) objective from GFlowNets. By decoupling Searcher nodes (exploration) from Trainer nodes (policy updates) and using a replay buffer, TBA achieves 4×–50× speedups while matching or exceeding on-policy baselines (PPO, GRPO, RLOO, Online DPO).
+
+## Key Findings
+
+- TB objective enables principled off-policy learning, resistant to the staleness of asynchronous data
+- Recency + reward sampling from replay buffer balances exploration and exploitation
+- TBA creates new Pareto frontiers: KL vs. win-rate (preference tuning), diversity vs. toxicity (red-teaming)
+- On MATH with Qwen 2.5 7B, TBA′ outperforms Dr. GRPO in highly off-policy settings
+- Scaling searchers improves red-teaming performance (more attacks + more diverse)
+
+## Key Concepts
+
+- [[tba|TBA]] — Trajectory Balance with Asynchrony framework
+- [[trajectory-balance-objective]] — Off-policy TB objective from GFlowNets
+- [[asynchronous-rl-llm]] — Decoupling exploration from learning
+- [[off-policy-llm-post-training]] — Training on off-policy data
+- [[gflownet-fine-tuning]] — GFlowNets for LLM fine-tuning
+- [[replay-buffer-rl-llm]] — Replay buffer in LLM RL
+- [[searcher-trainer-decoupling]] — Architecture pattern
+- [[reward-recency-sampling]] — TBA's sampling strategy
+- [[grpo]] — On-policy baseline
+- [[rlvr-unified-framework]] — RLVR training paradigm
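The off-policy Trajectory Balance objective summarized in the TBA note above can be illustrated numerically. This is a minimal sketch, not the TBA implementation: it assumes the KL-regularized form in which the target policy is proportional to `pi_ref(y|x) * exp(r(x, y) / beta)`, and it estimates `log Z` from the batch mean (a VarGrad-style choice). The function name `tb_vargrad_loss` and all parameter names are hypothetical.

```python
from statistics import fmean


def tb_vargrad_loss(logp_policy, logp_ref, rewards, beta=1.0):
    """Squared trajectory-balance residual for responses to ONE prompt.

    Assumption (not taken from the TBA code): residual is
        log Z + log pi_theta(y|x) - log pi_ref(y|x) - r(x, y) / beta
    with log Z estimated as the batch mean of the per-sample estimates.
    """
    # Each sample gives its own estimate of log Z.
    log_z_samples = [
        lr + r / beta - lp
        for lp, lr, r in zip(logp_policy, logp_ref, rewards)
    ]
    log_z = fmean(log_z_samples)
    # Residual is zero for every sample iff the policy matches the target.
    residuals = [log_z - s for s in log_z_samples]
    return fmean([d * d for d in residuals])


# Policy equal to the target up to a constant: loss is (numerically) zero.
ref = [-3.0, -5.0, -4.0]
rew = [1.0, 0.0, 2.0]
matched = [lr + r / 0.5 - 0.7 for lr, r in zip(ref, rew)]
print(tb_vargrad_loss(matched, ref, rew, beta=0.5))

# A policy that ignores the reward has a positive loss.
print(tb_vargrad_loss(ref, ref, rew, beta=0.5))
```

Because the loss depends only on log-probabilities and rewards of stored samples, it can be evaluated on stale replay-buffer data, which is the property the note attributes to TB.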
--- a/raw/papers/dai-mathforge-2026.md
+++ b/raw/papers/dai-mathforge-2026.md
@@ -0,0 +1,60 @@
+---
+title: "Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation"
+authors: ["Yanqi Dai", "Yuxiang Ji", "Xiao Zhang", "Yong Wang", "Xiangxiang Chu", "Zhiwu Lu"]
+year: 2026
+arxiv: "2601.20614"
+venue: "ICLR 2026"
+institutions: ["Renmin University", "AMAP Alibaba Group", "Xiamen University", "Dalian University of Technology"]
+type: "paper"
+created: 2026-05-12
+tags: ["mathematical-reasoning", "reinforcement-learning", "grpo", "difficulty-aware", "data-augmentation"]
+---
+
+# Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
+
+**Authors**: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
+**Venue**: ICLR 2026
+**arXiv**: [2601.20614](https://arxiv.org/abs/2601.20614)
+**Code**: https://github.com/AMAP-ML/MathForge
+
+## Abstract
+
+Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, the authors identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives.
+
+**Algorithmically**: GRPO has an implicit imbalance in which the magnitude of policy updates peaks at an accuracy rate of p=0.5 and shrinks for harder questions (p below 0.5).
+
+**Data-wise**: Augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty.
+
+**Solution: MathForge** — a two-dual framework comprising:
+1. **DGPO** (Difficulty-Aware Group Policy Optimization): rectifies GRPO's imbalance via difficulty-balanced group advantage estimation (DGAE) and difficulty-aware question-level weighting (DQW)
+2. **MQR** (Multi-Aspect Question Reformulation): reformulates questions across multiple aspects (Background, Term, Sub-Problem) to increase difficulty while preserving the original gold answer
+
+## Key Findings
+
+- GRPO's total update magnitude for a single question is ∝ 2G√(p(1-p)), peaking at p=0.5
+- DGAE replaces std with MAD, achieving constant update magnitude (G) regardless of accuracy
+- MathForge achieves 42.17% avg on 6 benchmarks vs 37.61% for GRPO (Qwen2.5-Math-7B)
+- MQR generates three types of reformulations with 97-99% answer preservation rate
+
+## Core Equations
+
+**GRPO Advantage (imbalanced)**:
+$$\hat{A}_{GR,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{std}(\{r_i\}_{i=1}^G)}$$
+
+**DGAE Advantage (balanced)**:
+$$\hat{A}_{DG,i} = \frac{r_i - \text{mean}(\{r_i\}_{i=1}^G)}{\text{MAD}(\{r_i\}_{i=1}^G)}$$
+
+**DQW Weighting**:
+$$\lambda_s = B_v \cdot \frac{\exp(D_s/T)}{\sum_{s=1}^{B_v} \exp(D_s/T)}, \quad D_s = -\text{mean}(\{r_{si}\}_{i=1}^G)$$
+
+## Key Concepts
+
+- [[dgpo|DGPO]] — Difficulty-Aware GRPO algorithm
+- [[dgae|DGAE]] — Difficulty-Balanced Group Advantage Estimation
+- [[dqw|DQW]] — Difficulty-Aware Question-Level Weighting
+- [[mqr|MQR]] — Multi-Aspect Question Reformulation
+- [[mathforge]] — The complete MathForge framework
+- [[grpo]] — Group Relative Policy Optimization
+- [[update-magnitude-imbalance]] — GRPO's implicit difficulty imbalance
+- [[math-question-reformulation]] — MQR's three reformulation strategies
+- [[rlvr-unified-framework]] — RLVR training paradigm
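The update-magnitude claims in the MathForge note above (GRPO: 2G√(p(1-p)); DGAE: constant G) can be checked numerically for binary rewards. This is a small sketch under two assumptions: `std` is the population standard deviation, and MAD is the mean absolute deviation from the mean. The helper names are hypothetical.

```python
from statistics import fmean, pstdev


def total_update_magnitude(rewards, scale):
    """Sum of |advantage| over a group, for a given normalizer."""
    mean = fmean(rewards)
    return sum(abs(r - mean) / scale for r in rewards)


def grpo_magnitude(rewards):
    # GRPO normalizes by the (population) standard deviation.
    return total_update_magnitude(rewards, pstdev(rewards))


def dgae_magnitude(rewards):
    # DGAE normalizes by the mean absolute deviation instead.
    mean = fmean(rewards)
    mad = fmean([abs(r - mean) for r in rewards])
    return total_update_magnitude(rewards, mad)


G = 8
for correct in (1, 2, 4, 6, 7):  # skip 0 and G: all-equal rewards give a zero normalizer
    rewards = [1.0] * correct + [0.0] * (G - correct)
    p = correct / G
    print(p, round(grpo_magnitude(rewards), 3), round(dgae_magnitude(rewards), 3))
```

The GRPO column peaks at p=0.5 and falls off symmetrically, while the DGAE column stays at G=8 for every accuracy rate.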
--- a/raw/papers/deepseek-visual-primitives-2026.md
+++ b/raw/papers/deepseek-visual-primitives-2026.md
@@ -0,0 +1,67 @@
+---
+title: "Thinking with Visual Primitives"
+authors: "Ruijie Lu, Yiyang Ma, Xiaokang Chen (Project Lead), Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, Yutong Lin, Hao Li, Wen Liu, Zhewen Hao, Xi Gao, Shaoheng Nie, Yixuan Wei, Zhenda Xie, Ting Chen, Gang Zeng"
+affiliations: "DeepSeek-AI, Peking University, Tsinghua University"
+year: 2026
+source: "https://github.com/deepseek-ai/Thinking-with-Visual-Primitives"
+domain: "Multimodal AI / Visual Reasoning"
+tags: [visual-primitives, multimodal, chain-of-thought, spatial-reasoning, token-efficiency, deepseek]
+---
+
+# Thinking with Visual Primitives
+
+**Authors:** Ruijie Lu¹²\*, Yiyang Ma¹\*, Xiaokang Chen¹\*‡, Lingxiao Luo¹³\*, Zhiyu Wu¹\*, Zizheng Pan¹\*, Xingchao Liu¹\*, Yutong Lin¹, Hao Li¹, Wen Liu¹, Zhewen Hao¹, Xi Gao¹, Shaoheng Nie¹, Yixuan Wei¹, Zhenda Xie¹, Ting Chen³, Gang Zeng²
+- ¹ DeepSeek-AI, ² Peking University, ³ Tsinghua University
+- \* Core contributors, ‡ Project lead
+
+## Abstract
+
+Despite the remarkable progress in Multimodal Large Language Models (MLLMs), the prevailing Chain-of-Thought (CoT) paradigms remain predominantly confined to the linguistic space. While recent advancements have focused on bridging the [[perception-gap]] through high-resolution cropping, they overlook a more fundamental bottleneck: the **[[reference-gap]]**. The inherent ambiguity of natural language often fails to provide precise, unambiguous pointers to complex spatial layouts, leading to logical collapse in tasks requiring rigorous grounding.
+
+In this work, the authors introduce **Thinking with Visual Primitives**, a novel reasoning framework that elevates spatial markers—such as points and bounding boxes—to "minimal units of thought". By interleaving these [[visual-primitives]] directly into the thinking process, the model can "point" while it "reasons", effectively grounding its cognitive trajectory in the physical coordinates of the image.
+
+The framework is built on a highly optimized architecture with extreme visual token efficiency. Despite its compact model scale and significantly lower image-token budget, the model achieves frontier-competitive performance on challenging visual QA tasks, matching or exceeding models such as GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash.
+
+## Key Concepts
+
+- [[visual-primitives]] — Bounding boxes and points as minimal cognitive units
+- [[reference-gap]] — Language ambiguity in spatial referencing
+- [[perception-gap]] — Seeing vs. reasoning in MLLMs
+- [[compressed-sparse-attention]] — KV cache compression (7056× ratio)
+- [[mixture-of-experts]] — 284B total / 13B active parameters
+- [[specialized-sft]] — Train separate experts for box/point primitives
+- [[specialized-rl]] — GRPO-based RL per expert
+- [[group-relative-policy-optimization]] — RL algorithm
+- [[unified-rft]] — Rejection Fine-Tuning to unify experts
+- [[on-policy-distillation]] — Consolidating expert capabilities
+- [[coarse-grained-counting]] — Category-level counting with boxes
+- [[fine-grained-counting]] — Attribute-constrained counting
+- [[maze-navigation]] — Topological reasoning with point primitives
+- [[path-tracing]] — Curve following with visual primitives
+- [[exponential-decay-reward]] — Smooth reward for counting accuracy
+- [[bidirectional-trajectory-evaluation]] — Forward+reverse path scoring
+- [[token-efficiency]] — 7056× overall compression from pixels to KV cache
+
+## Architecture
+
+- Vision: [[deepseek-vit]] (in-house ViT, 14×14 patch, 3×3 spatial compression)
+- Language: [[deepseek-v4-flash]] (284B MoE, 13B active)
+- KV cache: [[compressed-sparse-attention]] — further 4× compression
+- Overall compression: 756×756 image → 2,916 patches → 324 visual tokens → 81 KV entries (7056×)
+
+## Training Pipeline
+
+1. **Pretraining**: Web-scale data curation (97,984 sources → 31,701 after filtering, >40M samples) for visual primitive capabilities
+2. **Specialized SFT**: Separate training for box-grounding (FTwG) and point-tracking (FTwP)
+3. **Specialized RL**: GRPO with Format/Quality/Accuracy reward models
+4. **Unified RFT**: On-policy rollouts → rejection sampling → unified SFT
+5. **On-Policy Distillation**: KL-divergence consolidation of expert models
+
+## Key Results
+
+- CountQA: 66.1/75.1 (EM/RA@10) vs Gemini-3-Flash 48.3/60.3
+- Pixmo-Count: 89.2 EM
+- SpatialMQA: 69.4 ACC
+- DS_Maze_Navigation: 66.9 ACC (frontier models ~49-50)
+- DS_Path_Tracing: 56.7 ACC (frontier models ~25-46)
+- Token consumption: ~90 KV entries vs 660-1100 for frontier models
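The compression figures in the Architecture section above follow from simple arithmetic. The sketch below reproduces them from the stated patch size, spatial compression, and KV compression factor; the function and key names are illustrative only.

```python
def compression_pipeline(side=756, patch=14, spatial=3, kv_factor=4):
    """Trace a square image through the stated compression stages."""
    pixels = side * side
    patches = (side // patch) ** 2                   # 14x14 ViT patches
    visual_tokens = patches // (spatial * spatial)   # 3x3 spatial compression
    kv_entries = visual_tokens // kv_factor          # further 4x KV compression
    return {
        "pixels": pixels,
        "patches": patches,
        "visual_tokens": visual_tokens,
        "kv_entries": kv_entries,
        "pixels_per_kv_entry": pixels // kv_entries,
    }


print(compression_pipeline())
# {'pixels': 571536, 'patches': 2916, 'visual_tokens': 324,
#  'kv_entries': 81, 'pixels_per_kv_entry': 7056}
```

The 7056× figure is therefore a pixels-to-KV-entries ratio, not a ratio between token counts.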
--- a/raw/papers/dou-cl-bench-2026.md
+++ b/raw/papers/dou-cl-bench-2026.md
@@ -0,0 +1,39 @@
+# CL-bench: A Benchmark for Context Learning
+
+## Metadata
+- **Title**: CL-bench: A Benchmark for Context Learning
+- **Authors**: Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang et al. (27 authors from Fudan University & Tencent Hunyuan)
+- **arXiv ID**: 2602.03587v1 [cs.CL]
+- **Date**: 2026-02-03
+- **Size**: 78 pages, 17 figures
+- **URL**: https://arxiv.org/abs/2602.03587
+
+## Abstract
+
+Current language models excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what is learned during pre-training to reason and resolve tasks. We term this capability **context learning**, a crucial ability that humans naturally possess but has been largely overlooked.
+
+To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context, ranging from new domain-specific knowledge, rule systems, and complex procedures to laws derived from empirical data, all of which are absent from pre-training.
+
+This goes far beyond long-context tasks that primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns via instructions and demonstrations. Our evaluations of ten frontier LMs find that models solve only 17.2% of tasks on average. Even the best-performing model, GPT-5.1, solves only 23.7%, revealing that LMs have yet to achieve effective context learning.
+
+## Key Statistics
+- 500 contexts, 1,899 tasks, 31,607 verification rubrics
+- 4 context categories → 18 subcategories
+- Average ~20 hours expert effort per context
+- Contamination-free design (fictional creation, modification, niche content)
+
+## Four Context Categories
+1. **Domain Knowledge Reasoning** (7 subcategories): Finance, Healthcare, Humanities, Legal Advisory, Lifestyle, Management, Science
+2. **Rule System Application** (5 subcategories): Game Mechanics, Mathematical Formalism, Programming Syntax, Legal & Regulatory, Technical Standards
+3. **Procedural Task Execution** (3 subcategories): Instructional, Operational, Workflow Orchestration
+4. **Empirical Discovery & Simulation** (3 subcategories): Experimental Data, Observational Data, Simulation Environment
+
+## Evaluated Models (Top 10)
+GPT-5.1, Claude Opus 4.5, GPT-5.2, o3, Kimi K2, HY 2.0, Gemini 3 Pro, Qwen 3 Max, Doubao 1.6, DeepSeek V3.2
+
+## Key Findings
+1. Context learning is a fundamental bottleneck: best model only 23.7%
+2. Performance varies dramatically across categories (Domain Knowledge: 25.3% vs Empirical Discovery: ~11%)
+3. Mathematical formalism is the hardest subcategory (<15% for most models)
+4. Legal & regulatory subcategory surprisingly tractable (>40% for GPT-5.1)
+5. Task difficulty is NOT correlated with context length — reasoning quality matters more
--- a/raw/papers/elf-embedded-language-flows-2026.md
+++ b/raw/papers/elf-embedded-language-flows-2026.md
@@ -0,0 +1,28 @@
+---
+title: "ELF: Embedded Language Flows"
+created: 2026-05-13
+updated: 2026-05-13
+type: raw-paper
+source: https://arxiv.org/abs/2605.10938
+tags: [diffusion-language-model, flow-matching, continuous-embeddings, language-generation]
+---
+
+# ELF: Embedded Language Flows
+
+**arXiv:** 2605.10938
+**Authors:** Keya Hu*, Linlu Qiu*, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, Kaiming He (MIT; *equal contribution)
+**Date:** 2026-05-11
+**Categories:** cs.CL, cs.AI, cs.LG
+**Code:** https://github.com/lillian039/ELF
+
+## Abstract
+
+Diffusion and flow-based models have become the de facto approaches for generating continuous data. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps.
+
+## Key Claims
+
+1. Continuous DLMs can match/exceed discrete DLMs with proper design — the performance gap is due to algorithmic choices, not inherent discreteness of language.
+2. **Shared-weight discretization**: A single network handles both denoising (MSE loss, t<1) and decoding (CE loss, t=1) via a binary mode token, eliminating the need for a separate decoder.
+3. **x-prediction** parameterization aligns denoising and decoding objectives, enabling effective weight sharing that v-prediction cannot support.
+4. **CFG is naturally applicable** to continuous DLMs and significantly improves generation quality; training-time CFG avoids inference overhead.
+5. ELF-B (105M) outperforms 170M baselines (MDLM, Duo, FLM, LangFlow) with **10× fewer training tokens** and **fewer sampling steps** (32 vs 1024), without distillation.
--- a/raw/papers/he-urlvr-sharpening-2026.md
+++ b/raw/papers/he-urlvr-sharpening-2026.md
@@ -0,0 +1,29 @@
+# How Far Can Unsupervised RLVR Scale LLM Training?
+
+- **arXiv ID**: 2603.08660
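+- **Authors**: Bingxiang He, Yuxin Zuo, Zeyuan Liu, et al. (22 authors)
+- **Institutions**: Tsinghua University, Shanghai AI Lab, Xi'an Jiaotong, UIUC, SJTU, Peking University, Frontis.AI
+- **Date**: 2026-03-09
+- **Venue**: Accepted to ICLR 2026
+- **GitHub**: https://github.com/PRIME-RL/TTRL
+- **Tags**: #RLVR #unsupervised-learning #LLM-training #reward-hacking #model-collapse
+
+## Abstract
+
+Unsupervised reinforcement learning with verifiable rewards (URLVR) scales LLM training with reward signals that require no ground-truth labels. This paper builds a unified theoretical framework showing that all intrinsic-reward methods essentially converge to sharpening the model's initial distribution: gains are amplified when initial confidence is aligned with correctness, and training fails catastrophically when the two are misaligned. Experiments show that intrinsic rewards consistently follow a rise-then-fall pattern. The paper proposes the Model Collapse Step as a practical indicator of the model prior. Finally, it explores external-reward methods based on computational asymmetry (self-verification) and presents preliminary evidence that they may break through the confidence-correctness ceiling.
+
+## Core Contributions
+
+1. **URLVR taxonomy**: divides methods into intrinsic-reward and external-reward classes
+2. **Unified theoretical framework**: proves that all intrinsic methods converge to sharpening the initial distribution
+3. **Rise-then-fall pattern**: systematic experiments across many methods confirm a common rise-then-fall trajectory
+4. **Model Collapse Step**: a measure of the model prior that needs no ground-truth labels and predicts RL trainability
+5. **External path forward**: self-verification shows sustained improvement without the collapse pattern
+
+## Structure
+- Sec 2: Taxonomy of URLVR methods (Certainty-based / Ensemble-based)
+- Sec 3: Theoretical derivation of the sharpening mechanism
+- Sec 4: When intrinsic URLVR works and when it fails
+- Sec 5: Safe application in test-time training
+- Sec 6: The Model Collapse Step metric
+- Sec 7: Breakthroughs with external-reward methods (self-verification)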
--- a/raw/papers/hunyuan-team-cl-bench-life-2026.md
+++ b/raw/papers/hunyuan-team-cl-bench-life-2026.md
@@ -0,0 +1,40 @@
+# CL-BENCH LIFE: Can Language Models Learn From Real-Life Context?
+
+## Metadata
+- **Title**: CL-BENCH LIFE: Can Language Models Learn From Real-Life Context?
+- **Authors**: Hunyuan Team (Tencent) & Fudan University
+- **arXiv ID**: 2604.27043v1
+- **Category**: cs.CL
+- **Date**: 2026-04-29
+- **URL**: https://arxiv.org/abs/2604.27043
+
+## Abstract
+
+Today's AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of the contexts they must handle also shifts. Real-life contexts are often messy, fragmented, and deeply tied to personal and social experience, such as multi-party conversations, personal archives, and behavioral traces. Yet it remains unclear whether current frontier language models can reliably learn from such contexts and solve tasks grounded in them.
+
+To this end, we introduce CL-bench Life, a fully human-curated benchmark comprising 405 context-task pairs and 5,348 verification rubrics, covering common real-life scenarios. Solving tasks in CL-bench Life requires models to reason over complex, messy real-life contexts, calling for strong real-life context learning abilities that go far beyond those evaluated in existing benchmarks.
+
+We evaluate ten frontier LMs and find that real-life context learning remains highly challenging: even the best-performing model achieves only 19.3% task solving rate, while the average performance across models is only 13.8%. Models still struggle to reason over contexts such as messy group chat histories and fragmented behavioral records from everyday life.
+
+## Key Statistics
+- 405 context-task pairs
+- 5,348 verification rubrics
+- 3 context categories × 3 subcategories = 9 subcategories
+- 59.8% multi-turn interactions
+- Context length range: 5.4K – 170.8K tokens (avg 19.4K)
+
+## Three Context Categories
+1. **Communication & Social Interactions**: Private chats, group discussions, meeting transcripts, public community interactions
+2. **Fragmented Information & Revisions**: Personal notes, public information streams, creation/revision histories
+3. **Behavioral Records & Activity Trails**: Game logs, digital footprints, browsing streams, long-term daily activity records
+
+## Key Findings
+1. Real-life context learning is extremely challenging (best model 19.3%, avg 13.8%)
+2. Poor performance is NOT simply a long-context problem — solving rate doesn't strongly correlate with context length
+3. Reasoning mode improves performance but with diminishing returns; token efficiency varies dramatically across models
+4. **Context misuse** (not ignoring) is the primary failure mode — 76-84% of errors are context misuse
+5. Group chat scenarios cause identity confusion and reference resolution failures
+6. Self-tracking trajectories is the hardest subcategory (best: 10.4%)
+
+## Evaluated Models
+GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Hy3 preview, Seed 2.0 Pro, Kimi K2.5, Qwen 3.5 Plus, Grok 4.20, DeepSeek V3.2 Thinking, MiniMax M2.5
--- a/raw/papers/laban-delegate52-2026.md
+++ b/raw/papers/laban-delegate52-2026.md
@@ -0,0 +1,30 @@
+---
+title: "LLMs Corrupt Your Documents When You Delegate"
+created: 2026-05-14
+type: paper
+source: https://arxiv.org/abs/2604.15597
+authors: ["Philippe Laban", "Tobias Schnabel", "Jennifer Neville"]
+---
+
+# LLMs Corrupt Your Documents When You Delegate
+
+- **arXiv ID**: 2604.15597
+- **Authors**: Philippe Laban, Tobias Schnabel, Jennifer Neville (Microsoft Research)
+- **Categories**: cs.CL (Computation and Language), cs.HC (Human-Computer Interaction)
+- **Published**: 2026-04-17
+- **Repository**: microsoft/DELEGATE52
+- **Dataset**: datasets/microsoft/DELEGATE52
+
+## Abstract
+
+Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust — the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.
+
+## Key Metrics
+
+- 19 LLMs tested across 6 model families
+- 310 work environments across 52 professional domains
+- Frontier models average ~25% degradation after 20 interactions
+- All-model average ~50% degradation after 20 interactions
+- Python is the only domain where most models (17/19) achieve "ready" status
+- Critical failures account for ~80% of total degradation
+- Agentic tool use incurs 2-5x input token overhead
--- a/raw/papers/liu-koopa-2023.md
+++ b/raw/papers/liu-koopa-2023.md
@@ -0,0 +1,44 @@
+---
+title: "Koopa: Learning Non-stationary Time Series Dynamics with Koopman Predictors"
+arxiv: "2305.18803"
+venue: "NeurIPS 2023"
+authors: "Yong Liu, Chenyu Li, Jianmin Wang, Mingsheng Long (Tsinghua University)"
+type: paper
+tags: [time-series, koopman-theory, deep-learning, forecasting, non-stationary]
+---
+
+# Koopa: Learning Non-stationary Time Series Dynamics with Koopman Predictors
+
+> NeurIPS 2023 | Tsinghua University | [Code](https://github.com/thuml/Koopa)
+
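+## Core Problem
+
+Real-world time series are inherently non-stationary (time-varying statistics and temporal dependencies), which creates a train-inference distribution gap and makes it hard for deep forecasting models to generalize.
+
+## Methodology
+
+**Koopman theory**: maps nonlinear dynamics into an infinite-dimensional linear space, where the dynamics are driven by a linear operator K:
+K ∘ g(x_t) = g(F(x_t)) = g(x_{t+1})
+
+**Fourier Filter**: decomposes a non-stationary series into time-variant (high-frequency) and time-invariant (low-frequency) components, each fed to its own Koopman Predictor.
+
+**Koopman Predictor**:
+- Learns a measurement function g that realizes the Koopman embedding
+- Uses a linear Koopman operator to describe the implicit state transition
+- Context-aware operator: computed within a local temporal neighborhood to capture the strong locality of time-variant dynamics
+- Can use real observations for rolling forecasts, extending the forecast horizon
+
+**Stackable blocks**: hierarchical dynamics learning, with each layer performing decomposition and prediction.
+
+## Key Results
+
+- Performance competitive with the state of the art
+- Training time reduced by **77.3%** and memory by **76.0%** (averaged over 6 real-world benchmarks)
+- Optimized end-to-end with the forecasting objective (no reconstruction loss required)
+
+## Key Technical Points
+
+1. The Fourier Filter disentangles time-variant and time-invariant components
+2. The Koopman operator provides a linear portrait of the implicit dynamics
+3. The context-aware operator handles local time-variant behavior
+4. A deep residual structure integrates the Koopman Predictors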
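The linear-operator idea in the Koopa note above, K ∘ g(x_t) = g(x_{t+1}), can be sketched with a least-squares fit over a local window of embeddings. This is a sketch under stated assumptions rather than Koopa's code: the measurement function g is taken to be the identity (Koopa learns it with a network), the operator is fitted DMD-style with `numpy.linalg.lstsq`, and the names `fit_koopman_operator` and `rollout` are hypothetical.

```python
import numpy as np


def fit_koopman_operator(embeddings):
    """Fit K so that z[t + 1] ≈ z[t] @ K over a window of embeddings (T, D)."""
    z_back, z_fore = embeddings[:-1], embeddings[1:]
    k, *_ = np.linalg.lstsq(z_back, z_fore, rcond=None)
    return k


def rollout(k, z_last, steps):
    """Advance the last embedding `steps` times with the linear operator."""
    preds, z = [], z_last
    for _ in range(steps):
        z = z @ k
        preds.append(z)
    return np.stack(preds)


# Toy dynamics: a slow rotation, which is exactly linear in this embedding.
theta = 0.1
A = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])
states = [np.array([1.0, 0.0])]
for _ in range(49):
    states.append(states[-1] @ A)
Z = np.stack(states)                      # shape (50, 2)

K = fit_koopman_operator(Z[:40])          # fit on the first 40 steps
forecast = rollout(K, Z[39], steps=10)    # predict steps 40..49
print(np.allclose(forecast, Z[40:50]))
```

Fitting on a short local window is what makes the operator "context-aware": a new K is computed for each window instead of one global matrix.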
--- a/raw/papers/ramsey-numbers-survey-2025.md
+++ b/raw/papers/ramsey-numbers-survey-2025.md
@@ -0,0 +1,51 @@
+---
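+title: "Ramsey Numbers: A Comprehensive Survey"
+source: "User-uploaded Markdown"
+date: 2025-06
+type: survey
+tags: [ramsey-theory, combinatorics, graph-theory, number-theory, additive-combinatorics, mathematical-logic]
+---
+
+# A Mathematical Survey of Ramsey Numbers
+
+> Mathematical concepts, known results, applications, and cross-disciplinary impact | June 2025
+
+## Core Thesis
+
+The core tenet of Ramsey theory: "complete disorder is impossible." Ramsey numbers make precise what "sufficiently large" means: any sufficiently large structure must contain some regular substructure.
+
+## Historical Timeline
+
+- **1928**: Frank Ramsey presents "On a Problem of Formal Logic" (published in 1930), founding the field
+- **1935**: Erdős & Szekeres rediscover the result and pose the "Happy Ending problem"
+- **1947**: Erdős introduces the probabilistic method and obtains a lower bound on Ramsey numbers
+- **1975**: Szemerédi regularity lemma; Lovász local lemma
+- **1977**: Paris-Harrington theorem: the first "natural" undecidable statement (true but unprovable in Peano arithmetic)
+- **2004**: Green-Tao theorem: the primes contain arbitrarily long arithmetic progressions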
+
+## Core Results
+
+### Diagonal Ramsey Numbers R(k)
+| k | R(k) | Notes |
+|---|------|------|
+| 3 | 6 | Classic party problem |
+| 4 | 18 | Greenwood-Gleason 1955 |
+| 5 | 43–48 | Exoo (lower bound), Angeltveit-McKay (upper bound) |
+| 6 | 102–165 | Exact value unknown |
+
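+### General Bounds
+- Lower bound: R(k) > 2^{k/2} (Erdős, probabilistic method)
+- Upper bound: R(k) ≤ 4^k / √k (from the Erdős-Szekeres 1935 binomial bound; Conlon 2009 improved it by a superpolynomial factor)
+- The exponential gap between the bounds (base √2 versus 4) is the central open problem
+
+## Key Proof Methods
+1. **Probabilistic method**: proves existence by showing that a random graph has the property with positive probability
+2. **Constructive methods**: algebraic constructions such as Paley graphs over finite fields
+3. **Algebraic/spectral methods**: Conlon (2023) uses matrix multiplication to improve the upper bound
+
+## Cross-Disciplinary Applications
+- **Computer science**: fault tolerance in distributed systems, network design, reinforcement-learning search for Ramsey numbers
+- **Cryptography**: randomness extractors, privacy amplification protocols
+- **Physics**: Ramsey-theoretic analysis of the phase-change material GST
+- **Biology**: gene regulatory networks, neural connectivity patterns
+- **Social science**: group formation, social network analysis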
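The first row of the diagonal Ramsey table above, R(3) = 6, is small enough to verify by exhaustive search. The sketch below enumerates every red/blue coloring of the edges of K_n and checks for a monochromatic triangle; the function names are illustrative. The same brute force is hopeless for larger k, which is why R(5) is still only known to lie in a range.

```python
from itertools import combinations, product


def has_mono_triangle(n, coloring):
    """True if some triangle of K_n has all three edges in one color."""
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )


def every_coloring_has_mono_triangle(n):
    """Check all 2-colorings of the edges of K_n (2^(n choose 2) cases)."""
    edges = list(combinations(range(n), 2))
    for colors in product((0, 1), repeat=len(edges)):
        if not has_mono_triangle(n, dict(zip(edges, colors))):
            return False  # found a coloring that avoids monochromatic triangles
    return True


print(every_coloring_has_mono_triangle(5))  # False: K_5 can avoid them
print(every_coloring_has_mono_triangle(6))  # True: so R(3) = 6
```

K_6 has 15 edges, so the second call checks 32,768 colorings, which finishes quickly on any modern machine.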
--- a/raw/papers/song-agent-network-taxonomy-2026.md
+++ b/raw/papers/song-agent-network-taxonomy-2026.md
@@ -0,0 +1,44 @@
+# Complex networks of AI agentic systems: topology, memory, and update dynamics
+
+## Metadata
+- **Title**: Complex networks of AI agentic systems: topology, memory, and update dynamics
+- **Authors**: Xinyuan Song (Emory), Qingsong Wen (Oxford), Shirui Pan (Griffith), Liang Zhao (Emory)
+- **DOI**: 10.36227/techrxiv.177127384.46731320/v1
+- **Type**: Survey / Preprint (TechRxiv, IEEE)
+- **Date**: 2026-02-16
+- **License**: CC BY 4.0
+- **URL**: https://www.techrxiv.org/doi/full/10.36227/techrxiv.177127384.46731320/v1
+
+## Abstract
+
+Large-scale networks of agents are increasingly applied to software engineering, scientific analysis, web automation, organizational workflows, and social simulation, yet existing multi-agent architectures lack a unified framework to explain why some designs scale to long-horizon, multi-step tasks while others fail. As these systems grow, their behavior is fundamentally shaped by how agents are connected, how information is stored, and how states are updated over time.
+
+In this survey, we introduce a hierarchical taxonomy of agent systems along three core dimensions—architecture topology (centralized vs. decentralized), memory scope (global vs. local), and update behavior (static vs. dynamic)—which together induce eight system categories that organize prior work and make architectural trade-offs explicit. Using this taxonomy, we analyze how design choices influence scalability, coordination efficiency, communication overhead, planning depth, and robustness under partial failure, and we identify common failure modes and open challenges, including consistency management, agent routing, federation boundaries, and stability under noise or disruption.
+
+## Key Contributions
+
+1. **Formal definition**: Agent system as A = (V, E, M, Π) — agents, communication graph, memory configuration, policies
+2. **Hierarchical taxonomy**: 3 nested dimensions → 8 system categories
+3. **Communication stack**: Transport → Structural (Function Calling) → Semantic layer
+4. **MCP integration**: Model Context Protocol as standardized substrate for large-scale agent networks
+
+## Eight System Categories
+
+| Topology | Memory | Update | Representative Systems |
+|----------|--------|--------|----------------------|
+| Centralized | Global | Static | MetaGPT, ChatDev, AutoGen, HuggingGPT |
+| Centralized | Global | Dynamic | SWE-agent, OpenHands, Voyager, Multi-Agent Debate |
+| Centralized | Local | Static | MetaAgent, YuLan-OneSim, SOTOPIA-S4 |
+| Centralized | Local | Dynamic | OPTIMA, Magentic-One, G-Designer |
+| Decentralized | Global | Static | BlackBoard, LLMBlackBoard, Memory Sharing |
+| Decentralized | Global | Dynamic | GPTSwarm, AgentSociety, OpenAgents |
+| Decentralized | Local | Static | MMAgent, WebArena, TalkHier |
+| Decentralized | Local | Dynamic | Generative Agents, 1000-Person Sims, AgentNet, SOTOPIA-S |
+
+## Key Challenges Identified
+1. High communication load with agent count growth
+2. Context propagation and drift under distributed execution
+3. Ordering and concurrency in asynchronous systems
+4. Interpretation mismatch across heterogeneous agent models
+5. Update instability from concurrent state modifications
+6. Security and trust as attack surface expands
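The "3 dimensions → 8 categories" structure of the taxonomy above is a Cartesian product of three binary choices. The sketch below enumerates it in the same row order as the table; the `SystemCategory` dataclass and `CATEGORIES` list are illustrative names, not part of the survey.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SystemCategory:
    topology: str  # Centralized or Decentralized
    memory: str    # Global or Local
    update: str    # Static or Dynamic


CATEGORIES = [
    SystemCategory(topology, memory, update)
    for topology, memory, update in product(
        ("Centralized", "Decentralized"),
        ("Global", "Local"),
        ("Static", "Dynamic"),
    )
]

for category in CATEGORIES:
    print(category)
```

Keeping the category as a frozen (hashable) dataclass makes it usable as a dictionary key, for example to map each category to its representative systems.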
--- a/raw/papers/xiao-streaming-llm-2024.md
+++ b/raw/papers/xiao-streaming-llm-2024.md
@@ -0,0 +1,33 @@
+---
+arxiv: "2309.17453"
+title: "Efficient Streaming Language Models with Attention Sinks"
+authors: ["Guangxuan Xiao", "Yuandong Tian", "Beidi Chen", "Song Han", "Mike Lewis"]
+venue: "ICLR 2024"
+affiliations: ["MIT", "Meta AI", "CMU", "NVIDIA"]
+year: 2024
+url: "https://arxiv.org/abs/2309.17453"
+code: "https://github.com/mit-han-lab/streaming-llm"
+type: paper
+---
+
+# Efficient Streaming Language Models with Attention Sinks
+
+## Abstract
+
+Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach — but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely **attention sink**, that keeping the KV of initial tokens will largely recover the performance of window attention. We first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce **StreamingLLM**, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2× speedup.
+
+## Key Contributions
+
+1. **Attention Sink Discovery**: Initial tokens receive disproportionately high attention scores across all layers and heads, not due to semantics but due to their absolute position — they serve as "sinks" for excess attention that the SoftMax function forces to be allocated somewhere.
+
+2. **StreamingLLM Framework**: A simple, training-free method that keeps attention sink tokens' KV (just 4 initial tokens suffice) together with a sliding window of recent tokens, enabling infinite-length streaming inference.
+
+3. **Sink Token Pre-training**: Demonstrates that pre-training with a dedicated learnable sink token allows models to use a single token as the attention sink, eliminating the need for multiple initial tokens.
+
+4. **Universal Validation**: Tested across Llama-2 (7/13/70B), MPT (7/30B), Falcon (7/40B), Pythia (2.9/6.9/12B) with both RoPE and ALiBi position encodings, achieving stable perplexity on up to 4M tokens.
+
+## Core Mechanism
+
+The SoftMax function in attention requires all attention scores to sum to 1. When the current query has no strong semantic match, the model still needs to allocate residual attention values somewhere. Initial tokens, being visible to all subsequent tokens (due to autoregressive nature), become naturally trained as attention sinks.
+
+StreamingLLM's KV cache has two components: (1) **Attention Sinks** (4 initial tokens) for stable attention computation, and (2) **Rolling KV Cache** (most recent tokens) for language modeling. Positions are assigned within the cache rather than the original text, which is crucial for performance.
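The two-part cache described in the StreamingLLM note above (a few attention-sink tokens plus a rolling window, with positions assigned inside the cache) can be sketched as an eviction policy. This is a toy model under stated assumptions, not the StreamingLLM implementation: token ids stand in for per-token key/value tensors, and the class name `SinkKVCache` and its methods are hypothetical.

```python
from collections import deque


class SinkKVCache:
    """Keep the first `num_sinks` tokens forever plus the most recent `window`."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                     # attention sinks, never evicted
        self.recent = deque(maxlen=window)  # rolling window, oldest dropped first

    def append(self, token_id):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token_id)
        else:
            self.recent.append(token_id)    # deque evicts the oldest entry when full

    def cached_tokens(self):
        return self.sinks + list(self.recent)

    def cache_positions(self):
        # Positions are indices within the cache, not in the original text.
        return list(range(len(self.sinks) + len(self.recent)))


cache = SinkKVCache(num_sinks=4, window=4)
for token_id in range(12):
    cache.append(token_id)

print(cache.cached_tokens())    # [0, 1, 2, 3, 8, 9, 10, 11]
print(cache.cache_positions())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Tokens 4 to 7 are evicted, yet the surviving entries are numbered 0 to 7, so the model never sees a positional gap larger than the cache size.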