This commit is contained in:
2026-06-01 10:46:01 +08:00
parent 2faf4bb002
commit e96b955fda
221 changed files with 10219 additions and 332 deletions

@@ -0,0 +1,61 @@
---
title: "The Bayesian Geometry of Transformer Attention"
authors: "Naman Agarwal, Siddhartha R. Dalal, Vishal Misra"
arxiv: "2512.22471"
venue: "arXiv (cs.LG)"
date: "2026-05"
type: "paper"
series: "Bayesian Attention Trilogy, Paper I"
---
# The Bayesian Geometry of Transformer Attention
**Paper I of the Bayesian Attention Trilogy**
**Authors**: Naman Agarwal (Dream Sports → Google DeepMind), Siddhartha R. Dalal (Columbia), Vishal Misra (Columbia)
## TL;DR
Small transformers match the exact Bayesian posterior to within 10⁻³ to 10⁻⁴ bits in **Bayesian wind tunnels**: controlled environments where the true posterior is known in closed form and memorization is provably impossible. MLPs fail by orders of magnitude.
## Core Framework: Bayesian Wind Tunnels
Controlled prediction tasks where:
1. Analytic posterior is known exactly at each step
2. Hypothesis space is too large for memorization
3. In-context prediction requires genuine probabilistic inference
This setup converts the question "does it do Bayes?" into a quantitative test: **does the model's predictive entropy match the analytic posterior entropy?**
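
The entropy test can be made concrete with a toy example. The sketch below is not one of the paper's tasks: it assumes a finite hypothesis set (two biased coins) so that the analytic posterior is easy to compute, and `model_probs` is a made-up stand-in for a network's predictive distribution. All function names are illustrative.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def posterior(prior, likelihood, observations):
    """Exact Bayes update over a finite hypothesis set.

    likelihood[h][o] is P(observation o | hypothesis h).
    """
    post = list(prior)
    for o in observations:
        post = [w * likelihood[h][o] for h, w in enumerate(post)]
        total = sum(post)
        post = [w / total for w in post]
    return post

def predictive(post, likelihood):
    """Posterior predictive distribution over the next observation."""
    n_outcomes = len(likelihood[0])
    return [sum(w * likelihood[h][o] for h, w in enumerate(post))
            for o in range(n_outcomes)]

def entropy_gap_bits(model_probs, post, likelihood):
    """Absolute gap between model and analytic predictive entropy, in bits."""
    return abs(entropy_bits(model_probs) - entropy_bits(predictive(post, likelihood)))

# Toy wind tunnel (assumption): two coins with P(heads) = 0.9 and 0.1.
likelihood = [[0.9, 0.1], [0.1, 0.9]]
post = posterior([0.5, 0.5], likelihood, observations=[0])  # saw one head
model_probs = [0.80, 0.20]  # hypothetical model output, not real data
gap = entropy_gap_bits(model_probs, post, likelihood)
```

A model that passes the test would drive `gap` toward zero at every step of the sequence, not only on average.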
## Three Inference Primitives
| Primitive | Definition | Required for |
|-----------|-----------|-------------|
| Belief Accumulation | Integrating evidence into running posterior | Bijection learning, HMM |
| Belief Transport | Propagating beliefs through stochastic dynamics | HMM filtering |
| Random-Access Binding | Retrieving by content, not position | Associative recall |
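
For reference, belief transport and belief accumulation correspond to the two halves of the standard HMM forward-filter step. The sketch below is the textbook recursion, not code from the paper; the matrices are made-up example values.

```python
def hmm_filter_step(belief, transition, emission, obs):
    """One step of exact HMM filtering.

    transition[i][j] = P(next state j | current state i)
    emission[j][o]   = P(observation o | state j)
    """
    n = len(belief)
    # Belief transport: push the belief through the stochastic dynamics.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Belief accumulation: reweight by the evidence and renormalize.
    unnormalized = [predicted[j] * emission[j][obs] for j in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Example values (assumptions for illustration only).
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.8, 0.2], [0.3, 0.7]]
belief = hmm_filter_step([0.5, 0.5], transition, emission, obs=0)
```

An architecture that can only accumulate static sufficient statistics has no counterpart to the `predicted` line, which is one way to read the LSTM row in the realizability table.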
## Architectural Realizability
| Architecture | Accumulation | Transport | Binding | Status |
|-------------|:---:|:---:|:---:|--------|
| Transformer | ✅ | ✅ | ✅ | Full primitive completeness |
| Mamba (SSM) | ✅ | ✅ | ❌ | SOTA on HMM filtering; fails binding |
| LSTM | ✅ | ❌ | ❌ | Only static sufficient statistics |
| MLP | ❌ | ❌ | ❌ | Fails uniformly |
## Key Geometric Findings
- **Orthogonal key bases** in attention heads
- **Low-dimensional value manifold** parameterized by posterior entropy
- Mamba's final layer organizes into **5 clusters** — one per HMM hidden state (corner geometry of belief simplex)
## Structural Theorem
> The dominance of transformers in reasoning tasks arises not from scale alone, but from **primitive completeness**: they are the minimal architecture realizing the full set of inference primitives.
## Trilogy Context
- **Paper I** (this): Existence + internal geometry of exact Bayesian inference in transformers
- **Paper II**: Bayesian geometry arises generically from gradient dynamics under cross-entropy
- **Paper III**: How primitives compose in partially observed settings (closer to natural language)

@@ -0,0 +1,27 @@
---
source_url: user-upload
ingested: 2026-05-23
sha256: unknown
---
# Agent Harness Engineering: A Survey
## Metadata
- **Authors**: Junjie Li^1,6^*, Xi Xiao^6^*, Yunbei Zhang^5^*, Chen Liu^2^*, Lin Zhao^4, Xiaoying Liao^3, Yingrui Ji^6, Janet Wang^6, Jianyang Gu^7, Yingqiang Ge^9, Weijie Xu^9, Xi Fang^9, Xiang Xu^9, Tianchen Zhao^9, Youngeun Kim^9, Tianyang Wang^6, Jihun Hamm^5, Smita Krishnaswamy^2, Jun Huan^9, Chandan K Reddy^8,9
- **Institutions**: 1 CMU, 2 Yale, 3 JHU, 4 NEU, 5 Tulane, 6 UAB, 7 OSU, 8 Virginia Tech, 9 Amazon
- **Venue**: Under review at TMLR (Transactions on Machine Learning Research), 2026
- **Project Page**: Awesome-Agent-Harness
## Abstract
The rapid deployment of large language model (LLM) agents in production has revealed a recurring pattern: task execution reliability depends less on the underlying model than on the infrastructure layer that wraps it — the **agent execution harness**. This survey provides a practice-grounded, systematic treatment of agent harness engineering, organized around three claims:
1. **Binding-Constraint Thesis**: The agent harness is an independent system layer whose engineering quality drives a large share of real-world reliability
2. **ETCLOVG Taxonomy**: A seven-layer taxonomy (Execution environment, Tool interface, Context management, Lifecycle/Orchestration, Observability, Verification, Governance)
3. **Ecosystem Mapping**: 170+ open-source projects mapped onto this taxonomy
## Key Contributions
- Three-phase engineering evolution: Prompt → Context → Harness Engineering
- Cross-layer synthesis: Cost-Quality-Speed Trilemma, Capability-Control Tradeoff, Harness Coupling Problem
- Open-problem agenda: hardening and scaling execution, maintaining reliable state, diagnosing failures from traces, standardizing handoffs, and adaptive simplification

Binary file not shown.

@@ -0,0 +1,23 @@
---
source_url: https://arxiv.org/abs/2605.19376
ingested: 2026-05-23
sha256: unknown
---
# Generative Recursive Reasoning
- **Authors**: Junyeob Baek^1*, Mingyu Jo^1*, Minsu Kim^1,2, Mengye Ren^3, Yoshua Bengio^2,4, Sungjin Ahn^1,3†
- **Institutions**: 1 KAIST, 2 Mila Québec AI Institute, 3 New York University, 4 Université de Montréal
- **arXiv**: 2605.19376 (v2, 2026-05-19)
- **Category**: cs.AI
- **Project Page**: https://ahn-ml.github.io/gram-website
## Abstract
How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. GRAM turns recursive latent reasoning into probabilistic multi-trajectory computation, treating reasoning as a stochastic latent trajectory that enables multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting both conditional reasoning p_θ(y|x) and unconditional generation p_θ(x). Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks.
## Key Contributions
1. Formulates recursive reasoning as a latent-variable generative process
2. Introduces width-based inference-time scaling (depth + parallel trajectories)
3. Empirical evidence on Sudoku-Extreme, ARC-AGI, N-Queens, Graph Coloring, binarized MNIST
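
The two scaling axes (recursive depth and parallel trajectories) can be illustrated with a deliberately tiny toy. This is not GRAM's model: the "latent state" is a single float, the shared transition is a noisy gradient step on the constraint z² = 4, and every name and constant is an assumption chosen for illustration.

```python
import random

def refine(z, rng, lr=0.01, noise=0.01):
    """Shared stochastic transition: noisy gradient step on (z^2 - 4)^2."""
    grad = 4.0 * z * (z * z - 4.0)
    return z - lr * grad + rng.gauss(0.0, noise)

def sample_solutions(width, depth, seed=0):
    """Run `width` independent trajectories, each refined `depth` times."""
    rng = random.Random(seed)
    starts = [rng.uniform(-3.0, 3.0) for _ in range(width)]
    found = set()
    for z in starts:
        for _ in range(depth):  # depth axis: reuse the same transition
            z = refine(z, rng)
        found.add(round(z))     # decode the latent into a discrete answer
    return found

solutions = sample_solutions(width=16, depth=200)
```

A single deterministic trajectory can only land on one of the two roots; sampling several trajectories is what lets the toy recover both, which mirrors the multi-solution motivation in the abstract.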

@@ -0,0 +1,44 @@
---
title: "ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents"
created: 2026-05-12
type: paper
source: https://arxiv.org/abs/2605.12481
code: https://github.com/X-PLUG/ToolCUA
authors:
- Xuhao Hu (Fudan)
- Xi Zhang (Alibaba)
- Haiyang Xu (Alibaba)
- Kyle Qiao (Alibaba)
- Jingyi Yang (Fudan)
- Xuanjing Huang (Fudan)
- Jing Shao (Shanghai AI Lab)
- Ming Yan (Alibaba)
- Jieping Ye (Alibaba)
venue: arXiv:2605.12481, 2026
tags: [computer-use-agents, gui-tool-orchestration, reinforcement-learning, trajectory-optimization]
---
# ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
**Authors**: Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang, Jing Shao, Ming Yan, Jieping Ye
**Affiliations**: Tongyi Lab, Alibaba Group; Fudan University; Shanghai Artificial Intelligence Laboratory
**arXiv**: 2605.12481 | **Date**: May 12, 2026 | **Code**: https://github.com/X-PLUG/ToolCUA
## Abstract
Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale.
## Key Concepts
- [[computer-use-agents|Computer Use Agents (CUAs)]]
- [[gui-tool-hybrid-action-space|GUI-Tool Hybrid Action Space]]
- [[optimal-gui-tool-path-selection]]
- [[interleaved-gui-tool-trajectory-scaling]]
- [[tool-bootstrapped-rft]]
- [[tool-efficient-path-reward]]
- [[osworld-mcp]]
- [[next-state-grounding]]
- [[grpo]]
- [[agent-computer-interface]]

@@ -0,0 +1,39 @@
---
title: "KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Controls"
authors:
- Kailin Jiang
- Hongbo Jiang
- Ning Jiang
- Zhi Gao
- Jinhe Bi
- Yuchen Ren
- Bin Li
- Yuntao Du
- Lei Liu
- Qing Li
date: 2026
arxiv: "2510.19316"
venue: "ICML 2026"
domain: "Multimodal Learning, Knowledge Injection, Continual Learning"
type: paper
source: "https://arxiv.org/abs/2510.19316"
---
# KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Controls
**Authors**: Kailin Jiang, Hongbo Jiang, Ning Jiang, Zhi Gao, Jinhe Bi, Yuchen Ren, Bin Li, Yuntao Du, Lei Liu, Qing Li
**Venue**: ICML 2026
**arXiv**: 2510.19316
## Abstract
KORE is a synergistic method centered around Knowledge-Oriented Controls for injecting new knowledge into LMMs while preserving old knowledge. It implements a two-stage optimization: (1) KORE-AUGMENTATION converts individual knowledge items into structured multi-round dialogues and instruction tasks, building a "knowledge tree" that enables internalization; (2) KORE-CONSTRAINT stores previous knowledge in the covariance matrix of linear layer activations and initializes a LoRA adapter by projecting original weights into the matrix's null space, defining a fine-tuning direction that minimally interferes with previous knowledge.
## Key Contributions
1. **KORE-AUGMENTATION**: Structured knowledge augmentation pipeline — multi-round dialogues (trunk) + instruction tasks (branches) = knowledge tree
2. **KORE-CONSTRAINT**: Null space projection via covariance matrix SVD — freezes adapter A in null space, fine-tunes only B
3. **HARS metric**: Harmonized Adaptation-Retention Score for unified evaluation
4. **State-of-the-art**: Outperforms 9 baselines on EVOKE benchmark across LLaVA-v1.5 (7B/13B) and Qwen2.5-VL (7B)
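
Why a null-space constraint limits interference can be seen in a few lines. The sketch below is a simplification and not KORE's procedure: it swaps the covariance-matrix SVD for Gram-Schmidt so it runs without dependencies, and it assumes the old activations span a strict subspace so that an exact null space exists. Function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def project_to_null_space(row, old_activations):
    """Remove from `row` every component lying in the old activation span."""
    for u in orthonormal_basis(old_activations):
        c = dot(row, u)
        row = [ri - c * ui for ri, ui in zip(row, u)]
    return row

# Old-knowledge activations (made-up) occupy the first two axes of R^3.
old_activations = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
a_row = project_to_null_space([1.0, 2.0, 3.0], old_activations)
# Every old activation x now satisfies a_row . x == 0, so an update of the
# form B @ A leaves the layer's output on old inputs unchanged.
```

With A held fixed in that subspace, whatever B learns cannot change the response to inputs resembling the stored activations, which is the intuition behind training only B.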

@@ -0,0 +1,28 @@
---
title: "AutoHarness: improving LLM agents by automatically synthesizing a code harness"
created: 2026-05-29
type: paper-raw
arxiv: "2603.03329"
authors: ["Xinghua Lou", "Miguel Lázaro-Gredilla", "Antoine Dedieu", "Carter Wendelken", "Wolfgang Lehrach", "Kevin P. Murphy"]
venue: "arXiv preprint (cs.CL), February 2026"
affiliation: "Google DeepMind"
tags: ["agent", "code-synthesis", "game-playing", "harness", "LLM"]
---
# AutoHarness: improving LLM agents by automatically synthesizing a code harness
**Authors:** Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy
**Affiliation:** Google DeepMind
**arXiv:** [2603.03329](https://arxiv.org/abs/2603.03329) (v1, 10 February 2026)
**Category:** cs.CL (Computation and Language)
## Abstract
Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games.
## Key Contributions
1. **Code-as-Harness framework**: LLM synthesizes its own harness — transforms agent from LLM+hand-coded-plumbing to LLM+auto-generated-code
2. **Thompson Sampling tree search**: structured exploration of code harness space
3. **Three harness modes**: action-filter, action-verifier, and code-as-policy (zero LLM at inference)
4. **100% legal moves** across 145 TextArena games; Flash+Harness outperforms Pro
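
A hand-written verify-and-retry wrapper gives a feel for what such a harness does, loosely in the spirit of the action-verifier mode. In the paper the harness code is synthesized by the LLM itself; here it is written by hand for a tic-tac-toe board, and `propose` is a stub standing in for a model call. All names are hypothetical.

```python
def legal_moves(board):
    """Indices of empty cells on a 9-cell tic-tac-toe board."""
    return [i for i, cell in enumerate(board) if cell == " "]

def harnessed_move(propose, board, max_retries=3):
    """Ask the policy for a move, rejecting illegal ones with feedback."""
    legal = legal_moves(board)
    if not legal:
        raise ValueError("no legal moves available")
    feedback = None
    for _ in range(max_retries):
        move = propose(board, feedback)
        if move in legal:
            return move
        feedback = f"move {move!r} is illegal; legal moves are {legal}"
    return legal[0]  # fallback so the harness never emits an illegal move

def make_stub_model(replies):
    """Stand-in for an LLM: returns canned replies in order."""
    it = iter(replies)
    def propose(board, feedback):
        return next(it)
    return propose

board = list("XO" + " " * 7)
move = harnessed_move(make_stub_model([0, 4]), board)  # first reply is illegal
```

The fallback branch is what guarantees legality regardless of model behaviour; the retry loop only improves the quality of the chosen move.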

@@ -0,0 +1,28 @@
---
title: "Efficient Pre-Training with Token Superposition"
created: 2026-05-29
type: paper-raw
arxiv: "2605.06546"
authors: ["Bowen Peng", "Théo Gigant", "Jeffrey Quesnelle"]
venue: "arXiv preprint (cs.CL), v2, May 2026"
affiliation: "Nous Research"
tags: ["pre-training", "efficiency", "token-superposition", "LLM"]
---
# Efficient Pre-Training with Token Superposition
**Authors:** Bowen Peng*, Théo Gigant*, Jeffrey Quesnelle (* equal contribution)
**Affiliation:** Nous Research
**arXiv:** [2605.06546](https://arxiv.org/abs/2605.06546) (v2, 19 May 2026)
**Category:** cs.CL (Computation and Language)
## Abstract
Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
## Key Contributions
1. **Token-Superposition Training (TST)**: A two-phase drop-in method that increases token throughput s× per FLOP without modifying model architecture
2. **Multi-hot Cross-Entropy (MCE)**: Novel loss function for predicting bags of tokens simultaneously
3. **Representation Alignment Hypothesis**: Shared embeddings across phases are critical — re-initialization destroys gains
4. **Extensive scaling validation**: 270M→600M→3B→10B A1B MoE, with 2.5× speedup at largest scale
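
One plausible reading of the superposition phase is sketched below. The abstract does not give the exact loss, so the averaging over bag members is an assumption, as are the function names; `s` is the bag size.

```python
import math

def make_bags(tokens, s):
    """Group a token sequence into contiguous bags of size s."""
    return [tokens[i:i + s] for i in range(0, len(tokens), s)]

def multi_hot_cross_entropy(logits, bag):
    """Mean negative log-likelihood of every token in the bag.

    Equivalent to cross-entropy against a target that spreads its mass
    uniformly over the bag's tokens (assumed form, not from the paper).
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(logits[t] - log_z for t in bag) / len(bag)

bags = make_bags([1, 2, 3, 4, 5, 6], s=3)   # sequence length shrinks by s
loss = multi_hot_cross_entropy([0.0, 0.0, 0.0, 0.0], bag=[0, 1])
```

With a bag of size one the function reduces to ordinary next-token cross-entropy, which is consistent with the recovery phase reverting to standard training.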

@@ -0,0 +1,27 @@
---
title: "Pre-train Space Reinforcement Learning: From P(y|x) to P(y)"
arxiv: "2604.14142"
authors: ["Yuqiao Tan", "Minzheng Wang", "Bo Liu", "Zichen Liu", "Tian Liang", "Shizhu He", "Jun Zhao", "Kang Liu"]
venue: "arXiv preprint"
date: "2026-04-15"
type: paper
tags: ["reinforcement-learning", "pre-training", "LLM", "reasoning", "GRPO"]
---
# Pre-train Space Reinforcement Learning
> **arXiv**: [2604.14142](https://arxiv.org/abs/2604.14142)
> **Authors**: Yuqiao Tan¹²*, Minzheng Wang¹²*, Bo Liu³, Zichen Liu³, Tian Liang⁴, Shizhu He¹²†, Jun Zhao¹², Kang Liu¹²
> **Affiliations**: ¹ CASIA, ² UCAS, ³ NUS, ⁴ Tencent AI Lab
> * Equal contribution, † Corresponding author
## Abstract
While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89× and 6.54×, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization.
## Key Claims
1. **Gradient Alignment**: <∇log P(y), ∇log P(y|x)> ≥ 0 for all samples (empirically validated), confirming PreRL as a viable surrogate for standard RL
2. **NSR > PSR in Pre-train Space**: Negative Sample Reinforcement (suppressing incorrect paths) is far more effective than Positive Sample Reinforcement in the pre-train space
3. **DSRL outperforms GRPO**: Dual Space RL achieves +2-5 point improvement on benchmarks like AIME24/25, with 1.6×-2.5× sample efficiency
4. **NSR-PreRL stimulates endogenous reasoning**: 14.89× more transition thoughts, 6.54× more reflection thoughts
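
The pruning effect of negative-sample reinforcement is easy to see on a bare softmax policy. The sketch below is a toy over three candidate answers, not PreRL's update on P(y); the learning rate and logits are arbitrary assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsr_step(logits, bad_action, lr=1.0):
    """Gradient step that lowers log pi(bad_action).

    d log pi(a) / d logit_j = 1[j == a] - pi_j, so we move against it.
    """
    probs = softmax(logits)
    return [z - lr * ((1.0 if j == bad_action else 0.0) - p)
            for j, (z, p) in enumerate(zip(logits, probs))]

logits = [1.0, 0.5, 0.0]
before = softmax(logits)
after = softmax(nsr_step(logits, bad_action=0))  # answer 0 was judged wrong
```

The penalized answer loses probability and the remaining answers all gain, with the larger logit boosts going to answers the policy already favoured, so exploration over the alternatives is kept rather than collapsed onto one target.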

@@ -0,0 +1,40 @@
---
title: "When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations"
authors:
- Kailin Jiang
- Yuntao Du
- Yukai Ding
- Yuchen Ren
- Zhi Gao
- Zilong Zheng
- Ning Jiang
- Lei Liu
- Bin Li
- Qing Li
date: 2026
arxiv: "2505.24449"
venue: "ICLR 2026"
domain: "Multimodal Learning, Knowledge Injection, Continual Learning"
type: paper
source: "https://arxiv.org/abs/2505.24449"
---
# When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
**Authors**: Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Zhi Gao, Zilong Zheng, Ning Jiang, Lei Liu, Bin Li, Qing Li
**Venue**: ICLR 2026
**arXiv**: 2505.24449
## Abstract
Large Multimodal Models (LMMs) store vast amounts of pretrained knowledge but struggle to remain aligned with real-world updates, making it difficult to avoid capability degradation when acquiring evolving knowledge. Furthermore, most current work focuses on exploring static textual knowledge injection, neglecting dynamic multimodal evolving knowledge injection. To address this, the authors propose MMEVOKE, a benchmark for evaluating LMMs' ability in multimodal evolving knowledge injection, containing 9,422 samples spanning 159 subtypes. Through extensive experiments, they reveal challenges such as poor injection performance and capability degradation, and introduce knowledge augmentation and knowledge retention methods to address these challenges.
## Key Contributions
1. **MMEVOKE Benchmark**: First multimodal evolving knowledge injection benchmark with self-evolving data construction pipeline
2. **Dual Challenge Identification**: Poor knowledge adaptation AND capability degradation after injection
3. **Knowledge-Aware Augmentation**: Demonstrates semantic augmentation strengthens adaptation while surface-level augmentation is detrimental
4. **Retention Methods**: Data Replay and MoELoRA effectively mitigate degradation; EWC/LwF fail
5. **Sufficient Context Paradox**: Even with all necessary information, LMMs still produce incorrect answers

@@ -0,0 +1,28 @@
---
title: "SkillOpt: Executive Strategy for Self-Evolving Agent Skills"
created: 2026-05-29
type: paper-raw
arxiv: "2605.23904"
authors: ["Yifan Yang", "Ziyang Gong", "Weiquan Huang", "Qihao Yang", "Ziwei Zhou", "Zisu Huang", "Yan Li", "Xuemei Gao", "Qi Dai", "Bei Liu", "Kai Qiu", "Yuqing Yang", "Dongdong Chen", "Xue Yang", "Chong Luo"]
venue: "arXiv preprint (cs.AI), v2, May 2026"
affiliation: "Microsoft, Shanghai Jiao Tong University, Tongji University, Fudan University"
tags: ["agent", "skill", "optimization", "text-space", "self-evolving"]
---
# SkillOpt: Executive Strategy for Self-Evolving Agent Skills
**Authors:** Yifan Yang*, Ziyang Gong*, Weiquan Huang*, Qihao Yang*, Ziwei Zhou*, Zisu Huang*, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo (* equal contribution)
**Affiliation:** Microsoft, SJTU, Tongji, Fudan
**arXiv:** [2605.23904](https://arxiv.org/abs/2605.23904) (v2, 25 May 2026)
**Code:** https://aka.ms/SkillOpt
## Abstract
Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision—none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent. SkillOpt is the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses, SkillOpt is best or tied on all 52 evaluated cells and beats every per-cell competitor. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside Codex, and by +19.1 inside Claude Code.
## Key Contributions
1. **Text-space optimizer**: First systematic optimizer for agent skills with deep-learning-style controls (learning rate, validation gate, momentum)
2. **52/52 best/tied**: Across 6 benchmarks × 7 models × 3 harnesses
3. **Cross-domain transfer**: Skills trained on one model/harness/benchmark transfer positively to others
4. **Compact artifacts**: 300 to 2,000 tokens after 14 accepted edits
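
The validation gate described in the abstract can be sketched as a simple accept/reject loop. This is an illustration under stated assumptions, not SkillOpt's implementation: the edits come from a fixed list rather than an optimizer model, the "learning-rate budget" is approximated by a character cap, and the scoring function is a toy keyword count. All names are hypothetical.

```python
def apply_edit(lines, edit):
    """Apply one (op, index, text) edit; ops are add, delete, replace."""
    op, index, text = edit
    out = list(lines)
    if op == "add":
        out.insert(index, text)
    elif op == "delete":
        del out[index]
    elif op == "replace":
        out[index] = text
    else:
        raise ValueError(f"unknown op: {op}")
    return out

def train_skill(skill, proposed_edits, val_score, max_edit_chars=80):
    """Keep an edit only if it strictly improves the held-out score."""
    best = val_score(skill)
    rejected = []  # stands in for the rejected-edit buffer
    for edit in proposed_edits:
        if len(edit[2]) > max_edit_chars:  # crude edit-size budget
            rejected.append(edit)
            continue
        candidate = apply_edit(skill, edit)
        score = val_score(candidate)
        if score > best:
            skill, best = candidate, score
        else:
            rejected.append(edit)
    return skill, best, rejected

def toy_val_score(lines):
    """Toy validation metric: how many target behaviours are mentioned."""
    text = " ".join(lines).lower()
    return sum(kw in text for kw in ("verify", "cite"))

edits = [
    ("add", 1, "Always verify tool outputs."),
    ("replace", 0, "Use tools!"),   # no score gain, so it is rejected
    ("add", 2, "Cite sources."),
]
skill, best, rejected = train_skill(["Use tools."], edits, toy_val_score)
```

Because acceptance requires a strict improvement, the skill document can never score worse on the validation set than its starting point, which is the stability property the abstract emphasizes.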

@@ -0,0 +1,27 @@
---
title: "Symbolic Learning Enables Self-Evolving Agents"
created: 2026-05-29
type: paper-raw
arxiv: "2406.18532"
authors: ["Wangchunshu Zhou", "Yixin Ou", "Shengwei Ding", "Long Li", "Jialong Wu", "Tiannan Wang", "Jiamin Chen", "Shuai Wang", "Xiaohua Xu", "Ningyu Zhang", "Huajun Chen", "Yuchen Eleanor Jiang"]
venue: "arXiv preprint (cs.CL), June 2024"
affiliation: "AIWaves Inc."
tags: ["agent", "symbolic-learning", "self-evolving", "optimization"]
---
# Symbolic Learning Enables Self-Evolving Agents
**Authors:** Zhou et al. (AIWaves, 2024)
**arXiv:** [2406.18532](https://arxiv.org/abs/2406.18532)
**Code:** https://github.com/aiwaves-cn/agents
## Abstract
The AI community has been exploring a pathway to AGI by developing "language agents". A fundamental limitation is that current agent research is model-centric/engineering-centric — progress requires substantial manual engineering. Agent symbolic learning introduces a systematic framework that enables language agents to optimize themselves in a data-centric way using symbolic optimizers. Agents are considered as symbolic networks where learnable weights are defined by prompts, tools, and pipeline structure.
## Key Contributions
1. **Agent as Symbolic Network**: Pipeline = computation graph, Nodes = layers, Prompts/Tools = weights
2. **Symbolic Back-Propagation**: Language Loss propagated backward through the pipeline → Language Gradients for each node
3. **Holistic Joint Optimization**: All symbolic components are optimized together, avoiding local optima
4. **Self-Evolving**: The Language Loss does not require ground truth, enabling learning after deployment