
# CL-BENCH LIFE: Can Language Models Learn From Real-Life Context?

## Metadata
- **Title**: CL-BENCH LIFE: Can Language Models Learn From Real-Life Context?
- **Authors**: Hunyuan Team (Tencent) & Fudan University
- **arXiv ID**: 2604.27043v1
- **Category**: cs.CL
- **Date**: 2026-04-29
- **URL**: https://arxiv.org/abs/2604.27043

## Abstract

Today's AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of the contexts they must handle also shifts. Real-life contexts are often messy, fragmented, and deeply tied to personal and social experience, such as multi-party conversations, personal archives, and behavioral traces. Yet it remains unclear whether current frontier language models can reliably learn from such contexts and solve tasks grounded in them.

To this end, we introduce CL-bench Life, a fully human-curated benchmark comprising 405 context-task pairs and 5,348 verification rubrics, covering common real-life scenarios. Solving tasks in CL-bench Life requires models to reason over complex, messy real-life contexts, calling for strong real-life context learning abilities that go far beyond those evaluated in existing benchmarks.

We evaluate ten frontier LMs and find that real-life context learning remains highly challenging: even the best-performing model achieves only 19.3% task solving rate, while the average performance across models is only 13.8%. Models still struggle to reason over contexts such as messy group chat histories and fragmented behavioral records from everyday life.

## Key Statistics
- 405 context-task pairs
- 5,348 verification rubrics
- 3 context categories, each with 3 subcategories (9 subcategories in total)
- 59.8% multi-turn interactions
- Context length range: 5.4K – 170.8K tokens (avg 19.4K)
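
The counts above imply roughly 13 rubrics per task (5,348 / 405). As a minimal sketch of how rubric verdicts could be turned into a task solving rate: this note does not say how the paper aggregates rubrics, so the strict "every rubric must pass" rule below is an assumption, and `TaskResult`, `is_solved`, and `solving_rate` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Hypothetical container: one task and its per-rubric pass/fail verdicts."""
    task_id: str
    rubric_passed: List[bool]


def is_solved(result: TaskResult) -> bool:
    # ASSUMPTION: a task counts as solved only if every rubric passes.
    # A task with no rubric verdicts is treated as unsolved.
    return len(result.rubric_passed) > 0 and all(result.rubric_passed)


def solving_rate(results: List[TaskResult]) -> float:
    # Fraction of tasks that are solved under the strict rule above.
    if not results:
        return 0.0
    return sum(is_solved(r) for r in results) / len(results)


# Average rubrics per task, from the figures listed in Key Statistics.
avg_rubrics_per_task = 5348 / 405
```

Under this strict rule a single failed rubric makes the whole task unsolved, which is one possible reason solving rates could be low even when most rubrics pass; whether the paper scores this way is not stated here.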

## Three Context Categories
1. **Communication & Social Interactions**: Private chats, group discussions, meeting transcripts, public community interactions
2. **Fragmented Information & Revisions**: Personal notes, public information streams, creation/revision histories
3. **Behavioral Records & Activity Trails**: Game logs, digital footprints, browsing streams, long-term daily activity records

## Key Findings
1. Real-life context learning is extremely challenging (best model: 19.3% task solving rate; average across models: 13.8%)
2. Poor performance is NOT simply a long-context problem: solving rate does not strongly correlate with context length
3. Reasoning mode improves performance but with diminishing returns; token efficiency varies dramatically across models
4. **Context misuse**, rather than ignoring the context, is the primary failure mode, accounting for 76-84% of errors
5. Group chat scenarios cause identity confusion and reference resolution failures
6. The self-tracking trajectories subcategory is the hardest (best model: 10.4%)
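
The second finding, that solving rate does not strongly correlate with context length, can be checked on per-task results with a simple correlation. The sketch below is only an illustration: this note does not say which statistic the authors used, so the Pearson coefficient is an assumption, and `pearson` is a hypothetical helper. It would be fed context lengths in tokens as `xs` and 0/1 solved flags as `ys`.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near zero would be consistent with the finding, while a strongly negative value would suggest that longer contexts are what drives failures. With binary solved flags this is the point-biserial correlation, which is numerically the same quantity.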

## Evaluated Models
GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Hy3 preview, Seed 2.0 Pro, Kimi K2.5, Qwen 3.5 Plus, Grok 4.20, DeepSeek V3.2 Thinking, MiniMax M2.5