
# CL-bench: A Benchmark for Context Learning

## Metadata
- **Title**: CL-bench: A Benchmark for Context Learning
- **Authors**: Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang et al. (27 authors from Fudan University & Tencent Hunyuan)
- **arXiv ID**: 2602.03587v1 [cs.CL]
- **Date**: 2026-02-03
- **Size**: 78 pages, 17 figures
- **URL**: https://arxiv.org/abs/2602.03587

## Abstract

Current language models excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what is learned during pre-training to reason and resolve tasks. We term this capability **context learning**, a crucial ability that humans naturally possess but that has been largely overlooked.

To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context, ranging from new domain-specific knowledge, rule systems, and complex procedures to laws derived from empirical data, all of which are absent from pre-training.

This goes far beyond long-context tasks that primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns via instructions and demonstrations. Our evaluations of ten frontier LMs find that models solve only 17.2% of tasks on average. Even the best-performing model, GPT-5.1, solves only 23.7%, revealing that LMs have yet to achieve effective context learning.

## Key Statistics
- 500 contexts, 1,899 tasks, 31,607 verification rubrics
- 4 context categories → 18 subcategories
- An average of ~20 hours of expert effort per context
- Contamination-free design (fictional creation, modification, niche content)
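
The headline counts imply a few per-unit averages. The following is a minimal sketch of that arithmetic, using only the figures listed above. The variable names are illustrative, and the total-hours value is a rough estimate because the per-context effort is itself approximate.

```python
# Counts as listed in the key statistics above.
contexts = 500
tasks = 1_899
rubrics = 31_607
hours_per_context = 20  # approximate ("~20 hours"), so the total is only an estimate

tasks_per_context = tasks / contexts
rubrics_per_task = rubrics / tasks
rubrics_per_context = rubrics / contexts
total_expert_hours = contexts * hours_per_context

print(f"tasks per context:    {tasks_per_context:.2f}")
print(f"rubrics per task:     {rubrics_per_task:.2f}")
print(f"rubrics per context:  {rubrics_per_context:.2f}")
print(f"expert hours (rough): {total_expert_hours}")
```

On these numbers each context carries roughly four tasks, and each task is checked against roughly 17 rubrics.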

## Four Context Categories
1. **Domain Knowledge Reasoning** (7 subcategories): Finance, Healthcare, Humanities, Legal Advisory, Lifestyle, Management, Science
2. **Rule System Application** (5 subcategories): Game Mechanics, Mathematical Formalism, Programming Syntax, Legal & Regulatory, Technical Standards
3. **Procedural Task Execution** (3 subcategories): Instructional, Operational, Workflow Orchestration
4. **Empirical Discovery & Simulation** (3 subcategories): Experimental Data, Observational Data, Simulation Environment
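
The taxonomy above can be written down as a small data structure, which also makes it easy to confirm that the four categories add up to the 18 subcategories stated in the key statistics. This is only an illustrative sketch: the `TAXONOMY` name and the dictionary layout are assumptions of this note, not something defined by the paper.

```python
# Category -> subcategories, copied from the list above.
# The dictionary layout is illustrative, not the benchmark's own schema.
TAXONOMY = {
    "Domain Knowledge Reasoning": [
        "Finance", "Healthcare", "Humanities", "Legal Advisory",
        "Lifestyle", "Management", "Science",
    ],
    "Rule System Application": [
        "Game Mechanics", "Mathematical Formalism", "Programming Syntax",
        "Legal & Regulatory", "Technical Standards",
    ],
    "Procedural Task Execution": [
        "Instructional", "Operational", "Workflow Orchestration",
    ],
    "Empirical Discovery & Simulation": [
        "Experimental Data", "Observational Data", "Simulation Environment",
    ],
}

# Per-category counts and the overall total.
counts = {category: len(subs) for category, subs in TAXONOMY.items()}
total_subcategories = sum(counts.values())

for category, n in counts.items():
    print(f"{category}: {n}")
print(f"total subcategories: {total_subcategories}")
```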

## Evaluated Models (Top 10)
GPT-5.1, Claude Opus 4.5, GPT-5.2, o3, Kimi K2, HY 2.0, Gemini 3 Pro, Qwen 3 Max, Doubao 1.6, DeepSeek V3.2

## Key Findings
1. Context learning is a fundamental bottleneck: the best model, GPT-5.1, solves only 23.7% of tasks, and the ten-model average is 17.2%
2. Performance varies dramatically across categories (Domain Knowledge: 25.3% vs Empirical Discovery: ~11%)
3. Mathematical formalism is the hardest subcategory (<15% for most models)
4. Legal & regulatory subcategory surprisingly tractable (>40% for GPT-5.1)
5. Task difficulty is NOT correlated with context length; reasoning quality matters more
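
To give the percentages a sense of scale, the solve rates can be converted into approximate task counts. This is a rough sketch that assumes the rates are computed over all 1,899 tasks, which this summary does not state explicitly. The 17.2% average comes from the abstract, and the variable names are illustrative.

```python
# Figures from the abstract and key statistics.
total_tasks = 1_899
best_rate = 0.237     # GPT-5.1
average_rate = 0.172  # mean over the ten evaluated models

# Assumption: solve rate = solved tasks / all 1,899 tasks.
best_solved = round(best_rate * total_tasks)
average_solved = round(average_rate * total_tasks)
best_unsolved = total_tasks - best_solved

print(f"best model solves about {best_solved} tasks")
print(f"average model solves about {average_solved} tasks")
print(f"best model leaves about {best_unsolved} tasks unsolved")
```

Under that assumption, even the best model leaves well over a thousand tasks unsolved.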