20260420:first commit

2026-04-20 11:42:41 +08:00
commit dd8345a6ea
45 changed files with 2366 additions and 0 deletions
--- a/concepts/computerized-adaptive-testing.md
+++ b/concepts/computerized-adaptive-testing.md
@@ -0,0 +1,120 @@
+---
+title: Computerized Adaptive Testing (CAT)
+created: 2026-04-17
+updated: 2026-04-17
+type: concept
+tags: [machine-learning, benchmark]
+sources: [raw/papers/zhuang-catsurvey-ml-2024.md]
+---
+
+# Computerized Adaptive Testing (CAT)
+
+## Definition
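+Computerized Adaptive Testing (CAT) is a dynamic assessment paradigm: the system adapts the difficulty of each subsequent item to the examinee's responses so far, aiming to estimate individual ability with high precision from as few items as possible. Compared with a traditional fixed-form test, CAT typically needs fewer items and measures more precisely.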
+
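+## Core Components
+
+A CAT system consists of four key modules:
+
+### 1. Measurement Models
+- **Traditional approach:** Item Response Theory (IRT), a family of probabilistic models in which the probability of a correct response is an S-shaped function of examinee ability, positioned by item parameters such as difficulty
+- **ML approaches:** neural networks, Deep Knowledge Tracing, and representation-learning-based measurement models, which can capture more complex item-ability interactions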
+
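+### 2. Question Selection Algorithms
+- **Classical strategies:** Maximum Fisher Information (MFI), Maximum Posterior Weighted Information (MPWI)
+- **ML strategies:** reinforcement-learning-based selection, multi-armed bandits, Deep Q-Networks (DQN); these treat selection as multi-objective optimization over information gain, exposure-rate control, and content balance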
+
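+### 3. Question Bank Construction
+- Item calibration, parameter estimation, and item quality monitoring
+- ML methods can support automatic item generation, difficulty prediction, and item-similarity clustering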
+
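+### 4. Test Control
+- Stopping criteria: fixed length vs. precision threshold
+- Content-balance constraints, item exposure-rate control, fairness constraints
+- ML methods: learned stopping rules, constraint-satisfaction optimization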
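+
+As a rough worked example of how the two stopping rules relate, a precision threshold can be converted into a minimum test length. This sketch assumes a 2PL model and uses two relations stated later in this note: $SE = 1/\sqrt{I}$ and a per-item information ceiling of $a^2/4$. The target $SE \le 0.3$ and the common discrimination $a = 1.5$ are hypothetical values chosen for illustration, not figures from the source.
+
+$$SE \le 0.3 \;\Rightarrow\; I \ge \frac{1}{0.3^2} \approx 11.11$$
+$$\frac{a^2}{4} = \frac{1.5^2}{4} = 0.5625 \;\Rightarrow\; n \ge \frac{11.11}{0.5625} \approx 19.75, \text{ so at least } 20 \text{ items}$$
+
+This is a best case: an item only contributes $a^2/4$ when its difficulty matches the current ability exactly, so a real test under these assumptions would likely need more than 20 items.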
+
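+## Application Areas
+- **Educational assessment:** K-12 standardized exams, language proficiency tests, graduate admissions tests (GRE, GMAT)
+- **Medical assessment:** symptom screening scales, mental health assessment
+- **Sports science:** athlete ability grading
+- **Sociological research:** attitude and value scales
+- **AI model evaluation:** adaptive benchmarking, which adjusts test difficulty dynamically according to model performance (related to evaluation settings such as [[symbolic-regression]])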
+
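+## Paradigm Shift from the ML Perspective
+
+Traditional CAT relies on psychometric and statistical assumptions (for example, IRT's local independence and unidimensionality assumptions). As large-scale testing scenarios grow more complex, machine learning opens up new possibilities:
+
+| Dimension | Traditional psychometrics | Machine learning methods |
+|------|--------------|-------------|
+| Modeling assumptions | Strong (unidimensionality, local independence) | Weak, data-driven |
+| Scalability | Suited to small and medium item banks | Scales naturally to large item banks |
+| Expressiveness | Linear / logistic (log-odds) | Nonlinear, high-dimensional interactions |
+| Interpretability | High (parameters have clear meanings) | Lower (black-box risk) |
+| Fairness | Mature DIF (differential item functioning) detection | Still developing |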
+
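+## Mathematical Form of IRT
+
+Item Response Theory is the core mathematical engine of traditional CAT.
+
+### Core Notation
+- Examinee ability: $\theta \in \mathbb{R}$
+- Parameters of item $i$: $\psi_i = (a_i, b_i, c_i)$ (discrimination, difficulty, guessing)
+- Response: $u_i \in \{0, 1\}$ (1 = correct)
+- ICC (Item Characteristic Curve): $P_i(\theta) = P(u_i = 1 \mid \theta, \psi_i)$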
+
+### Model Hierarchy
+
+**1PL (Rasch Model):**
+$$P_i(\theta) = \frac{1}{1 + e^{-(\theta - b_i)}}$$
+This model has only the difficulty parameter $b_i$. When $\theta = b_i$, $P_i = 0.5$.
+
+**2PL (most common in CAT):**
+$$P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}$$
+The discrimination $a_i > 0$ controls the slope of the curve. Derivative: $\frac{dP_i}{d\theta} = a_i P_i(1 - P_i)$, which reaches its maximum $a_i / 4$ at $\theta = b_i$.
+
+**3PL (with guessing):**
+$$P_i(\theta) = c_i + (1 - c_i) \frac{1}{1 + e^{-a_i(\theta - b_i)}}$$
+The guessing probability is $c_i \in [0,1]$. As $\theta \to -\infty$, $P_i \to c_i$.
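+
+To make the hierarchy concrete, the following worked example evaluates all three models at the same point. The values $\theta = 1$, $b_i = 0$, $a_i = 2$, $c_i = 0.2$ are hypothetical and chosen only for illustration.
+
+$$\text{1PL: } P_i(1) = \frac{1}{1 + e^{-1}} \approx 0.7311$$
+$$\text{2PL: } P_i(1) = \frac{1}{1 + e^{-2}} \approx 0.8808$$
+$$\text{3PL: } P_i(1) = 0.2 + 0.8 \times 0.8808 \approx 0.9046$$
+
+Under the 2PL model the slope at this point is $a_i P_i (1 - P_i) \approx 2 \times 0.8808 \times 0.1192 \approx 0.210$, well below the maximum $a_i / 4 = 0.5$ reached at $\theta = b_i$.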
+
+### Fisher Information and Item Selection
+
+Fisher information of item $i$:
+$$I_i(\theta) = \frac{[\partial P_i / \partial \theta]^2}{P_i(1 - P_i)} = a_i^2 P_i(\theta)(1 - P_i(\theta)) \quad (\text{2PL})$$
+
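+- Information is maximal at $\theta = b_i$: $I_i = a_i^2 / 4$
+- When $\theta \gg b_i$ or $\theta \ll b_i$, $I_i \to 0$
+
+**CAT item selection:** $i^* = \arg\max_{i} I_i(\hat{\theta}_{\text{current}})$, where the maximum is taken over items not yet administered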
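+
+The 2PL expression follows by substituting $\partial P_i / \partial \theta = a_i P_i (1 - P_i)$ into the general formula:
+
+$$I_i(\theta) = \frac{a_i^2 P_i^2 (1 - P_i)^2}{P_i (1 - P_i)} = a_i^2 P_i (1 - P_i)$$
+
+As a small illustration of the selection rule, assume a current estimate $\hat{\theta} = 0.5$ and three hypothetical candidate items with $(a, b)$ equal to $(1.0, 0.5)$, $(2.0, 1.5)$, and $(1.5, 0.0)$:
+
+$$I_1 = 1.0^2 \times 0.5 \times 0.5 = 0.25$$
+$$I_2 = 2.0^2 \times 0.1192 \times 0.8808 \approx 0.420$$
+$$I_3 = 1.5^2 \times 0.6792 \times 0.3208 \approx 0.490$$
+
+Item 3 would be selected. Item 1 matches the current ability exactly but loses because of its low discrimination, which shows that maximum-information selection weighs difficulty match against discrimination.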
+
+### Ability Estimation
+
+**Log-likelihood** (over the $t$ items administered so far):
+$$\ell(\theta) = \sum_{j=1}^{t} \left[ u_j \ln P_j(\theta) + (1 - u_j) \ln(1 - P_j(\theta)) \right]$$
+
+**Newton-Raphson iteration:**
+$$\theta^{(k+1)} = \theta^{(k)} + \frac{\ell'(\theta^{(k)})}{I(\theta^{(k)})}, \quad I(\theta) = \sum_{j=1}^t I_j(\theta)$$
+
+**Standard error:** $SE(\hat{\theta}) = 1 / \sqrt{I(\hat{\theta})}$
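+
+A single update step can be traced by hand. The sketch below assumes the 2PL model, for which the score function is $\ell'(\theta) = \sum_j a_j (u_j - P_j(\theta))$ and $-\ell''(\theta)$ equals $I(\theta)$, so the update above is an exact Newton-Raphson step (under 3PL, dividing by $I(\theta)$ gives Fisher scoring instead). The two items and responses are hypothetical: $a_1 = a_2 = 1$, $b_1 = 0$, $b_2 = 1$, $u_1 = 1$, $u_2 = 0$, starting from $\theta^{(0)} = 0$.
+
+$$P_1(0) = 0.5, \quad P_2(0) = \frac{1}{1 + e^{1}} \approx 0.2689$$
+$$\ell'(0) = (1 - 0.5) + (0 - 0.2689) \approx 0.2311$$
+$$I(0) = 0.5 \times 0.5 + 0.2689 \times 0.7311 \approx 0.4466$$
+$$\theta^{(1)} = 0 + \frac{0.2311}{0.4466} \approx 0.517$$
+
+By symmetry the maximum-likelihood estimate is $\hat{\theta} = 0.5$ (where $P_1 + P_2 = 1$), so one step already lands close to it. There $I(\hat{\theta}) \approx 0.470$ and $SE(\hat{\theta}) \approx 1 / \sqrt{0.470} \approx 1.46$, which illustrates why two items are far from enough for a precise estimate.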
+
+### Multidimensional IRT (MIRT)
+
+$$P_i(\boldsymbol{\theta}) = \frac{1}{1 + e^{-(\mathbf{a}_i^\top \boldsymbol{\theta} - d_i)}}, \quad \boldsymbol{\theta} \in \mathbb{R}^D$$
+
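+This corresponds to multidimensional adaptive testing (MAT), where item selection maximizes a scalar function (determinant or trace) of the multidimensional information matrix.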
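+
+For this model the single-item information matrix is $\mathbf{I}_i(\boldsymbol{\theta}) = P_i (1 - P_i)\, \mathbf{a}_i \mathbf{a}_i^\top$, which has rank one, so the determinant criterion is applied to the matrix accumulated over administered items plus the candidate. The toy case below is a sketch with hypothetical values: $D = 2$, $\hat{\boldsymbol{\theta}} = (0, 0)$, all $d_i = 0$ (so every $P_i (1 - P_i) = 0.25$), one administered item with $\mathbf{a} = (1, 0)^\top$, and candidates A with $\mathbf{a} = (2, 0)^\top$ and B with $\mathbf{a} = (0, 1)^\top$.
+
+$$\mathbf{M}_A = \begin{pmatrix} 0.25 + 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad \det \mathbf{M}_A = 0, \quad \operatorname{tr} \mathbf{M}_A = 1.25$$
+$$\mathbf{M}_B = \begin{pmatrix} 0.25 & 0 \\ 0 & 0.25 \end{pmatrix}, \quad \det \mathbf{M}_B = 0.0625, \quad \operatorname{tr} \mathbf{M}_B = 0.5$$
+
+The determinant criterion prefers B, which covers the dimension not yet measured, while the trace criterion prefers A, which adds more total information along an already-measured dimension. The choice of scalar function therefore changes which item is selected.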
+
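+## Open Problems and Challenges
+1. **Fairness and bias:** adaptive algorithms may amplify group biases present in historical data
+2. **Interpretability:** the opacity of deep learning models vs. the transparency of psychometric models
+3. **Cold start:** initial parameter estimation for new items and new examinees
+4. **Security:** item bank leakage risk, adversarial attacks
+5. **Cross-modal assessment:** how to integrate multimodal data such as text, images, and interactions
+6. **LLM evaluation:** how to assess large language model capabilities with the CAT paradigm (adaptive benchmarking)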
+
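+## Related Concepts
+
+- [[cramer-rao-lower-bound]]: the CRLB sets the theoretical lower bound on the variance of CAT ability estimates; CAT item selection is essentially maximizing Fisher information to approach that bound quickly
+- [[symbolic-regression]]: adaptive search strategies in symbolic regression are structurally similar to CAT item selection in their dynamic exploration-exploitation trade-off
+- [[knowledge-bank]]: adaptive assessment systems need structured knowledge and item bank management, which shares design ideas with knowledge management systems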
+
+## Key References
+- Zhuang et al. (2024/2026). *Survey of Computerized Adaptive Testing: A Machine Learning Perspective*. arXiv:2404.00712v4. Accepted by IEEE TPAMI 2026.