20260420:first commit

2026-04-20 11:42:41 +08:00
commit dd8345a6ea
45 changed files with 2366 additions and 0 deletions
--- a/concepts/symbolic-regression.md
+++ b/concepts/symbolic-regression.md
@@ -0,0 +1,100 @@
+---
+title: "Symbolic Regression"
+created: 2026-04-16
+updated: 2026-04-17
+type: concept
+tags: [optimization, training, model]
+sources: [raw/papers/odrzywolek-eml-universal-operator-2026.md]
+---
+
+# Symbolic Regression
+
+**Symbolic regression** is a machine learning technique that discovers explicit mathematical expressions from data, rather than fitting fixed-form models. Unlike traditional regression (which optimizes parameters within a predetermined functional form), symbolic regression searches the space of possible equation structures.
+
+## Core Problem
+
+Given data points (xᵢ, yᵢ), find a closed-form expression f such that y ≈ f(x), where f is composed of elementary operations and functions.
+
+**Key Distinction:**
+- Traditional regression: y = β₀ + β₁x + β₂x² (form fixed, optimize β)
+- Symbolic regression: Discover that y = sin(2πx) · e^(-x²) from data
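+
+One common way to state this search formally, offered here as a sketch rather than a definition taken from the source, is a penalized fit over a space of expressions. The symbols $\mathcal{F}$ (the set of expressions buildable from the allowed operations), $C(f)$ (a complexity measure such as node count), and $\lambda$ (a trade-off weight) are assumptions introduced for illustration:
+
+$$
+\hat{f} = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 + \lambda \, C(f)
+$$
+
+Under this view, traditional regression is the special case where $\mathcal{F}$ holds a single functional form and only its parameters β vary.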
+
+## Traditional Approaches
+
+### Genetic Programming
+
+The dominant approach historically:
+- **Representation**: Expression trees with heterogeneous nodes (+, -, ×, ÷, sin, exp, etc.)
+- **Search**: Evolutionary algorithms (mutations, crossovers)
+- **Fitness**: Mean squared error or complexity-penalized metrics
+- **Tools**: Eureqa, gplearn, PySR
+
+**Limitations:**
+- Discrete search space (combinatorial explosion)
+- Slow convergence for complex expressions
+- No gradient information
+- Sensitive to hyperparameter choices (population size, mutation and crossover rates)
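+
+The genetic-programming loop described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not how Eureqa, gplearn, or PySR are implemented: the primitive set, the nested-tuple tree encoding, the `penalty` weight, the `max_size` bloat cap, and every function name are hypothetical.
+
+```python
+import math
+import random
+
+# Hypothetical primitive set: operators allowed as internal tree nodes.
+BINARY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
+UNARY = {"sin": math.sin}
+
+def random_tree(rng, depth=3):
+    """Grow a random expression as nested tuples; leaves are 'x' or a constant."""
+    if depth == 0 or rng.random() < 0.3:
+        return "x" if rng.random() < 0.5 else round(rng.uniform(-2, 2), 2)
+    if rng.random() < 0.25:
+        return (rng.choice(list(UNARY)), random_tree(rng, depth - 1))
+    return (rng.choice(list(BINARY)), random_tree(rng, depth - 1), random_tree(rng, depth - 1))
+
+def evaluate(tree, x):
+    if tree == "x":
+        return x
+    if isinstance(tree, (int, float)):
+        return tree
+    if len(tree) == 2:
+        return UNARY[tree[0]](evaluate(tree[1], x))
+    return BINARY[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))
+
+def size(tree):
+    """Number of nodes, used as the complexity measure."""
+    if not isinstance(tree, tuple):
+        return 1
+    return 1 + sum(size(child) for child in tree[1:])
+
+def fitness(tree, xs, ys, penalty=0.01):
+    """Complexity-penalized mean squared error (lower is better)."""
+    try:
+        mse = sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
+    except (OverflowError, ValueError):
+        return float("inf")
+    return float("inf") if math.isnan(mse) else mse + penalty * size(tree)
+
+def mutate(tree, rng):
+    """Replace a randomly chosen subtree with a fresh random subtree."""
+    if not isinstance(tree, tuple) or rng.random() < 0.3:
+        return random_tree(rng, depth=2)
+    i = rng.randrange(1, len(tree))
+    return tree[:i] + (mutate(tree[i], rng),) + tree[i + 1:]
+
+def random_subtree(tree, rng):
+    while isinstance(tree, tuple) and rng.random() < 0.7:
+        tree = rng.choice(tree[1:])
+    return tree
+
+def crossover(a, b, rng):
+    """Graft a random subtree of b into a random position of a."""
+    if not isinstance(a, tuple) or rng.random() < 0.3:
+        return random_subtree(b, rng)
+    i = rng.randrange(1, len(a))
+    return a[:i] + (crossover(a[i], b, rng),) + a[i + 1:]
+
+def evolve(xs, ys, pop_size=200, generations=30, max_size=40, seed=0):
+    rng = random.Random(seed)
+    population = [random_tree(rng) for _ in range(pop_size)]
+    for _ in range(generations):
+        population.sort(key=lambda t: fitness(t, xs, ys))
+        survivors = population[: pop_size // 4]  # elitism: keep the best quarter
+        children = []
+        while len(survivors) + len(children) < pop_size:
+            if rng.random() < 0.5:
+                child = mutate(rng.choice(survivors), rng)
+            else:
+                child = crossover(rng.choice(survivors), rng.choice(survivors), rng)
+            if size(child) <= max_size:  # crude bloat control
+                children.append(child)
+        population = survivors + children
+    return min(population, key=lambda t: fitness(t, xs, ys))
+
+# Synthetic target y = x^2 + x, sampled on [-1, 1].
+xs = [i / 10 for i in range(-10, 11)]
+ys = [x * x + x for x in xs]
+best = evolve(xs, ys)
+print(best, fitness(best, xs, ys))
+```
+
+Each fitness evaluation is a discrete trial with no gradient to follow, which is why the limitations listed above apply: the search only improves by sampling and selecting new tree structures.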
+
+### Sparse Regression (SINDy)
+
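+- Assumes the target is a sparse linear combination of terms drawn from a library of candidate functions
+- Fits coefficients with sparsity-promoting regression (sequentially thresholded least squares or LASSO)
+- Faster than genetic programming, but can only express linear combinations of the library's basis functions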
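+
+The sparse-regression idea can be illustrated with sequentially thresholded least squares in NumPy. This is a minimal sketch on noise-free synthetic data, not the PySINDy implementation: the candidate library, the `threshold` value, and the function name `stlsq` are assumptions chosen for illustration.
+
+```python
+import numpy as np
+
+def stlsq(theta, y, threshold=0.1, max_iter=10):
+    """Sequentially thresholded least squares: fit, zero small coefficients, refit."""
+    coef = np.linalg.lstsq(theta, y, rcond=None)[0]
+    for _ in range(max_iter):
+        small = np.abs(coef) < threshold
+        coef[small] = 0.0
+        active = ~small
+        if not active.any():
+            break
+        # Refit using only the library columns that survived thresholding.
+        coef[active] = np.linalg.lstsq(theta[:, active], y, rcond=None)[0]
+    return coef
+
+# Hypothetical candidate library: 1, x, x^2, sin(x), cos(x)
+names = ["1", "x", "x^2", "sin(x)", "cos(x)"]
+x = np.linspace(-3, 3, 200)
+theta = np.column_stack([np.ones_like(x), x, x**2, np.sin(x), np.cos(x)])
+y = 2.0 * x - 0.5 * np.sin(x)  # synthetic target, noise-free
+
+coef = stlsq(theta, y)
+terms = [f"{c:.3g}*{n}" for c, n in zip(coef, names) if c != 0.0]
+print(" + ".join(terms))
+```
+
+The recovered expression can only ever be a weighted sum of the library columns, so a target such as sin(2πx) · e^(-x²) is out of reach unless that exact product is added to the library beforehand.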
+
+## Gradient-Based Approaches
+
+Recent work enables differentiable symbolic regression:
+
+### EML Trees (2026)
+
+[[eml-universal-operator|Odrzywołek's EML representation]] enables gradient-based optimization:
+- Uniform tree structure (all nodes are `eml` operators)
+- Fully differentiable
+- Optimizable with standard deep learning optimizers (Adam)
+- Can recover exact closed forms at shallow depths (≤4)
+
+### Neural Symbolic Methods
+
+- **AI Feynman**: Combines neural network fitting with symbolic property testing
+- **SymbolicGPT**: Transformer-based generation of expressions
+- **Deep Symbolic Regression**: A recurrent neural network, trained with reinforcement learning, that samples candidate expression trees
+
+## Evaluation Metrics
+
+1. **Accuracy**: R², MSE, NMSE on held-out data
+2. **Complexity**: Number of nodes, operators, or description length
+3. **Pareto Frontier**: Trade-off between accuracy and simplicity
+4. **Exact Recovery**: Whether the true underlying formula is found
+5. **Generalization**: Performance on out-of-distribution data
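+
+The accuracy and Pareto-frontier metrics above can be computed as in the sketch below. The candidate list and its error values are made up for illustration, and `nmse` uses one common normalization (MSE divided by the variance of the targets); other definitions exist.
+
+```python
+def r2(y_true, y_pred):
+    """Coefficient of determination: 1 - SS_res / SS_tot."""
+    mean = sum(y_true) / len(y_true)
+    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
+    ss_tot = sum((t - mean) ** 2 for t in y_true)
+    return 1.0 - ss_res / ss_tot
+
+def nmse(y_true, y_pred):
+    """MSE normalized by the variance of y_true (assumed convention)."""
+    return 1.0 - r2(y_true, y_pred)
+
+def pareto_front(candidates):
+    """Return candidates not dominated in (complexity, error); both are minimized."""
+    front = []
+    best_error = float("inf")
+    # Walk from simplest to most complex; keep a candidate only if it
+    # strictly improves on the error of everything simpler.
+    for name, complexity, error in sorted(candidates, key=lambda c: (c[1], c[2])):
+        if error < best_error:
+            front.append((name, complexity, error))
+            best_error = error
+    return front
+
+# Hypothetical candidates: (expression, node count, held-out MSE)
+candidates = [
+    ("x", 1, 0.90),
+    ("2*x", 3, 0.40),
+    ("x + x", 3, 0.40),
+    ("sin(x) + x", 4, 0.45),
+    ("2*x - 0.5*sin(x)", 8, 0.01),
+    ("2*x - 0.5*sin(x) + 0.001*x^3", 14, 0.01),
+]
+
+for name, complexity, error in pareto_front(candidates):
+    print(f"{complexity:>3}  {error:.2f}  {name}")
+```
+
+Reporting the whole front rather than a single winner leaves the accuracy-versus-simplicity choice to the reader, which also helps when several expressions fit the data equally well.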
+
+## Applications
+
+| Domain | Example |
+|--------|---------|
+| Physics | Discovering force laws, equations of state |
+| Chemistry | Reaction kinetics, structure-property relationships |
+| Biology | Population dynamics, gene regulatory networks |
+| Engineering | System identification, control laws |
+| Finance | Discovering pricing formulas, risk models |
+
+## Challenges
+
+1. **Scalability**: Exponential growth of expression space with size
+2. **Noise Sensitivity**: Overfitting to data noise
+3. **Non-uniqueness**: Multiple expressions may fit data equally well
+4. **Dimensional Analysis**: Incorporating physical units/constraints
+5. **Interpretability**: Balancing accuracy with human-understandable forms
+
+## Future Directions
+
+- Integration with large language models for prior knowledge
+- Physics-informed constraints (conservation laws, symmetries)
+- Multi-objective optimization (accuracy, simplicity, generalization)
+- Real-time/online symbolic regression
+- Human-in-the-loop discovery workflows
+
+## Related Concepts
+
+- [[eml-universal-operator]]: A universal operator enabling gradient-based symbolic regression
+- [[andrzej-odrzywolek]]: Researcher who discovered the EML universal operator
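+- [[computerized-adaptive-testing]]: The dynamic item-selection strategy in CAT is structurally similar to adaptive search in symbolic regression, in that both must manage an exploration-exploitation trade-off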