20260420:first commit
concepts/symbolic-regression.md
@@ -0,0 +1,100 @@
---
title: "Symbolic Regression"
created: 2026-04-16
updated: 2026-04-17
type: concept
tags: [optimization, training, model]
sources: [raw/papers/odrzywolek-eml-universal-operator-2026.md]
---

# Symbolic Regression

**Symbolic regression** is a machine learning technique that discovers explicit mathematical expressions from data, rather than fitting fixed-form models. Unlike traditional regression (which optimizes parameters within a predetermined functional form), symbolic regression searches the space of possible equation structures.

## Core Problem

Given data points (xᵢ, yᵢ), find a closed-form expression f such that y ≈ f(x), where f is composed of elementary operations and functions.

**Key Distinction:**
- Traditional regression: y = β₀ + β₁x + β₂x² (form fixed, optimize β)
- Symbolic regression: Discover that y = sin(2πx) · e^(-x²) from data

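The distinction above can be made concrete with a small sketch. It is illustrative only: the data are sampled from the example formula, the fixed-form model is a straight line rather than the quadratic shown above (to keep the closed-form fit short), and the `candidates` dictionary is a hypothetical, hand-picked stand-in for a real structure search.

```python
import math

def target(x):
    # The example formula from the text: y = sin(2*pi*x) * exp(-x^2)
    return math.sin(2 * math.pi * x) * math.exp(-x * x)

xs = [i / 50 for i in range(-100, 101)]  # 201 points on [-2, 2]
ys = [target(x) for x in xs]

def mse(f):
    """Mean squared error of a candidate function on the sampled data."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Traditional regression: the form y = b0 + b1*x is fixed; only b0, b1 are fitted
# (ordinary least squares in closed form).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
b0 = mean_y - b1 * mean_x
linear_mse = mse(lambda x: b0 + b1 * x)

# Symbolic regression: the structure itself is searched. Here the "search" is
# just a scan over a tiny hypothetical library of candidate structures.
candidates = {
    "sin(2*pi*x)": lambda x: math.sin(2 * math.pi * x),
    "exp(-x^2)": lambda x: math.exp(-x * x),
    "x * exp(-x^2)": lambda x: x * math.exp(-x * x),
    "sin(2*pi*x) * exp(-x^2)": lambda x: math.sin(2 * math.pi * x) * math.exp(-x * x),
}
best_name = min(candidates, key=lambda name: mse(candidates[name]))

print("fixed linear form, MSE:", linear_mse)
print("best structure:", best_name, "MSE:", mse(candidates[best_name]))
```

No choice of β can make the fixed form match this target, whereas a search over structures can land on the generating expression if it is reachable in the search space.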
## Traditional Approaches

### Genetic Programming

The dominant approach historically:
- **Representation**: Expression trees with heterogeneous nodes (+, -, ×, ÷, sin, exp, etc.)
- **Search**: Evolutionary algorithms (mutations, crossovers)
- **Fitness**: Mean squared error or complexity-penalized metrics
- **Tools**: Eureqa, gplearn, PySR

**Limitations:**
- Discrete search space (combinatorial explosion)
- Slow convergence for complex expressions
- No gradient information to guide the search
- Sensitive to hyperparameter choices

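The representation, search, and fitness components listed above can be sketched in a few dozen lines. This is a toy under stated assumptions, not how Eureqa, gplearn, or PySR work internally: trees are nested tuples, the operator set is reduced to `+`, `-`, `*`, and `sin`, only mutation is used (crossover is omitted for brevity), and names such as `evolve`, `MAX_DEPTH`, and the penalty weight are made up for illustration.

```python
import math
import random

BINARY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
UNARY = {"sin": math.sin}
MAX_DEPTH = 5  # hypothetical cap that keeps trees small and values finite

def random_tree(rng, depth):
    """Grow a random expression tree: a leaf ('x' or a constant) or an operator node."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else round(rng.uniform(-2, 2), 2)
    if rng.random() < 0.25:
        return ("sin", random_tree(rng, depth - 1))
    op = rng.choice(list(BINARY))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, (int, float)):
        return tree
    if tree[0] in UNARY:
        return UNARY[tree[0]](evaluate(tree[1], x))
    return BINARY[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def size(tree):
    return 1 if not isinstance(tree, tuple) else 1 + sum(size(c) for c in tree[1:])

def depth(tree):
    return 0 if not isinstance(tree, tuple) else 1 + max(depth(c) for c in tree[1:])

def mutate(rng, tree):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, 2)
    i = rng.randrange(1, len(tree))
    return tree[:i] + (mutate(rng, tree[i]),) + tree[i + 1:]

def fitness(tree, xs, ys, penalty=0.001):
    """Complexity-penalized mean squared error (lower is better)."""
    mse = sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + penalty * size(tree)

def evolve(xs, ys, generations=60, pop_size=30, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        children = [mutate(rng, rng.choice(pop)) for _ in range(pop_size)]
        children = [c for c in children if depth(c) <= MAX_DEPTH]
        # Parents and children compete, so the best fitness never gets worse.
        pop = sorted(pop + children, key=lambda t: fitness(t, xs, ys))[:pop_size]
        history.append(fitness(pop[0], xs, ys))
    return pop[0], history

xs = [i / 10 for i in range(-10, 11)]
ys = [x * x + x for x in xs]
best, history = evolve(xs, ys)
print("best tree:", best, "fitness:", history[-1])
```

Even this toy shows the limitations listed above: the search moves only by discrete random edits, uses no gradient information, and its behaviour depends on the population size, mutation probabilities, and depth cap.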
### Sparse Regression (SINDy)

- SINDy (Sparse Identification of Nonlinear Dynamics) assumes the target is a sparse linear combination of terms from a library of candidate functions
- Uses LASSO/sparse optimization to select the active terms
- Faster than genetic programming, but limited to linear combinations of the chosen basis functions

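A minimal sketch of the sparse-regression idea, assuming a hypothetical four-term library `[1, x, x^2, x^3]` and a LASSO objective `(1/2n)·||y − Φw||² + λ·||w||₁` solved by coordinate descent. This illustrates the general technique only; it is not the reference SINDy algorithm (which is typically applied to time derivatives of a dynamical system), and the function names are made up for this example.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; exactly zero inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(features, ys, lam, sweeps=100):
    """Minimize (1/2n)*||y - Phi w||^2 + lam*||w||_1 one coefficient at a time."""
    n, p = len(ys), len(features[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the residual that ignores j
            norm = 0.0  # squared norm of feature j
            for row, y in zip(features, ys):
                pred_without_j = sum(row[k] * w[k] for k in range(p) if k != j)
                rho += row[j] * (y - pred_without_j)
                norm += row[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (norm / n)
    return w

# Hypothetical candidate library: 1, x, x^2, x^3
names = ["1", "x", "x^2", "x^3"]
xs = [i / 20 for i in range(-20, 21)]
features = [[1.0, x, x ** 2, x ** 3] for x in xs]
ys = [2.0 * x for x in xs]  # ground truth uses a single library term

w = lasso_coordinate_descent(features, ys, lam=0.01)
active = [name for name, coef in zip(names, w) if coef != 0.0]
print("coefficients:", w)
print("active terms:", active)
```

The L1 penalty sets the unused coefficients to exactly zero and slightly shrinks the surviving one, which is why sparse-regression pipelines commonly refit the selected terms without the penalty afterwards. The result can only ever be a linear combination of library columns, which is the limitation noted above.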
## Gradient-Based Approaches

Recent work enables differentiable symbolic regression:

### EML Trees (2026)

[[eml-universal-operator|Odrzywołek's EML representation]] enables gradient-based optimization:
- Uniform tree structure (all nodes are `eml` operators)
- Fully differentiable
- Optimizable with standard deep learning optimizers (Adam)
- Can recover exact closed forms at shallow depths (≤4)

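The sketch below illustrates why a uniform, differentiable node type makes gradient-based fitting possible. It rests on an assumption not stated on this page: that the operator is defined as `eml(x, y) = exp(x) − ln(y)` (see [[eml-universal-operator]] for the authoritative definition). Under that assumption `exp` and `ln` can both be written as small `eml` trees, and a leaf parameter can be fitted by following the gradient. Plain gradient descent is used here in place of Adam to keep the example dependency-free, and `fit_leaf` is a made-up helper, not the paper's training procedure.

```python
import math

def eml(x, y):
    # ASSUMPTION: eml(x, y) = exp(x) - ln(y). This page does not define the operator.
    return math.exp(x) - math.log(y)

def exp_via_eml(x):
    # exp(x) = eml(x, 1), because ln(1) = 0
    return eml(x, 1.0)

def ln_via_eml(x):
    # eml(1, x) = e - ln(x); exponentiating gives e^e / x; then e - ln(e^e / x) = ln(x)
    return eml(1.0, eml(eml(1.0, x), 1.0))

def fit_leaf(xs, ys, a=0.0, lr=0.1, steps=500):
    """Fit the leaf parameter a in the one-node tree eml(a*x, 1) by gradient descent."""
    n = len(xs)
    for _ in range(steps):
        grad = 0.0
        for x, y in zip(xs, ys):
            pred = eml(a * x, 1.0)
            # d(pred)/da = x * exp(a*x) = x * pred, since eml(u, 1) = exp(u)
            grad += 2.0 * (pred - y) * x * pred / n
        a -= lr * grad
    return a

xs = [i / 20 for i in range(21)]        # 21 points on [0, 1]
ys = [math.exp(0.5 * x) for x in xs]    # data generated with a = 0.5
a_hat = fit_leaf(xs, ys)

print("exp(1.3) via eml:", exp_via_eml(1.3), "math.exp:", math.exp(1.3))
print("ln(2.0) via eml:", ln_via_eml(2.0), "math.log:", math.log(2.0))
print("fitted leaf parameter:", a_hat)
```

Because every node is the same smooth function, the derivative of the loss with respect to any leaf follows from the chain rule. That is the property the bullet points above rely on, in contrast to the discrete edits of genetic programming.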
### Neural Symbolic Methods

- **AI Feynman**: Combines neural network fitting with tests for symbolic properties such as symmetry and separability
- **SymbolicGPT**: Transformer-based generation of expressions
- **Deep Symbolic Regression**: A recurrent neural network generates expression trees and is trained with reinforcement learning

## Evaluation Metrics

1. **Accuracy**: R², MSE, NMSE on held-out data
2. **Complexity**: Number of nodes, operators, or description length
3. **Pareto Frontier**: Trade-off between accuracy and simplicity
4. **Exact Recovery**: Whether the true underlying formula is found
5. **Generalization**: Performance on out-of-distribution data

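The accuracy metrics and the Pareto frontier from the list above can be computed as follows. This is a sketch under one stated convention: NMSE is taken as MSE divided by the population variance of the targets, so that R² = 1 − NMSE. Other normalizations are in use. The model names and scores in the usage example are invented for illustration.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def nmse(y_true, y_pred):
    """MSE normalized by the (population) variance of the targets."""
    mean = sum(y_true) / len(y_true)
    variance = sum((t - mean) ** 2 for t in y_true) / len(y_true)
    return mse(y_true, y_pred) / variance

def r2(y_true, y_pred):
    """Coefficient of determination; equals 1 - NMSE under the convention above."""
    return 1.0 - nmse(y_true, y_pred)

def pareto_front(models):
    """Keep models not dominated in (complexity, error); lower is better for both.

    models: list of (name, complexity, error) tuples.
    """
    front = []
    for name, c, e in models:
        dominated = any(
            c2 <= c and e2 <= e and (c2 < c or e2 < e)
            for _, c2, e2 in models
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical held-out targets and predictions
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print("MSE:", mse(y_true, y_pred), "NMSE:", nmse(y_true, y_pred), "R2:", r2(y_true, y_pred))

# Hypothetical candidates: (expression, node count, held-out error)
models = [("x", 1, 0.5), ("x + x*x", 5, 0.0), ("sin(x)", 2, 0.6), ("x*x*x + x", 7, 0.0)]
print("Pareto front:", pareto_front(models))
```

In the usage example, `sin(x)` is dominated by `x` (simpler and more accurate) and `x*x*x + x` by `x + x*x` (equally accurate but simpler), so only two candidates remain on the frontier.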
## Applications

| Domain | Example |
|--------|---------|
| Physics | Discovering force laws, equations of state |
| Chemistry | Reaction kinetics, structure-property relationships |
| Biology | Population dynamics, gene regulatory networks |
| Engineering | System identification, control laws |
| Finance | Discovering pricing formulas, risk models |

## Challenges

1. **Scalability**: Exponential growth of expression space with size
2. **Noise Sensitivity**: Overfitting to data noise
3. **Non-uniqueness**: Multiple expressions may fit data equally well
4. **Dimensional Analysis**: Incorporating physical units/constraints
5. **Interpretability**: Balancing accuracy with human-understandable forms

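The non-uniqueness problem in the list above is easy to reproduce numerically. The sketch below uses two textbook cases chosen for illustration: a pair of algebraically equivalent forms, and a pair of different functions that are nearly indistinguishable on a narrow sampling range but diverge outside it.

```python
import math

xs = [i / 100 for i in range(-10, 11)]  # 21 samples on [-0.1, 0.1]

# Case 1: different-looking but equivalent forms, cos(2x) = 1 - 2*sin(x)^2.
gap_identity = max(abs(math.cos(2 * x) - (1 - 2 * math.sin(x) ** 2)) for x in xs)

# Case 2: genuinely different functions that agree closely on the sampled range.
gap_local = max(abs(math.sin(x) - x) for x in xs)

# Outside the sampled range the two candidates from case 2 disagree strongly.
gap_far = abs(math.sin(3.0) - 3.0)

print("equivalent forms, max gap on data:", gap_identity)
print("sin(x) vs x, max gap on data:", gap_local)
print("sin(x) vs x, gap at x = 3:", gap_far)
```

With even modest measurement noise, `sin(x)` and `x` cannot be told apart from this sample, which is one reason complexity penalties and out-of-distribution checks appear among the evaluation metrics.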
## Future Directions

- Integration with large language models for prior knowledge
- Physics-informed constraints (conservation laws, symmetries)
- Multi-objective optimization (accuracy, simplicity, generalization)
- Real-time/online symbolic regression
- Human-in-the-loop discovery workflows

## Related Concepts

- [[eml-universal-operator]]: A universal operator enabling gradient-based symbolic regression
- [[andrzej-odrzywolek]]: Researcher who introduced the EML universal operator
- [[computerized-adaptive-testing]]: The dynamic item-selection strategy in CAT is structurally similar to adaptive search in symbolic regression: both face an exploration-exploitation trade-off