| title | created | updated | type | tags | sources |
|---|---|---|---|---|---|
| Symbolic Regression | 2026-04-16 | 2026-04-17 | concept | | |
Symbolic Regression
Symbolic regression is a machine learning technique that discovers explicit mathematical expressions from data. Traditional regression optimizes parameters within a predetermined functional form; symbolic regression searches over the equation structure itself as well as its parameters.
Core Problem
Given data points (xᵢ, yᵢ), find a closed-form expression f such that y ≈ f(x), where f is composed of elementary operations and functions.
Key Distinction:
- Traditional regression: y = β₀ + β₁x + β₂x² (form fixed, optimize β)
- Symbolic regression: Discover that y = sin(2πx) · e^(-x²) from data
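The distinction can be made concrete with a minimal sketch. It assumes noise-free samples of the example expression above on an arbitrarily chosen interval [-2, 2]; the names `mse_quadratic` and `mse_candidate` are illustrative only. The quadratic model can only tune its three coefficients, while the candidate structure a symbolic search is meant to find reproduces the data exactly.

```python
import numpy as np

# Samples of the example target y = sin(2*pi*x) * exp(-x^2) (interval is an assumption).
x = np.linspace(-2.0, 2.0, 200)
y = np.sin(2 * np.pi * x) * np.exp(-x**2)

# Traditional regression: the form y = b0 + b1*x + b2*x^2 is fixed, only b is optimized.
A = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
mse_quadratic = float(np.mean((A @ beta - y) ** 2))

# Symbolic regression searches over structures; here we only score the structure
# it is supposed to discover, without performing any search.
def candidate(x):
    return np.sin(2 * np.pi * x) * np.exp(-x**2)

mse_candidate = float(np.mean((candidate(x) - y) ** 2))

print(f"quadratic fit MSE:     {mse_quadratic:.4f}")
print(f"candidate formula MSE: {mse_candidate:.4f}")
```

No choice of β closes the gap, because the error comes from the fixed form rather than from the parameter values.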
Traditional Approaches
Genetic Programming
The dominant approach historically:
- Representation: Expression trees with heterogeneous nodes (+, -, ×, ÷, sin, exp, etc.)
- Search: Evolutionary algorithms (mutations, crossovers)
- Fitness: Mean squared error or complexity-penalized metrics
- Tools: Eureqa, gplearn, PySR
Limitations:
- Discrete search space (combinatorial explosion)
- Slow convergence for complex expressions
- No gradient information
- Sensitive to hyperparameter choices (population size, mutation and crossover rates, operator set)
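The ingredients above can be sketched in a few dozen lines. This is a toy illustration, not how Eureqa, gplearn, or PySR are implemented: it assumes a tiny operator set, uses subtree mutation only (no crossover), keeps the best half of the population each generation, and scores trees by MSE plus a size penalty. All function names and default values are hypothetical.

```python
import math
import random

BINARY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
UNARY = {"sin": math.sin}
TERMINALS = ["x", 1.0, 2.0]

def random_tree(rng, depth):
    """Build a random expression tree as nested tuples: (op, child, ...) or a terminal."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    if rng.random() < 0.3:
        return (rng.choice(list(UNARY)), random_tree(rng, depth - 1))
    return (rng.choice(list(BINARY)), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, float):
        return tree
    if len(tree) == 2:
        return UNARY[tree[0]](evaluate(tree[1], x))
    return BINARY[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def size(tree):
    return 1 if not isinstance(tree, tuple) else 1 + sum(size(c) for c in tree[1:])

def mutate(rng, tree, depth=2):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, depth)
    i = rng.randrange(1, len(tree))
    return tree[:i] + (mutate(rng, tree[i], depth),) + tree[i + 1:]

def fitness(tree, data, penalty=0.01):
    """Complexity-penalized MSE; lower is better. Invalid trees score infinity."""
    try:
        mse = sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)
    except (OverflowError, ValueError):
        return math.inf
    return mse + penalty * size(tree) if math.isfinite(mse) else math.inf

def evolve(data, generations=100, pop_size=50, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(rng, 3) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, data))
        survivors = pop[: pop_size // 2]  # elitist selection: the best tree is never lost
        pop = survivors + [mutate(rng, rng.choice(survivors)) for _ in survivors]
    return min(pop, key=lambda t: fitness(t, data))

# Toy target y = x^2 + x sampled on [-1, 1].
data = [(i / 10, (i / 10) ** 2 + i / 10) for i in range(-10, 11)]
best = evolve(data)
print(best, fitness(best, data))
```

Every candidate is scored by evaluating it on the data, with no gradient to guide the next move, which is why the limitations listed above show up quickly as expressions grow.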
Sparse Regression (SINDy)
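- Assumes the target is a sparse linear combination of terms drawn from a library of candidate functions
- Solves for the coefficients with sparse optimization (sequentially thresholded least squares in the original SINDy formulation, or LASSO)
- Faster than genetic programming, but limited to expressions that are linear in the chosen basis functions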
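A minimal NumPy sketch of the idea follows, using sequentially thresholded least squares on a static regression problem. It is not the PySINDy API; the function name `stlsq`, the candidate library, the threshold, and the synthetic target are all assumptions chosen for illustration.

```python
import numpy as np

def stlsq(theta, y, threshold=0.1, iterations=10):
    """Sequentially thresholded least squares: zero small coefficients, refit the rest."""
    coef = np.linalg.lstsq(theta, y, rcond=None)[0]
    for _ in range(iterations):
        active = np.abs(coef) >= threshold
        coef[~active] = 0.0
        if not active.any():
            break
        coef[active] = np.linalg.lstsq(theta[:, active], y, rcond=None)[0]
    return coef

# Synthetic data from y = 1.5*x - 0.8*x^3 with a little noise (assumed target).
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 200)
y = 1.5 * x - 0.8 * x**3 + rng.normal(scale=1e-3, size=x.size)

# Library of candidate functions; the model is linear in these columns.
names = ["1", "x", "x^2", "x^3", "sin(x)", "cos(x)"]
theta = np.column_stack([np.ones_like(x), x, x**2, x**3, np.sin(x), np.cos(x)])

coef = stlsq(theta, y)
recovered = {n: round(float(c), 3) for n, c in zip(names, coef) if c != 0.0}
print(recovered)
```

The method can only return expressions that are sums of library columns, so a target such as sin(2πx) · e^(-x²) is out of reach unless that exact product is placed in the library beforehand.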
Gradient-Based Approaches
Recent work enables differentiable symbolic regression:
EML Trees (2026)
eml-universal-operator enables gradient-based optimization:
- Uniform tree structure (all nodes are eml operators)
- Fully differentiable
- Optimizable with standard deep learning optimizers (Adam)
- Can recover exact closed forms at shallow depths (≤4)
Neural Symbolic Methods
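- AI Feynman: Combines neural network fitting with tests for properties such as symmetry and separability, which are used to break the problem into simpler sub-problems
- SymbolicGPT: Transformer-based generation of expressions
- Deep Symbolic Regression: A recurrent neural network, trained with reinforcement learning, that generates expression trees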
Evaluation Metrics
- Accuracy: R², MSE, NMSE on held-out data
- Complexity: Number of nodes, operators, or description length
- Pareto Frontier: Trade-off between accuracy and simplicity
- Exact Recovery: Whether the true underlying formula is found
- Generalization: Performance on out-of-distribution data
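The accuracy metrics and the Pareto frontier can be computed as in the sketch below. NMSE is defined in several ways in the literature; this version assumes normalization by the variance of the targets, which makes R² equal to 1 − NMSE. The candidate tuples are made-up numbers used only to exercise the function.

```python
import numpy as np

def mse(y, y_hat):
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def nmse(y, y_hat):
    # Assumed convention: MSE divided by the (population) variance of y.
    return mse(y, y_hat) / float(np.var(y))

def r2(y, y_hat):
    return 1.0 - nmse(y, y_hat)

def pareto_front(candidates):
    """Keep the names of (name, complexity, error) candidates that no other candidate dominates."""
    front = []
    for name, c, e in candidates:
        dominated = any(
            c2 <= c and e2 <= e and (c2 < c or e2 < e)
            for _, c2, e2 in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical candidates: (expression, node count, held-out MSE).
candidates = [("a", 1, 0.5), ("b", 3, 0.1), ("c", 5, 0.1), ("d", 2, 0.6)]
print(pareto_front(candidates))
```

Here `c` is dropped because `b` reaches the same error with fewer nodes, and `d` is dropped because `a` is both simpler and more accurate.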
Applications
| Domain | Example |
|---|---|
| Physics | Discovering force laws, equations of state |
| Chemistry | Reaction kinetics, structure-property relationships |
| Biology | Population dynamics, gene regulatory networks |
| Engineering | System identification, control laws |
| Finance | Discovering pricing formulas, risk models |
Challenges
- Scalability: Exponential growth of expression space with size
- Noise Sensitivity: Overfitting to data noise
- Non-uniqueness: Multiple expressions may fit data equally well
- Dimensional Analysis: Incorporating physical units/constraints
- Interpretability: Balancing accuracy with human-understandable forms
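Non-uniqueness and noise sensitivity are easy to reproduce. The sketch below assumes noisy samples of sin(x) on the narrow interval [-0.5, 0.5] and compares the true expression with its cubic Taylor polynomial; the interval, noise level, and function names are illustrative choices.

```python
import numpy as np

def f_true(x):
    return np.sin(x)

def f_alt(x):
    return x - x**3 / 6  # cubic Taylor polynomial of sin(x)

def rmse(f, x, y):
    return float(np.sqrt(np.mean((f(x) - y) ** 2)))

rng = np.random.default_rng(0)
x_train = np.linspace(-0.5, 0.5, 50)
y_train = f_true(x_train) + rng.normal(scale=0.01, size=x_train.size)

# In-sample, the two expressions are indistinguishable at this noise level.
in_true = rmse(f_true, x_train, y_train)
in_alt = rmse(f_alt, x_train, y_train)

# Out of distribution, they disagree strongly.
x_far = np.linspace(2.0, 3.0, 50)
gap_far = float(np.max(np.abs(f_true(x_far) - f_alt(x_far))))

print(f"train RMSE  sin(x): {in_true:.4f}   x - x^3/6: {in_alt:.4f}")
print(f"max disagreement on [2, 3]: {gap_far:.2f}")
```

On the training interval the two formulas differ by less than the noise, so the data alone cannot choose between them; a complexity penalty, a physical constraint, or out-of-distribution test data has to break the tie.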
Future Directions
- Integration with large language models for prior knowledge
- Physics-informed constraints (conservation laws, symmetries)
- Multi-objective optimization (accuracy, simplicity, generalization)
- Real-time/online symbolic regression
- Human-in-the-loop discovery workflows
Related Concepts
- eml-universal-operator: A universal operator enabling gradient-based symbolic regression
- andrzej-odrzywolek: Researcher who discovered the EML universal operator
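- computerized-adaptive-testing: The dynamic item-selection strategy in CAT is structurally similar to adaptive search in symbolic regression in how it handles the exploration-exploitation trade-off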