---
title: "Symbolic Regression"
created: 2026-04-16
updated: 2026-04-17
type: concept
tags: [optimization, training, model]
sources: [raw/papers/odrzywolek-eml-universal-operator-2026.md]
---
# Symbolic Regression
**Symbolic regression** is a machine learning technique that discovers explicit mathematical expressions from data, rather than fitting fixed-form models. Unlike traditional regression (which optimizes parameters within a predetermined functional form), symbolic regression searches the space of possible equation structures.
## Core Problem
Given data points (xᵢ, yᵢ), find a closed-form expression f such that y ≈ f(x), where f is composed of elementary operations and functions.
**Key Distinction:**
- Traditional regression: y = β₀ + β₁x + β₂x² (form fixed, optimize β)
- Symbolic regression: discover the form itself, e.g. recover y = sin(2πx) · e^(-x²) from data (search over structure and constants)
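The distinction can be made concrete with a deliberately tiny sketch. It assumes a hand-written dictionary of candidate structures (`CANDIDATES`, a hypothetical name) and simply picks the one with the lowest mean squared error. Real symbolic regression systems generate candidates rather than enumerate a fixed list, so this only illustrates the idea of searching over forms instead of coefficients.

```python
import math

# Hypothetical candidate structures; a real system would generate these
# rather than hard-code them.
CANDIDATES = {
    "x": lambda x: x,
    "x**2": lambda x: x ** 2,
    "sin(2*pi*x)": lambda x: math.sin(2 * math.pi * x),
    "exp(-x**2)": lambda x: math.exp(-x ** 2),
    "sin(2*pi*x) * exp(-x**2)": lambda x: math.sin(2 * math.pi * x) * math.exp(-x ** 2),
}


def mse(f, xs, ys):
    """Mean squared error of candidate f on the data."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def best_structure(xs, ys):
    """Return the name of the candidate with the lowest error."""
    return min(CANDIDATES, key=lambda name: mse(CANDIDATES[name], xs, ys))


# Noise-free samples of the target from the example above.
xs = [i / 10 for i in range(-20, 21)]
ys = [math.sin(2 * math.pi * x) * math.exp(-x ** 2) for x in xs]

best = best_structure(xs, ys)
print(best)
```

On noise-free data the true structure reproduces the samples exactly, so it wins the comparison. With noise, several candidates may score similarly, which is why complexity penalties matter.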
## Traditional Approaches
### Genetic Programming
The dominant approach historically:
- **Representation**: Expression trees with heterogeneous nodes (+, -, ×, ÷, sin, exp, etc.)
- **Search**: Evolutionary algorithms (mutations, crossovers)
- **Fitness**: Mean squared error or complexity-penalized metrics
- **Tools**: Eureqa, gplearn, PySR
**Limitations:**
- Discrete search space: the number of candidate trees grows combinatorially with expression size
- Slow convergence for complex expressions
- No gradient information to guide the search over structures
- Sensitive to hyperparameter choices
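The representation, fitness, and mutation ideas above can be sketched in a few lines. This is a minimal illustration, not how Eureqa, gplearn, or PySR are implemented: the nested-tuple encoding, the operator set, the `penalty` weight, and all function names are assumptions made for the example, and crossover and the selection loop are left out.

```python
import math
import random

# Trees are nested tuples (an assumed encoding):
#   ("x",), ("const", c), (unary_op, child), (binary_op, left, right)
UNARY = {"sin": math.sin, "exp": math.exp}
BINARY = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def evaluate(tree, x):
    """Recursively evaluate an expression tree at x."""
    op = tree[0]
    if op == "x":
        return x
    if op == "const":
        return tree[1]
    if op in UNARY:
        return UNARY[op](evaluate(tree[1], x))
    return BINARY[op](evaluate(tree[1], x), evaluate(tree[2], x))


def size(tree):
    """Number of nodes, used here as the complexity measure."""
    if tree[0] in ("x", "const"):
        return 1
    return 1 + sum(size(child) for child in tree[1:])


def fitness(tree, xs, ys, penalty=0.01):
    """Complexity-penalized MSE (lower is better); inf if evaluation blows up."""
    try:
        err = sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    except (OverflowError, ValueError):
        return math.inf
    return err + penalty * size(tree)


def random_tree(rng, depth):
    """Grow a random tree of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        return ("x",) if rng.random() < 0.5 else ("const", rng.uniform(-2, 2))
    if rng.random() < 0.5:
        return (rng.choice(sorted(UNARY)), random_tree(rng, depth - 1))
    return (
        rng.choice(sorted(BINARY)),
        random_tree(rng, depth - 1),
        random_tree(rng, depth - 1),
    )


def mutate(tree, rng, depth=2):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if tree[0] in ("x", "const") or rng.random() < 0.3:
        return random_tree(rng, depth)
    children = list(tree[1:])
    i = rng.randrange(len(children))
    children[i] = mutate(children[i], rng, depth)
    return (tree[0], *children)


# The tree for 2 * sin(x), scored against data generated from the same formula.
target = ("mul", ("const", 2.0), ("sin", ("x",)))
xs = [i / 10 for i in range(-30, 31)]
ys = [2.0 * math.sin(x) for x in xs]
print(size(target), fitness(target, xs, ys))
print(mutate(target, random.Random(0)))
```

Because every mutation swaps in a discrete subtree, nearby trees can have very different errors. That is the practical meaning of "no gradient information" in the list above.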
### Sparse Regression (SINDy)
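- Assumes the target is a sparse linear combination of functions drawn from a library of candidates
- Selects terms with sparse optimization, such as LASSO or sequentially thresholded least squares
- Faster than genetic programming, but can only find expressions that are linear in the library functions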
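A minimal sketch of the sparse-library idea follows. It swaps LASSO for sequentially thresholded least squares, because that can be written in pure Python, and it fits a plain regression target rather than the time derivatives SINDy is normally applied to. The library contents, the `threshold` value, and all function names are assumptions for illustration.

```python
import math

# Hypothetical library of candidate functions.
LIBRARY = {
    "1": lambda x: 1.0,
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "sin(x)": math.sin,
}


def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def least_squares(columns, ys):
    """Least-squares coefficients via the normal equations (adequate for tiny problems)."""
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(p * y for p, y in zip(ci, ys)) for ci in columns]
    return solve(gram, rhs)


def sparse_fit(xs, ys, threshold=0.1, iterations=5):
    """Fit, drop terms with small coefficients, and refit on the survivors."""
    active = list(LIBRARY)
    coeffs = {}
    for _ in range(iterations):
        columns = [[LIBRARY[name](x) for x in xs] for name in active]
        coeffs = dict(zip(active, least_squares(columns, ys)))
        kept = [name for name in active if abs(coeffs[name]) >= threshold]
        if kept == active:
            break
        active = kept
        if not active:
            return {}
    return {name: c for name, c in coeffs.items() if abs(c) >= threshold}


# Noise-free data from y = 1.5 x^2 - 0.8 sin(x).
xs = [i / 10 for i in range(-30, 31)]
ys = [1.5 * x * x - 0.8 * math.sin(x) for x in xs]
model = sparse_fit(xs, ys)
print(model)
```

On this data the fit should keep only `x^2` and `sin(x)`, with coefficients close to 1.5 and -0.8. A target such as sin(2x) would not be recoverable unless `sin(2x)` were itself in the library, which is the limitation noted above.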
## Gradient-Based Approaches
Recent work makes the search itself differentiable, so expressions can be fitted by gradient descent:
### EML Trees (2026)
[[eml-operator|Odrzywołek's EML representation]] enables gradient-based optimization:
- Uniform tree structure (all nodes are `eml` operators)
- Fully differentiable
- Optimizable with standard deep learning optimizers (Adam)
- Can recover exact closed forms for shallow trees (depth ≤ 4)
### Neural Symbolic Methods
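- **AI Feynman**: Fits a neural network to the data, then tests it for properties such as symmetry and separability to split the problem into simpler ones
- **SymbolicGPT**: Transformer-based generation of expressions
- **Deep Symbolic Regression**: A recurrent neural network generates expression trees and is trained with reinforcement learning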
## Evaluation Metrics
1. **Accuracy**: R², MSE, NMSE on held-out data
2. **Complexity**: Number of nodes, operators, or description length
3. **Pareto Frontier**: Trade-off between accuracy and simplicity
4. **Exact Recovery**: Whether the true underlying formula is found
5. **Generalization**: Performance on out-of-distribution data
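The accuracy metrics and the Pareto frontier can be written down directly. The sketch below assumes NMSE means MSE divided by the variance of the targets (other normalizations are in use), and the `models` list holds made-up complexity and error values purely to exercise the code; they are not results from any system.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def variance(y):
    """Population variance of the targets."""
    mean = sum(y) / len(y)
    return sum((t - mean) ** 2 for t in y) / len(y)


def nmse(y_true, y_pred):
    """MSE normalized by target variance (one common convention, assumed here)."""
    return mse(y_true, y_pred) / variance(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination; equals 1 - NMSE under the convention above."""
    return 1.0 - nmse(y_true, y_pred)


def pareto_front(models):
    """Keep (name, complexity, error) entries that no other entry dominates.

    An entry is dominated if another is at least as good on both axes
    and strictly better on one.
    """
    front = []
    for name, c, e in models:
        dominated = any(
            c2 <= c and e2 <= e and (c2 < c or e2 < e)
            for _, c2, e2 in models
        )
        if not dominated:
            front.append((name, c, e))
    return sorted(front, key=lambda m: m[1])


# Made-up (expression, node count, held-out error) triples for illustration.
models = [
    ("x", 1, 0.40),
    ("2*x", 3, 0.10),
    ("sin(x) + x", 4, 0.12),
    ("2*x + 0.1*x**3", 9, 0.02),
]
front = pareto_front(models)
print([name for name, _, _ in front])
```

Here `sin(x) + x` drops out because `2*x` is both simpler and more accurate. The remaining entries each offer a different accuracy-simplicity trade-off, and choosing among them is left to the user.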
## Applications
| Domain | Example |
|--------|---------|
| Physics | Discovering force laws, equations of state |
| Chemistry | Reaction kinetics, structure-property relationships |
| Biology | Population dynamics, gene regulatory networks |
| Engineering | System identification, control laws |
| Finance | Discovering pricing formulas, risk models |
## Challenges
1. **Scalability**: Exponential growth of expression space with size
2. **Noise Sensitivity**: Overfitting to data noise
3. **Non-uniqueness**: Multiple expressions may fit data equally well
4. **Dimensional Analysis**: Incorporating physical units/constraints
5. **Interpretability**: Balancing accuracy with human-understandable forms
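Non-uniqueness is easy to reproduce. In the sketch below, sin(x) and its cubic Taylor polynomial are nearly indistinguishable on a narrow sampling range, so either could be returned as "the" formula, yet they disagree strongly outside that range. The ranges and the helper names `taylor` and `max_gap` are arbitrary choices for illustration.

```python
import math


def taylor(x):
    """Cubic Taylor polynomial of sin(x) around 0."""
    return x - x ** 3 / 6


def max_gap(xs):
    """Largest absolute difference between sin(x) and its cubic approximation."""
    return max(abs(math.sin(x) - taylor(x)) for x in xs)


train = [i / 100 for i in range(-50, 51)]  # samples in [-0.5, 0.5]
far = [2.0 + i / 10 for i in range(11)]    # samples in [2.0, 3.0]

print(max_gap(train))  # small: both forms fit the sampled range
print(max_gap(far))    # large: they diverge out of distribution
```

With even modest measurement noise the in-range gap would be invisible, which is why out-of-distribution checks and physical constraints are needed to separate such candidates.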
## Future Directions
- Integration with large language models for prior knowledge
- Physics-informed constraints (conservation laws, symmetries)
- Multi-objective optimization (accuracy, simplicity, generalization)
- Real-time/online symbolic regression
- Human-in-the-loop discovery workflows
## Related Concepts
- [[eml-operator]]: A universal operator enabling gradient-based symbolic regression
- [[andrzej-odrzywolek]]: Researcher who discovered the EML universal operator
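- [[computerized-adaptive-testing]]: Dynamic item selection in CAT and adaptive search in symbolic regression are structurally similar in how they handle the exploration-exploitation trade-off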