SidneyZhang/myWiki

Sidney Zhang dd8345a6ea

20260420:first commit

2026-04-20 11:42:41 +08:00

Symbolic Regression

Symbolic regression is a machine learning technique that discovers explicit mathematical expressions from data, rather than fitting fixed-form models. Unlike traditional regression (which optimizes parameters within a predetermined functional form), symbolic regression searches the space of possible equation structures.

Core Problem

Given data points (xᵢ, yᵢ), find a closed-form expression f such that yᵢ ≈ f(xᵢ) for all i, where f is composed of elementary operations and functions.

Key Distinction:

Traditional regression: y = β₀ + β₁x + β₂x² (form fixed, optimize β)
Symbolic regression: Discover that y = sin(2πx) · e^(-x²) from data
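
To make the distinction concrete, here is a minimal sketch using NumPy. It fits a fixed quadratic form to data generated from the example expression above and compares it against the exact structure. The sampling range and the name `discovered` are assumptions made for illustration; no search is performed, the block only shows what a successful search would return.

```python
import numpy as np

# Data generated from the example target y = sin(2*pi*x) * exp(-x**2).
x = np.linspace(-2.0, 2.0, 200)
y = np.sin(2 * np.pi * x) * np.exp(-x**2)

# Traditional regression: the form (a quadratic) is fixed, only beta is optimized.
beta = np.polyfit(x, y, deg=2)
mse_fixed_form = np.mean((np.polyval(beta, x) - y) ** 2)

# Symbolic regression aims to return the structure itself. Here the true
# expression is simply written down (hypothetical result of a search).
def discovered(x):
    return np.sin(2 * np.pi * x) * np.exp(-x**2)

mse_symbolic = np.mean((discovered(x) - y) ** 2)
print(f"quadratic MSE: {mse_fixed_form:.4f}, exact-form MSE: {mse_symbolic:.4f}")
```

No choice of β can make the quadratic match an oscillating target, which is why the structure itself has to be part of the search.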

Traditional Approaches

Genetic Programming

The dominant approach historically:

Representation: Expression trees with heterogeneous nodes (+, -, ×, ÷, sin, exp, etc.)
Search: Evolutionary algorithms (mutations, crossovers)
Fitness: Mean squared error or complexity-penalized metrics
Tools: Eureqa, gplearn, PySR

Limitations:

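Discrete search space (the number of candidate trees grows combinatorially with tree size)
Slow convergence for complex expressions
No gradient information to guide the search over structures
Sensitive to hyperparameters (e.g., population size, mutation and crossover rates)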
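
The representation, search, and fitness components listed above can be sketched in a few dozen lines. This is a deliberately simplified illustration, not how Eureqa, gplearn, or PySR work internally: it uses mutation only (no crossover, no population), a small hypothetical operator set `OPS`, and a complexity penalty and size cap chosen arbitrarily.

```python
import math
import random

# Hypothetical operator set for this sketch: name -> (arity, function).
OPS = {
    "add": (2, lambda a, b: a + b),
    "sub": (2, lambda a, b: a - b),
    "mul": (2, lambda a, b: a * b),
    "sin": (1, math.sin),
}

def random_tree(rng, depth):
    """Grow a random expression tree; leaves are the variable "x" or a constant."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else round(rng.uniform(-2, 2), 1)
    name = rng.choice(list(OPS))
    arity = OPS[name][0]
    return (name,) + tuple(random_tree(rng, depth - 1) for _ in range(arity))

def evaluate(tree, x):
    """Recursively evaluate a tree (nested tuples) at the point x."""
    if isinstance(tree, tuple):
        fn = OPS[tree[0]][1]
        return fn(*(evaluate(child, x) for child in tree[1:]))
    return x if tree == "x" else tree

def size(tree):
    """Number of nodes, used as a simple complexity measure."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1

def mutate(rng, tree):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, 2)
    i = rng.randrange(1, len(tree))
    return tree[:i] + (mutate(rng, tree[i]),) + tree[i + 1:]

def fitness(tree, xs, ys, penalty=0.01):
    """Complexity-penalized mean squared error (lower is better)."""
    mse = sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + penalty * size(tree)

def search(xs, ys, iterations=2000, seed=0, max_size=15):
    """Mutation-only hill climbing: keep a candidate only if it improves fitness."""
    rng = random.Random(seed)
    best = random_tree(rng, 3)
    best_fit = fitness(best, xs, ys)
    for _ in range(iterations):
        cand = mutate(rng, best)
        if size(cand) > max_size:
            continue  # cap tree size to limit bloat
        cand_fit = fitness(cand, xs, ys)
        if cand_fit < best_fit:
            best, best_fit = cand, cand_fit
    return best, best_fit

xs = [i / 10 for i in range(-10, 11)]
ys = [x * x + math.sin(x) for x in xs]
best, best_fit = search(xs, ys)
print(best, best_fit)
```

Even in this toy form the limitations are visible: each step is a blind structural edit, and the outcome depends heavily on the seed, the penalty weight, and the iteration budget.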

Sparse Regression (SINDy)

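Assumes the target is a sparse linear combination of terms drawn from a library of candidate functions
Uses sparse optimization (e.g., LASSO or sequentially thresholded least squares)
Faster than genetic programming, but limited to linear combinations of the chosen basis functions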
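
A minimal sketch of the sparse-regression idea, using only NumPy. It applies sequentially thresholded least squares to a small hand-picked library; the library contents, the threshold, and the function name `stlsq` are assumptions for illustration. SINDy proper targets dynamical systems (fitting time derivatives), whereas this block regresses y on x directly to keep the example short.

```python
import numpy as np

def stlsq(theta, y, threshold=0.1, iterations=10):
    """Sequentially thresholded least squares: fit, zero small coefficients, refit."""
    coef = np.linalg.lstsq(theta, y, rcond=None)[0]
    for _ in range(iterations):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        active = ~small
        if not active.any():
            break
        # Refit using only the columns that survived thresholding.
        coef[active] = np.linalg.lstsq(theta[:, active], y, rcond=None)[0]
    return coef

# Noise-free data from a target that is sparse in the library below.
x = np.linspace(-2.0, 2.0, 200)
y = 2.0 * x - 0.5 * x**3

# Library of candidate functions, one column per candidate.
names = ["1", "x", "x^2", "x^3", "sin(x)", "cos(x)"]
theta = np.column_stack([np.ones_like(x), x, x**2, x**3, np.sin(x), np.cos(x)])

coef = stlsq(theta, y)
terms = [f"{c:+.3f}*{n}" for c, n in zip(coef, names) if c != 0.0]
print("y =", " ".join(terms))
```

The method can only return expressions of the form Σ cⱼ·θⱼ(x); a target such as sin(2πx) · e^(-x²) is out of reach unless that exact product happens to be in the library.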

Gradient-Based Approaches

Recent work enables differentiable symbolic regression:

EML Trees (2026)

eml-universal-operator enables gradient-based optimization:

Uniform tree structure (all nodes are eml operators)
Fully differentiable
Optimizable with standard deep learning optimizers (Adam)
Can recover exact closed forms at shallow depths (≤4)

Neural Symbolic Methods

AI Feynman: Combines neural network fitting with symbolic property testing
SymbolicGPT: Transformer-based generation of expressions
Deep Symbolic Regression: A recurrent neural network, trained with reinforcement learning, that generates candidate expression trees

Evaluation Metrics

Accuracy: R², MSE, NMSE on held-out data
Complexity: Number of nodes, operators, or description length
Pareto Frontier: Trade-off between accuracy and simplicity
Exact Recovery: Whether the true underlying formula is found
Generalization: Performance on out-of-distribution data
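
The accuracy, complexity, and Pareto-frontier metrics above can be computed as in the sketch below. The candidate expressions, their hand-counted node complexities, and the function names are hypothetical; NMSE is defined here as MSE divided by the variance of the targets, which is one common convention among several.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def nmse(y_true, y_pred):
    """MSE normalized by the variance of the targets."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def pareto_front(candidates):
    """Keep candidates not dominated in (error, complexity); lower is better for both."""
    front = []
    for name, err, comp in candidates:
        dominated = any(
            e <= err and c <= comp and (e < err or c < comp)
            for _, e, c in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Held-out data from a known target, y = x^2 + 0.5*x.
x_test = np.linspace(-1.0, 1.0, 50)
y_test = x_test**2 + 0.5 * x_test

# Hypothetical candidates: (name, prediction, node count of the expression tree).
models = [
    ("x", x_test, 1),
    ("x^2", x_test**2, 3),
    ("x^2 + 0.5*x", x_test**2 + 0.5 * x_test, 7),
    ("x^2 + 0.5*x + 0*sin(x)", x_test**2 + 0.5 * x_test + 0.0 * np.sin(x_test), 12),
]

candidates = [(name, nmse(y_test, pred), comp) for name, pred, comp in models]
front = pareto_front(candidates)
print(front)
```

With this NMSE convention, R² = 1 − NMSE, so the two accuracy metrics carry the same information. The padded fourth candidate has the same error as the third but higher complexity, so it is dominated and drops off the frontier.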

Applications

Domain	Example
Physics	Discovering force laws, equations of state
Chemistry	Reaction kinetics, structure-property relationships
Biology	Population dynamics, gene regulatory networks
Engineering	System identification, control laws
Finance	Discovering pricing formulas, risk models

Challenges

Scalability: Exponential growth of expression space with size
Noise Sensitivity: Overfitting to data noise
Non-uniqueness: Multiple expressions may fit data equally well
Dimensional Analysis: Incorporating physical units/constraints
Interpretability: Balancing accuracy with human-understandable forms
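
The non-uniqueness challenge is easy to reproduce. In the sketch below, two different expressions (chosen for illustration) agree closely on a narrow training range but diverge outside it, so in-range error alone cannot tell them apart.

```python
import numpy as np

def f_a(x):
    return np.sin(x)

def f_b(x):
    # Third-order Taylor polynomial of sin(x) around 0.
    return x - x**3 / 6.0

x_train = np.linspace(-0.5, 0.5, 101)  # range the data was sampled from
x_far = np.linspace(2.0, 3.0, 101)     # out-of-distribution range

gap_train = np.max(np.abs(f_a(x_train) - f_b(x_train)))
gap_far = np.max(np.abs(f_a(x_far) - f_b(x_far)))
print(f"max gap on training range: {gap_train:.2e}, outside it: {gap_far:.2e}")
```

With even modest noise on the training data, both candidates would fit equally well, which is one reason out-of-distribution evaluation and prior constraints matter.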

Future Directions

Integration with large language models for prior knowledge
Physics-informed constraints (conservation laws, symmetries)
Multi-objective optimization (accuracy, simplicity, generalization)
Real-time/online symbolic regression
Human-in-the-loop discovery workflows

eml-universal-operator: A universal operator enabling gradient-based symbolic regression
andrzej-odrzywolek: Researcher who discovered the EML universal operator
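computerized-adaptive-testing: The dynamic item-selection strategy in CAT and the adaptive search in symbolic regression are structurally similar in how they handle the exploration-exploitation trade-off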
