title: Symbolic Regression
created: 2026-04-16
updated: 2026-04-17
type: concept
tags: optimization, training, model
sources: raw/papers/odrzywolek-eml-universal-operator-2026.md

Symbolic Regression

Symbolic regression is a machine learning technique that discovers explicit mathematical expressions from data, rather than fitting fixed-form models. Unlike traditional regression (which optimizes parameters within a predetermined functional form), symbolic regression searches the space of possible equation structures.

Core Problem

Given data points (xᵢ, yᵢ), find a closed-form expression f such that yᵢ ≈ f(xᵢ) for all i, where f is composed of elementary operations and functions.

Key Distinction:

  • Traditional regression: y = β₀ + β₁x + β₂x² (form fixed, optimize β)
  • Symbolic regression: Discover that y = sin(2πx) · e^(-x²) from data
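
The distinction can be made concrete with a toy sketch. It assumes a small, hand-picked set of candidate structures (the `candidates` dictionary is hypothetical, not something a real system would be given) and fits a single scale parameter `a` inside each one. The structure is what gets searched; the parameter is what gets fitted.

```python
import math

def target(x):
    # The example formula from the text; in a real problem it would be unknown.
    return math.sin(2 * math.pi * x) * math.exp(-x * x)

xs = [i / 50 - 1.0 for i in range(101)]  # 101 sample points on [-1, 1]
ys = [target(x) for x in xs]

# Hypothetical candidate structures; each gets one free scale parameter a.
candidates = {
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "sin(2*pi*x)": lambda x: math.sin(2 * math.pi * x),
    "exp(-x^2)": lambda x: math.exp(-x * x),
    "sin(2*pi*x) * exp(-x^2)": lambda x: math.sin(2 * math.pi * x) * math.exp(-x * x),
}

def fit_scale(g, xs, ys):
    """Closed-form least-squares fit of a in y ≈ a * g(x); returns (a, mse)."""
    gs = [g(x) for x in xs]
    a = sum(gi * yi for gi, yi in zip(gs, ys)) / sum(gi * gi for gi in gs)
    mse = sum((a * gi - yi) ** 2 for gi, yi in zip(gs, ys)) / len(xs)
    return a, mse

# Structure search: fit every candidate form, keep the one with the lowest error.
results = {name: fit_scale(g, xs, ys) for name, g in candidates.items()}
best = min(results, key=lambda name: results[name][1])
print(best, results[best])
```

Traditional regression corresponds to picking one entry of `candidates` in advance and calling `fit_scale` once. Real symbolic regression systems replace the fixed dictionary with a search over an open-ended space of expressions.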

Traditional Approaches

Genetic Programming

The dominant approach historically:

  • Representation: Expression trees with heterogeneous nodes (+, -, ×, ÷, sin, exp, etc.)
  • Search: Evolutionary algorithms (mutations, crossovers)
  • Fitness: Mean squared error or complexity-penalized metrics
  • Tools: Eureqa, gplearn, PySR
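
A minimal sketch of these ingredients, under simplifying assumptions: trees are nested tuples, the operator set is deliberately small (no division or exp, to avoid numerical failures), selection is plain truncation with elitism, and only mutation is used (no crossover). All names here are illustrative and do not reflect the API of Eureqa, gplearn, or PySR.

```python
import math
import random

BINARY = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b, "mul": lambda a, b: a * b}
UNARY = {"sin": math.sin, "cos": math.cos}
MAX_NODES = 40  # size cap so trees cannot grow without bound

def random_tree(rng, depth):
    """Leaves are "x" or a float constant; inner nodes are (op, child, ...) tuples."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else round(rng.uniform(-2.0, 2.0), 2)
    if rng.random() < 0.3:
        return (rng.choice(sorted(UNARY)), random_tree(rng, depth - 1))
    return (rng.choice(sorted(BINARY)), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if isinstance(tree, tuple):
        op, *args = tree
        vals = [evaluate(arg, x) for arg in args]
        return UNARY[op](vals[0]) if len(vals) == 1 else BINARY[op](vals[0], vals[1])
    return x if tree == "x" else tree

def size(tree):
    return 1 + sum(size(arg) for arg in tree[1:]) if isinstance(tree, tuple) else 1

def fitness(tree, data, penalty=1e-3):
    """Complexity-penalized MSE (lower is better); failed evaluations score infinity."""
    try:
        mse = sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)
    except (ValueError, OverflowError):
        return math.inf
    return mse + penalty * size(tree) if math.isfinite(mse) else math.inf

def mutate(tree, rng):
    """Walk down a random path and replace the subtree found there with a fresh one."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, 2)
    op, *args = tree
    i = rng.randrange(len(args))
    args[i] = mutate(args[i], rng)
    return (op, *args)

def evolve(data, generations=60, pop_size=50, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(rng, 3) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, data))
        survivors = pop[: pop_size // 2]  # elitist truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            child = mutate(rng.choice(survivors), rng)
            if size(child) <= MAX_NODES:
                children.append(child)
        pop = survivors + children
    return min(pop, key=lambda t: fitness(t, data))

# Synthetic data from x^2 + sin(x) on [-2, 2].
data = [(i / 10, (i / 10) ** 2 + math.sin(i / 10)) for i in range(-20, 21)]
best = evolve(data)
print(best, fitness(best, data))
```

Nothing guarantees that a run this small recovers the generating formula. The sketch is meant to show the moving parts, not competitive performance.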

Limitations:

  • Discrete search space (combinatorial explosion)
  • Slow convergence for complex expressions
  • No gradient information to guide the search
  • Sensitive to hyperparameter choices

Sparse Regression (SINDy)

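  • Assumes the target is a sparse linear combination of terms from a library of candidate functions
  • Selects terms with sparsity-promoting regression (LASSO or sequentially thresholded least squares)
  • Faster than genetic programming, but limited to linear combinations of the chosen basis functions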
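
The sparse-regression idea can be sketched with sequentially thresholded least squares: fit all library terms, zero out small coefficients, refit on the survivors, and repeat. The sketch below assumes noiseless derivative measurements from a made-up system dx/dt = 2x − 0.5x³, a hypothetical five-term library, and a threshold of 0.1. It uses only the standard library to stay self-contained; in practice one would reach for NumPy or a dedicated package such as PySINDy.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def least_squares(columns, y):
    """Least squares via the normal equations (adequate for tiny toy problems)."""
    A = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    b = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve(A, b)

def stlsq(library, xs, dxdt, threshold=0.1, iterations=10):
    """Sequentially thresholded least squares over a dict of named basis functions."""
    cols = {name: [g(x) for x in xs] for name, g in library.items()}
    active = list(library)
    coef = {}
    for _ in range(iterations):
        coef = dict(zip(active, least_squares([cols[n] for n in active], dxdt)))
        kept = [n for n in active if abs(coef[n]) >= threshold]
        if kept == active or not kept:
            break
        active = kept  # drop small terms, then refit on the remaining ones
    return {n: c for n, c in coef.items() if abs(c) >= threshold}

# Hypothetical candidate library and synthetic data.
library = {
    "1": lambda x: 1.0,
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "x^3": lambda x: x ** 3,
    "sin(x)": math.sin,
}
xs = [i / 10 - 2.0 for i in range(41)]
dxdt = [2.0 * x - 0.5 * x ** 3 for x in xs]
model = stlsq(library, xs, dxdt)
print(model)
```

The limitation noted above is visible in the code: the method can only return weighted sums of entries in `library`, so a term like sin(x²) is unreachable unless it is added to the library by hand.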

Gradient-Based Approaches

Recent work enables differentiable symbolic regression:

EML Trees (2026)

Trees built from the eml-operator support gradient-based optimization:

  • Uniform tree structure (all nodes are eml operators)
  • Fully differentiable
  • Optimizable with standard deep learning optimizers (e.g., Adam)
  • Can recover exact closed forms at shallow tree depths (≤ 4)
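
To illustrate why differentiability matters, here is a rough single-node sketch. This page does not define the operator, so the code assumes eml(a, b) = exp(a) − ln(b); check that against the eml-operator page and the source paper before relying on it. The parametrization eml(w·x, c), the function `fit_eml_node`, and the synthetic data are all hypothetical, and plain gradient descent with hand-derived gradients stands in for Adam. It is not the training procedure from the source.

```python
import math

def eml(a, b):
    # ASSUMPTION: this definition is not stated on this page.
    return math.exp(a) - math.log(b)

def fit_eml_node(xs, ys, lr=0.1, steps=5000):
    """Fit y ≈ eml(w * x, c) by plain gradient descent on the mean squared error."""
    w, c = 0.0, 1.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_c = 0.0
        for x, y in zip(xs, ys):
            r = eml(w * x, c) - y                       # residual
            grad_w += 2 * r * x * math.exp(w * x) / n   # d/dw exp(w*x) = x * exp(w*x)
            grad_c += 2 * r * (-1.0 / c) / n            # d/dc (-ln c) = -1/c
        w -= lr * grad_w
        c -= lr * grad_c
    return w, c

xs = [i / 20 for i in range(21)]          # 21 points on [0, 1]
ys = [eml(0.5 * x, math.e) for x in xs]   # synthetic target: exp(0.5*x) - 1
w, c = fit_eml_node(xs, ys)
print(w, c)
```

Because every node is the same smooth operator, the error signal flows to all parameters at once. A genetic search, by contrast, would have to discover the constants by trial and error.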

Neural Symbolic Methods

  • AI Feynman: Combines neural network fitting with tests for symbolic properties such as symmetry and separability
  • SymbolicGPT: Transformer-based generation of expressions
  • Deep Symbolic Regression: A recurrent neural network, trained with reinforcement learning, that generates expression trees

Evaluation Metrics

  1. Accuracy: R², MSE, NMSE on held-out data
  2. Complexity: Number of nodes, operators, or description length
  3. Pareto Frontier: Trade-off between accuracy and simplicity
  4. Exact Recovery: Whether the true underlying formula is found
  5. Generalization: Performance on out-of-distribution data
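
The first three metrics are easy to compute by hand. The sketch below assumes one common NMSE convention (MSE divided by the variance of the targets, so that R² = 1 − NMSE) and represents each candidate model as a hypothetical `(name, complexity, error)` tuple.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def nmse(y_true, y_pred):
    """MSE divided by the variance of the targets (one common NMSE convention)."""
    mean = sum(y_true) / len(y_true)
    var = sum((t - mean) ** 2 for t in y_true) / len(y_true)
    return mse(y_true, y_pred) / var

def r2(y_true, y_pred):
    return 1.0 - nmse(y_true, y_pred)

def pareto_front(models):
    """Keep the (name, complexity, error) entries that no other entry dominates."""
    def dominates(a, b):
        # a is no worse than b on both axes and strictly better on at least one
        return a[1] <= b[1] and a[2] <= b[2] and (a[1] < b[1] or a[2] < b[2])
    return [m for m in models if not any(dominates(o, m) for o in models)]

# Made-up numbers, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(mse(y_true, y_pred), nmse(y_true, y_pred), r2(y_true, y_pred))

models = [("a", 3, 0.5), ("b", 5, 0.1), ("c", 7, 0.1), ("d", 9, 0.01), ("e", 4, 0.6)]
print([name for name, _, _ in pareto_front(models)])
```

In this toy set, model "c" is dropped because "b" reaches the same error with fewer nodes, and "e" is dropped because "a" is both simpler and more accurate.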

Applications

| Domain      | Example                                             |
|-------------|-----------------------------------------------------|
| Physics     | Discovering force laws, equations of state          |
| Chemistry   | Reaction kinetics, structure-property relationships |
| Biology     | Population dynamics, gene regulatory networks       |
| Engineering | System identification, control laws                 |
| Finance     | Discovering pricing formulas, risk models           |

Challenges

  1. Scalability: The expression space grows exponentially with expression size
  2. Noise Sensitivity: Flexible expressions can overfit noise in the data
  3. Non-uniqueness: Multiple expressions may fit data equally well
  4. Dimensional Analysis: Incorporating physical units/constraints
  5. Interpretability: Balancing accuracy with human-understandable forms
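
The scalability problem can be quantified with a rough count. The sketch assumes binary operators only: the number of tree shapes with n internal nodes is the Catalan number C(n), each internal node picks one of B operators, and each of the n + 1 leaves picks one of L symbols. Algebraically equivalent expressions are counted separately, which is also a reminder of the non-uniqueness problem. The operator and leaf counts used below are arbitrary choices.

```python
from math import comb

def count_trees(n_internal, n_binary_ops, n_leaf_symbols):
    """Syntactically distinct binary expression trees with n_internal operator nodes."""
    catalan = comb(2 * n_internal, n_internal) // (n_internal + 1)  # tree shapes
    return catalan * n_binary_ops ** n_internal * n_leaf_symbols ** (n_internal + 1)

# Example: 4 binary operators (+, -, ×, ÷) and 2 leaf symbols (x and one constant).
for n in range(1, 9):
    print(n, count_trees(n, n_binary_ops=4, n_leaf_symbols=2))
```

Adding unary functions or more variables makes the count grow faster still, which is why exhaustive enumeration only works for very small expressions.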

Future Directions

  • Integration with large language models for prior knowledge
  • Physics-informed constraints (conservation laws, symmetries)
  • Multi-objective optimization (accuracy, simplicity, generalization)
  • Real-time/online symbolic regression
  • Human-in-the-loop discovery workflows