---
title: "Symbolic Regression"
created: 2026-04-16
updated: 2026-04-17
type: concept
tags: [optimization, training, model]
sources: [raw/papers/odrzywolek-eml-universal-operator-2026.md]
---

# Symbolic Regression

**Symbolic regression** is a machine learning technique that discovers explicit mathematical expressions from data, rather than fitting fixed-form models. Unlike traditional regression (which optimizes parameters within a predetermined functional form), symbolic regression searches the space of possible equation structures.

## Core Problem

Given data points (xᵢ, yᵢ), find a closed-form expression f such that y ≈ f(x), where f is composed of elementary operations and functions.

**Key Distinction:**
- Traditional regression: y = β₀ + β₁x + β₂x² (form fixed, optimize β)
- Symbolic regression: Discover that y = sin(2πx) · e^(-x²) from data

## Traditional Approaches

### Genetic Programming

The dominant approach historically:
- **Representation**: Expression trees with heterogeneous nodes (+, -, ×, ÷, sin, exp, etc.)
- **Search**: Evolutionary algorithms (mutations, crossovers)
- **Fitness**: Mean squared error or complexity-penalized metrics
- **Tools**: Eureqa, gplearn, PySR

**Limitations:**
- Discrete search space (combinatorial explosion)
- Slow convergence for complex expressions
- No gradient information
- Sensitive to hyperparameter choices

### Sparse Regression (SINDy)

- Assumes the target is a sparse linear combination of functions drawn from a library of candidates
- Uses LASSO/sparse optimization
- Faster, but limited to linear combinations of basis functions

## Gradient-Based Approaches

Recent work enables differentiable symbolic regression:

### EML Trees (2026)

[[eml-operator|Odrzywołek's EML representation]] enables gradient-based optimization:
- Uniform tree structure (all nodes are `eml` operators)
- Fully differentiable
- Optimizable with standard deep learning optimizers (Adam)
- Can recover exact closed forms at shallow depths (≤4)

### Neural Symbolic Methods

- **AI Feynman**: Combines neural network fitting with symbolic property testing
- **SymbolicGPT**: Transformer-based generation of expressions
- **Deep Symbolic Regression**: Neural networks predicting expression trees

## Evaluation Metrics

1. **Accuracy**: R², MSE, NMSE on held-out data
2. **Complexity**: Number of nodes, operators, or description length
3. **Pareto Frontier**: Trade-off between accuracy and simplicity
4. **Exact Recovery**: Whether the true underlying formula is found
5. **Generalization**: Performance on out-of-distribution data

## Applications

| Domain | Example |
|--------|---------|
| Physics | Discovering force laws, equations of state |
| Chemistry | Reaction kinetics, structure-property relationships |
| Biology | Population dynamics, gene regulatory networks |
| Engineering | System identification, control laws |
| Finance | Discovering pricing formulas, risk models |

## Challenges

1. **Scalability**: Exponential growth of expression space with size
2. **Noise Sensitivity**: Overfitting to data noise
3. **Non-uniqueness**: Multiple expressions may fit data equally well
4. **Dimensional Analysis**: Incorporating physical units/constraints
5. **Interpretability**: Balancing accuracy with human-understandable forms

## Future Directions

- Integration with large language models for prior knowledge
- Physics-informed constraints (conservation laws, symmetries)
- Multi-objective optimization (accuracy, simplicity, generalization)
- Real-time/online symbolic regression
- Human-in-the-loop discovery workflows

## Related Concepts

- [[eml-operator]]: A universal operator enabling gradient-based symbolic regression
- [[andrzej-odrzywolek]]: Researcher who discovered the EML universal operator
- [[computerized-adaptive-testing]]: Dynamic item selection in CAT and adaptive search in symbolic regression are structurally similar in how they handle the exploration-exploitation trade-off
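
As a concrete illustration of the sparse-regression idea described under Traditional Approaches, the sketch below fits a sparse linear combination of candidate functions to synthetic data. It is a minimal example under stated assumptions, not a reference implementation: it uses sequentially thresholded least squares in place of LASSO, and the function name `stlsq`, the `threshold` value, the five-term library, and the target formula are all hypothetical choices made for illustration.

```python
import numpy as np


def stlsq(theta, y, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares (illustrative helper).

    Repeatedly solves a least-squares problem, zeroes coefficients whose
    magnitude is below `threshold`, and refits on the surviving columns.
    """
    coef = np.linalg.lstsq(theta, y, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        active = ~small
        if not active.any():
            break
        # Refit using only the candidate functions that are still active.
        coef[active] = np.linalg.lstsq(theta[:, active], y, rcond=None)[0]
    return coef


# Synthetic data from an assumed ground truth: y = 2*sin(x) + 0.5*x^2.
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 200)
y = 2.0 * np.sin(x) + 0.5 * x**2 + 0.001 * rng.standard_normal(x.size)

# Library of candidate functions (an assumption; real libraries are larger).
names = ["1", "x", "x^2", "sin(x)", "cos(x)"]
theta = np.column_stack([np.ones_like(x), x, x**2, np.sin(x), np.cos(x)])

coef = stlsq(theta, y)

# Print only the terms that survived thresholding.
terms = [f"{c:.3f}*{name}" for c, name in zip(coef, names) if c != 0.0]
print("y ≈ " + " + ".join(terms))
```

The result can only be a linear combination of the library columns, which is the limitation noted for this family of methods: a nested form such as sin(x²) is recoverable only if it is placed in the library beforehand.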