The First Comprehensive Evaluation Framework for
Formulaic Factor Mining by Large Language Models
A unified benchmark and toolchain for systematically evaluating LLMs in formulaic alpha factor mining, covering generation, evaluation, and searching tasks with executable financial DSLs.
This workflow illustrates how LLMs interact with prompts and the backtesting engine (Qlib) to generate and refine formulaic alpha factors.
Recent work shows that LLMs can generate alpha factors, but their true capabilities remain unclear.
AlphaBench is the first systematic benchmark designed to evaluate LLMs across the entire formulaic alpha mining workflow:
Covers the full lifecycle of alpha factors: generation → evaluation → searching
Standardized searching baselines for LLM-driven factor discovery, including chain-based refinement, tree-based exploration, and evolutionary algorithms
Multi-dimensional metrics: reliability, stability, accuracy, ranking precision, scoring error, search improvement, and cost
Supports Chain-of-Experience, Tree-of-Thought, and Evolutionary Algorithms as comparable baselines
Built on the Qlib backtesting framework, enabling iterative factor search and automated evaluation with real market data (a minimal usage sketch follows this list)
Explicit token usage and efficiency evaluation for real-world deployment
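To make the Qlib integration concrete, here is a minimal sketch of scoring one candidate factor with Qlib's expression engine. The data path, universe, date range, and the example momentum formula are illustrative assumptions, not AlphaBench's actual configuration.

```python
# Minimal sketch: score one candidate factor with Qlib's expression engine.
# Assumes Qlib's daily CN data has been downloaded to the default path;
# the universe, dates, and the example formula are illustrative choices.
import qlib
from qlib.data import D

qlib.init(provider_uri="~/.qlib/qlib_data/cn_data")  # path is an assumption

fields = [
    "Mean($close, 5) / Mean($close, 20) - 1",  # candidate: short/long momentum
    "Ref($close, -2) / Ref($close, -1) - 1",   # next-day forward return as label
]
df = D.features(D.instruments("csi300"), fields,
                start_time="2020-01-01", end_time="2020-12-31")
df.columns = ["factor", "label"]

# Cross-sectional rank IC per day, then aggregate into mean RankIC and ICIR.
daily_ic = df.groupby(level="datetime").apply(
    lambda g: g["factor"].corr(g["label"], method="spearman")
)
print(f"RankIC: {daily_ic.mean():.4f}  ICIR: {daily_ic.mean() / daily_ic.std():.4f}")
```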
AlphaBench evaluates LLMs through structured tasks that mirror real-world factor mining workflows.
LLMs transform natural language descriptions into candidate alpha factors
From abstract financial concepts (e.g., momentum, mean reversion) to executable formulas
Generate multiple diverse factors under a given semantic direction (e.g., volatility, trend); see the sketch below
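A minimal sketch of the generation task: `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompt wording and operator whitelist are illustrative assumptions, not AlphaBench's actual template.

```python
# Sketch of the generation task: natural-language concept -> candidate formulas.
# `call_llm` is a hypothetical placeholder; the prompt is illustrative.

PROMPT = """You are a quantitative researcher. Using only the operators
Mean, Std, Ref, Rank, Corr and the fields $open, $high, $low, $close, $volume,
write {n} diverse alpha factor formulas expressing the concept: {concept}.
Return one formula per line, nothing else."""

def call_llm(prompt: str) -> str:
    """Placeholder: route to your LLM provider of choice."""
    raise NotImplementedError

def generate_factors(concept: str, n: int = 5) -> list[str]:
    reply = call_llm(PROMPT.format(n=n, concept=concept))
    # One formula per line; drop empty lines and surrounding whitespace.
    return [line.strip() for line in reply.splitlines() if line.strip()]

# Example: generate_factors("short-term mean reversion") might yield
# formulas such as "-($close / Mean($close, 10) - 1)".
```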
LLMs act as intelligent evaluators or "judges", predicting factor quality without full backtesting
Select top-K factors from a candidate pool
Predict factor quality (signal type, performance, robustness, win rate, skewness)
Zero-shot factor evaluation remains the weakest capability of current LLMs
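To make the judging task concrete, a minimal sketch of LLM-based top-K selection follows. The 0-10 scoring scheme, the JSON reply format, and the `call_llm` helper are illustrative assumptions.

```python
# Sketch of the evaluation task: the LLM scores candidate formulas without
# backtesting, and we keep the top-K.
import json

JUDGE_PROMPT = """Rate each alpha factor formula below for expected predictive
quality on daily equity returns, from 0 (useless) to 10 (excellent). Respond
with a JSON list of numbers, one per formula, in order.
Formulas:
{formulas}"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for any chat-completion client

def select_top_k(candidates: list[str], k: int = 3) -> list[str]:
    listing = "\n".join(f"{i + 1}. {f}" for i, f in enumerate(candidates))
    scores = json.loads(call_llm(JUDGE_PROMPT.format(formulas=listing)))
    # Rank candidates by the judge's score and keep the best k.
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [formula for _, formula in ranked[:k]]
```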
Evaluate whether LLMs can iteratively improve factors under different search paradigms
Sequential refinement: Starting from a seed factor, the LLM iteratively improves the formula based on execution feedback.
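A minimal sketch of this sequential loop, assuming hypothetical `call_llm` and `evaluate_ic` helpers (the latter could be the Qlib RankIC routine sketched earlier):

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for any chat-completion client

def evaluate_ic(formula: str) -> float:
    raise NotImplementedError  # e.g. the Qlib RankIC routine sketched earlier

def refine(seed: str, steps: int = 5) -> tuple[str, float]:
    current, cur_ic = seed, evaluate_ic(seed)
    best, best_ic = current, cur_ic
    for _ in range(steps):
        # Feed the latest score back and ask for one improved variant.
        current = call_llm(
            f"Factor {current} achieved RankIC {cur_ic:.4f}. "
            "Propose one improved variant; return only the formula."
        ).strip()
        cur_ic = evaluate_ic(current)
        if cur_ic > best_ic:  # keep the best formula seen so far
            best, best_ic = current, cur_ic
    return best, best_ic
```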
Branching exploration: Multiple factor variants are generated at each node, with poor branches pruned based on performance.
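A minimal sketch of branching exploration as a beam search, again with hypothetical helpers; the branch factor, beam width, and depth are illustrative defaults:

```python
def propose_variants(formula: str, k: int) -> list[str]:
    raise NotImplementedError  # one LLM call returning k variant formulas

def evaluate_ic(formula: str) -> float:
    raise NotImplementedError  # backtest-based score, e.g. mean RankIC

def tree_search(seed: str, depth: int = 3, branch: int = 4, beam: int = 2) -> str:
    frontier = [seed]
    for _ in range(depth):
        # Expand every surviving node into `branch` variants ...
        children = [v for f in frontier for v in propose_variants(f, branch)]
        # ... then prune: keep only the `beam` best-scoring branches.
        frontier = sorted(children, key=evaluate_ic, reverse=True)[:beam]
    return frontier[0]  # frontier is sorted, so the head is the best formula
```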
Population-based optimization: Factors evolve through mutation and crossover, and selection pressure based on IC/RankIC fitness drives the population toward higher-performing formulas. Of the three paradigms, this one shows the best search efficiency.
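A minimal sketch of the evolutionary loop, assuming hypothetical LLM-driven mutation and crossover operators and an IC-based fitness function; the selection scheme and mutation rate are illustrative assumptions:

```python
import random

def evaluate_ic(formula: str) -> float:
    raise NotImplementedError  # fitness, e.g. mean RankIC from a backtest

def llm_mutate(formula: str) -> str:
    raise NotImplementedError  # LLM call: "perturb this formula slightly"

def llm_crossover(a: str, b: str) -> str:
    raise NotImplementedError  # LLM call: "combine these two formulas"

def evolve(seeds: list[str], generations: int = 10, pop_size: int = 20) -> str:
    population = list(seeds)  # assumes at least two seed formulas
    for _ in range(generations):
        fitness = {f: evaluate_ic(f) for f in population}
        # Selection: the fitter half survive as parents.
        parents = sorted(population, key=fitness.get, reverse=True)[: pop_size // 2]
        # Variation: refill the population via crossover plus occasional mutation.
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = llm_crossover(a, b)
            if random.random() < 0.3:  # mutation rate is an illustrative choice
                child = llm_mutate(child)
            children.append(child)
        population = parents + children
    return max(population, key=evaluate_ic)
```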
Comprehensive evaluation across all tasks reveals distinct patterns in LLM capabilities.
AlphaBench is a portable evaluation template for any executable-DSL reasoning task.
Designed for both research benchmarking and real-world system design.
@inproceedings{luo2026alphabench,
  title={AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining},
  author={Haochen Luo and Ho Tin Ko and Jiandong Chen and David Sun and Yuan Zhang and Chen Liu},
  booktitle={International Conference on Learning Representations},
  year={2026}
}