The First Comprehensive Evaluation Framework for
Formulaic Factor Mining by Large Language Models
A unified benchmark and toolchain for systematically evaluating LLMs in formulaic alpha factor mining, covering generation, evaluation, and searching tasks with executable financial DSLs.
Direct generation, quality evaluation, and iterative searching — covering the full alpha mining workflow
End-to-end toolchain with 1,857 instructions, FFO execution engine, and Qlib-based backtesting
Three standardized search paradigms: Chain-of-Experience, Tree-of-Thought, and Evolutionary Algorithms
Alpha factors are mathematical expressions derived from market data (price, volume, fundamentals) that predict future stock returns — the core building block of quantitative investment strategies. Mining effective alpha factors has traditionally required deep domain expertise and exhaustive manual search.
Large Language Models offer a promising paradigm shift: they can encode financial knowledge, generate factor candidates from natural language descriptions, and iteratively refine formulas with execution feedback. AlphaBench is the first systematic benchmark to rigorously evaluate LLMs across the entire formulaic alpha mining workflow — from generation to evaluation to iterative searching.
Figure: Overall benchmark — AlphaBench evaluates LLMs across the full formulaic alpha mining workflow using real financial data and task-aligned quantitative metrics.
Figure: Task description — AlphaBench decomposes alpha mining into three core tasks: factor generation, factor evaluation, and iterative factor searching.
Can LLMs translate financial concepts into executable alpha factor formulas? We probe two scenarios: (1) Text2Alpha — converting a textual description (e.g., "5-day momentum adjusted for volatility") into a syntactically valid formula; (2) Directional Mining — generating a diverse set of factors under a given semantic direction (e.g., "volatility-based"). We measure reliability, output stability, and semantic alignment.
Running a full backtesting simulation is expensive. We ask: can LLMs act as zero-shot judges and predict factor quality without executing the formula? This includes ranking a candidate pool to select top-K factors, and scoring each factor on multiple quality dimensions — IC, RankIC, robustness, win rate, and skewness. Our findings reveal this remains the weakest capability of current LLMs.
Different search algorithms make different demands on LLMs. We evaluate LLMs as components within three paradigms: Chain-of-Experience (sequential refinement), Tree-of-Thought (branching exploration), and Evolutionary Algorithms (population-based mutation & crossover). We provide standardized baseline implementations and compare search quality, cost efficiency, and factor diversity across all three approaches.
The generation instruction set probes whether LLMs can translate financial knowledge into executable formulaic alpha factor expressions. We design two complementary subtasks to cover different generation scenarios — from concept-grounded translation to open-ended exploration.
The model receives a natural language description of a financial concept and must produce a syntactically valid, executable alpha formula. Inputs range from simple directional ideas (e.g., "price momentum") to compound multi-operator expressions. Ground truth formulas are curated from expert-designed Alpha158 factors with aligned textual annotations.
Given a semantic direction (e.g., "volatility-based factors"), the model must generate a diverse set of K valid alpha formulas that collectively explore that concept space. This tests the model's ability to avoid degenerate repetition and produce functionally distinct candidates under a single thematic prompt.
(-1 * Rank( (close / Mean(close, 5) - 1) / Std(close, 20) ))
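For illustration, the example formula above can be sketched in pandas over a wide price frame (dates × stocks). Here Mean/Std are read as rolling time-series operators and Rank as a cross-sectional percentile rank; these semantics are an assumption for illustration and may differ from the exact DSL operator definitions.

```python
import numpy as np
import pandas as pd

def example_factor(close: pd.DataFrame) -> pd.DataFrame:
    # (close / Mean(close, 5) - 1): deviation from the 5-day rolling mean
    deviation = close / close.rolling(5).mean() - 1
    # ... / Std(close, 20): scale by 20-day rolling volatility
    scaled = deviation / close.rolling(20).std()
    # Rank(...): cross-sectional percentile rank per day, then negate the signal
    return -1 * scaled.rank(axis=1, pct=True)

# toy demo: random-walk prices for 3 stocks over 30 days
rng = np.random.default_rng(0)
close = pd.DataFrame(
    100 + rng.normal(0, 1, size=(30, 3)).cumsum(axis=0),
    columns=["A", "B", "C"],
)
factor = example_factor(close)
```

The first 19 rows are NaN because the 20-day rolling window is not yet filled; from then on each row holds negated cross-sectional ranks in (-1, 0].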
FFO is our core factor execution and evaluation engine, built on top of the Qlib backtesting framework. It serves as the closed-loop feedback mechanism that bridges LLM outputs with real market performance signals.
Automatically parses and checks whether a generated formula is syntactically valid and executable within the Qlib DSL. Returns structured error messages for repair loops.
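As a toy illustration of this check, the sketch below whitelists operators and data fields and returns a structured verdict an LLM repair loop could consume. The operator list, field names, and `check_formula` itself are assumptions for illustration, not FFO's actual API or the full Qlib grammar.

```python
import re

# Illustrative whitelists -- assumptions, not FFO's real grammar
ALLOWED_OPS = {"Rank", "Mean", "Std", "Sum", "Ref", "Delta", "Corr", "Max", "Min"}
ALLOWED_FIELDS = {"open", "high", "low", "close", "volume", "vwap"}

def check_formula(expr: str) -> dict:
    """Return a structured verdict with machine-readable error messages."""
    errors = []
    if expr.count("(") != expr.count(")"):
        errors.append("unbalanced parentheses")
    for name in re.findall(r"[A-Za-z_]\w*", expr):
        if name not in ALLOWED_OPS and name.lower() not in ALLOWED_FIELDS:
            errors.append(f"unknown identifier: {name}")
    return {"valid": not errors, "errors": errors}
```

A valid formula like `Rank(Mean(close, 5))` passes, while `Rank(Foo(close, 5)` yields both an unbalanced-parentheses error and an unknown-identifier error that can be fed back to the model.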
Computes IC (Information Coefficient), RankIC, ICIR, and annualized returns for each factor on real market data from CSI300 and SP500 (2020–2025).
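The core metrics are standard: daily IC is the cross-sectional Pearson correlation between factor values and next-period returns, RankIC the Spearman analogue, and ICIR the mean daily IC over its standard deviation. A minimal pandas sketch (variable names are illustrative, not FFO's API):

```python
import numpy as np
import pandas as pd

def daily_ic(factor: pd.DataFrame, fwd_ret: pd.DataFrame) -> pd.Series:
    # Pearson correlation across stocks, one value per date
    return factor.corrwith(fwd_ret, axis=1)

def daily_rank_ic(factor: pd.DataFrame, fwd_ret: pd.DataFrame) -> pd.Series:
    # Spearman = Pearson on cross-sectional ranks
    return factor.rank(axis=1).corrwith(fwd_ret.rank(axis=1), axis=1)

# toy data: factor values perfectly inversely ordered with forward returns
dates = pd.date_range("2024-01-01", periods=4)
factor = pd.DataFrame([[3, 2, 1]] * 4, index=dates, columns=list("ABC"))
fwd_ret = pd.DataFrame([[0.01, 0.02, 0.03]] * 4, index=dates, columns=list("ABC"))

ic = daily_ic(factor, fwd_ret).mean()            # ≈ -1.0
rank_ic = daily_rank_ic(factor, fwd_ret).mean()  # ≈ -1.0
```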
Provides structured backtesting feedback — including metric values, error traces, and comparative rankings — that LLMs use for closed-loop factor refinement.
Evaluates factors on both CSI300 (China A-share) and SP500 (US equities) to measure cross-market generalization and robustness.
Records per-call token usage and API costs throughout the evaluation pipeline, enabling cost-aware comparison across models and search strategies.
Exposes a unified API compatible with all three searching paradigms (CoE, ToT, EA), enabling plug-and-play swapping of LLM backends and search strategies.
This task asks: can LLMs act as zero-shot factor quality judges without executing any backtesting? We evaluate two levels of judgment — coarse-grained ranking and fine-grained scoring — and construct a ground-truth labeled dataset via Qlib backtesting to enable rigorous quantitative evaluation.
Given a pool of K candidate alpha factors (formulas + metadata), the LLM must select and rank the top factors by predicted IC/RankIC performance. We measure NDCG@K and Precision@K against ground-truth backtesting rankings.
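The two metrics follow their standard definitions; the sketch below uses binary relevance (in or out of the ground-truth top-K) for NDCG. The exact relevance weighting used by the benchmark may differ, and the factor names are made up for the demo.

```python
import math

def precision_at_k(pred: list, truth_topk: set, k: int) -> float:
    # fraction of the predicted top-k that appears in the ground-truth top-K
    return len(set(pred[:k]) & truth_topk) / k

def ndcg_at_k(pred: list, truth_topk: set, k: int) -> float:
    # DCG with binary gains, normalized by the ideal ordering's DCG
    dcg = sum(1 / math.log2(i + 2) for i, f in enumerate(pred[:k]) if f in truth_topk)
    idcg = sum(1 / math.log2(i + 2) for i in range(min(k, len(truth_topk))))
    return dcg / idcg

truth = {"f1", "f2", "f3"}             # ground-truth top-3 by backtested IC
pred = ["f1", "f9", "f2", "f7", "f3"]  # LLM's predicted ranking

p = precision_at_k(pred, truth, 3)  # 2 of top-3 correct → 0.666...
n = ndcg_at_k(pred, truth, 3)
```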
The LLM receives a single alpha factor formula and must predict quantitative quality scores across multiple dimensions. We evaluate prediction error (MAE, RMSE) against Qlib-computed ground truth values.
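The error metrics are the standard ones, computed per quality dimension between the LLM's predicted scores and the Qlib-computed labels. A minimal sketch with toy predicted-vs-backtested IC values (the numbers are made up for illustration):

```python
import math

def mae(pred: list, true: list) -> float:
    # mean absolute error between predicted and ground-truth scores
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def rmse(pred: list, true: list) -> float:
    # root mean squared error; penalizes large misses more heavily
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

# toy: LLM-predicted vs backtested IC for four factors
pred = [0.05, 0.02, -0.01, 0.03]
true = [0.04, 0.01, 0.00, 0.06]
```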
Seed factors from Alpha158 + LLM-generated factors from the generation task. Deduplicated and syntax-validated via FFO.
Each factor is executed on CSI300 and SP500 (2020–2025) via Qlib to produce ground truth IC, RankIC, and auxiliary metrics.
Ranking queries (factor pools with GT rankings) and scoring queries (factor + score label pairs) are structured into 800 evaluation instances.
We provide three standardized baseline search agents that integrate LLMs with the FFO execution engine for iterative alpha factor discovery. Each agent represents a distinct algorithmic paradigm, enabling direct comparison of search efficiency, quality, and cost.
Starting from a seed factor, the LLM iteratively refines the formula based on structured backtesting feedback from the previous round. Each iteration's IC/RankIC result, along with the formula and error trace, is appended to a growing context chain that guides subsequent generations.
Multiple factor variants are generated at each tree node. Branches are evaluated via FFO; poor-performing branches are pruned, while promising ones are expanded further. This enables broader coverage of the factor space than a linear chain.
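One simple instantiation of this prune-and-expand loop is a beam search, sketched below. `expand` stands in for the LLM generating variants at a node and `evaluate` for FFO's score; both names are illustrative assumptions, not the benchmark's API.

```python
def tree_search(seed: str, expand, evaluate, beam: int = 2, depth: int = 3) -> str:
    """Expand a frontier of formulas, keeping only the top `beam` branches."""
    frontier = [seed]
    best = (evaluate(seed), seed)
    for _ in range(depth):
        # LLM proposes variants at every surviving node
        candidates = [c for node in frontier for c in expand(node)]
        scored = sorted(((evaluate(c), c) for c in candidates), reverse=True)
        frontier = [c for _, c in scored[:beam]]   # prune poor-performing branches
        if scored and scored[0] > best:
            best = scored[0]
    return best[1]
```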
A population of alpha factor candidates evolves through two LLM-driven genetic operators: Mutation (modifying a single factor's formula structure) and Crossover (combining sub-expressions from two parent factors). Selection pressure based on IC/RankIC fitness drives the population toward higher-performing formulas across generations. Our experiments show this paradigm achieves the best search efficiency.
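The population loop can be sketched as follows. Here `fitness`, `mutate`, and `crossover` are stand-ins for the FFO score and the two LLM-driven operators; all names are illustrative assumptions rather than the benchmark's implementation.

```python
import random

def evolve(population, fitness, mutate, crossover, generations: int = 5, rng=None):
    """Evolve a population of formulas under fitness-based selection pressure."""
    rng = rng or random.Random(0)
    for _ in range(generations):
        # selection: keep the fitter half of the population as parents
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: max(2, len(scored) // 2)]
        # variation: LLM-driven crossover of two parents, then mutation
        children = []
        while len(children) < len(population) - len(parents):
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = parents + children
    return max(population, key=fitness)
```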
Comprehensive evaluation across all tasks reveals distinct patterns in LLM capabilities.
@inproceedings{
luo2026alphabench,
title={AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining},
author={Haochen Luo and Ho Tin Ko and Jiandong Chen and David Sun and Yuan Zhang and Chen Liu},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=d97Q8r7ZKZ}
}