ICLR 2026 Conference Paper

AlphaBench

The First Comprehensive Evaluation Framework for
Formulaic Factor Mining by Large Language Models

A unified benchmark and toolchain for systematically evaluating LLMs in formulaic alpha factor mining, covering generation, evaluation, and searching tasks with executable financial DSLs.

3 Critical Tasks
Problem Sets

Direct generation, quality evaluation, and iterative searching — covering the full alpha mining workflow

Factor Mining
Toolchain

End-to-end toolchain with 1,857 instructions, FFO execution engine, and Qlib-based backtesting

Baseline
Searching Agents

Three standardized search paradigms: Chain-of-Experience, Tree-of-Thought, and Evolutionary Algorithms

Overview

LLM-powered Alpha Factor Mining

Alpha factors are mathematical expressions derived from market data (price, volume, fundamentals) that predict future stock returns — the core building block of quantitative investment strategies. Mining effective alpha factors has traditionally required deep domain expertise and exhaustive manual search.

Large Language Models offer a promising paradigm shift: they can encode financial knowledge, generate factor candidates from natural language descriptions, and iteratively refine formulas with execution feedback. AlphaBench is the first systematic benchmark to rigorously evaluate LLMs across the entire formulaic alpha mining workflow — from generation to evaluation to iterative searching.


Figure: Overall benchmark — AlphaBench evaluates LLMs across the full formulaic alpha mining workflow using real financial data and task-aligned quantitative metrics.


Figure: Task description — AlphaBench decomposes alpha mining into three core tasks: factor generation, factor evaluation, and iterative factor searching.

Critical Tasks We Care About

01 Direct Factor Generation

Text2Alpha Directional Mining

Can LLMs translate financial concepts into executable alpha factor formulas? We probe two scenarios: (1) Text2Alpha — converting a textual description (e.g., "5-day momentum adjusted for volatility") into a syntactically valid formula; (2) Directional Mining — generating a diverse set of factors under a given semantic direction (e.g., "volatility-based"). We measure reliability, output stability, and semantic alignment.

02 Can LLMs Go Beyond Backtesting?

FactorEval

Running a full backtesting simulation is expensive. We ask: can LLMs act as zero-shot judges and predict factor quality without executing the formula? This includes ranking a candidate pool to select top-K factors, and scoring each factor on multiple quality dimensions — IC, RankIC, robustness, win rate, and skewness. Our findings reveal this remains the weakest capability of current LLMs.

03 Adapting LLMs to Mining Algorithms

CoE ToT EA

Different searching algorithms make different demands on LLMs. We evaluate LLMs as components within three paradigms: Chain-of-Experience (sequential refinement), Tree-of-Thought (branching exploration), and Evolutionary Algorithms (population-based mutation & crossover). We provide standardized baseline implementations and compare search quality, cost efficiency, and factor diversity across all three approaches.

Framework

01 Direct Generation Instruction Set

The generation instruction set probes whether LLMs can translate financial knowledge into executable formulaic alpha factor expressions. We design two complementary subtasks to cover different generation scenarios — from concept-grounded translation to open-ended exploration.

Task Design
Text2Alpha

The model receives a natural language description of a financial concept and must produce a syntactically valid, executable alpha formula. Inputs range from simple directional ideas (e.g., "price momentum") to compound multi-operator expressions. Ground truth formulas are curated from expert-designed Alpha158 factors with aligned textual annotations.

Reliability Stability Semantic Accuracy
Directional Mining

Given a semantic direction (e.g., "volatility-based factors"), the model must generate a diverse set of K valid alpha formulas that collectively explore that concept space. This tests the model's ability to avoid degenerate repetition and produce functionally distinct candidates under a single thematic prompt.

Diversity Coverage Validity Rate
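The diversity and validity criteria above can be sketched as simple set-based measures. This is an illustrative sketch, not the benchmark's exact metric definitions; `is_valid` stands in for the FFO syntax checker.

```python
# Illustrative Directional Mining output checks. The metric definitions
# here are assumptions for demonstration, not AlphaBench's exact ones.
def validity_rate(formulas, is_valid):
    """Share of generated formulas that pass syntax validation."""
    return sum(map(is_valid, formulas)) / len(formulas)

def uniqueness(formulas):
    """Share of formulas that remain distinct after whitespace normalization,
    catching the degenerate-repetition failure mode."""
    normalized = {"".join(f.split()) for f in formulas}
    return len(normalized) / len(formulas)
```

A stricter diversity measure would compare factor return series rather than formula strings, since syntactically different formulas can be functionally identical.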
Data Sample
Text2Alpha — Input
"Generate an alpha factor that captures short-term price reversal over a 5-day window, normalized by the stock's recent volatility."
Expected Output
(-1 * Rank(
  (close / Mean(close, 5) - 1) /
  Std(close, 20)
))
Directional Mining — Prompt
"Generate 5 diverse alpha factors based on the theme: market liquidity and trading activity."
857 Text2Alpha Instructions
200 Directional Prompts
158 Base Factors (Alpha158)
02 Formulaic Factor Optimization (FFO)

FFO is our core factor execution and evaluation engine, built on top of the Qlib backtesting framework. It serves as the closed-loop feedback mechanism that bridges LLM outputs with real market performance signals.

Syntax Validation

Automatically parses and checks whether a generated formula is syntactically valid and executable within the Qlib DSL. Returns structured error messages for repair loops.
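The kind of check this stage performs can be sketched as a lightweight tokenizer over a small assumed subset of the Qlib-style expression DSL. The real FFO grammar is larger; the operator and field whitelists below are illustrative assumptions.

```python
import re

# Assumed subset of the Qlib-style expression DSL; the actual FFO
# grammar covers many more operators -- this is a sketch only.
KNOWN_OPS = {"Rank", "Mean", "Std", "Ref", "Corr", "Abs", "Log"}
KNOWN_FIELDS = {"open", "high", "low", "close", "volume", "vwap"}
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|-?\d+\.?\d*|[(),+\-*/])")

def validate_formula(expr: str):
    """Return (ok, message): a lightweight syntax check for a factor formula."""
    pos, depth = 0, 0
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            return False, f"unexpected character at position {pos}"
        tok = m.group(1)
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return False, "unbalanced closing parenthesis"
        elif tok[0].isalpha() and tok not in KNOWN_OPS | KNOWN_FIELDS:
            return False, f"unknown identifier: {tok}"
        pos = m.end()
    if depth != 0:
        return False, "unbalanced opening parenthesis"
    return True, "ok"
```

Returning a structured message rather than a bare boolean is what makes the repair loop possible: the LLM can be re-prompted with the specific error.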

Performance Metrics

Computes IC (Information Coefficient), RankIC, ICIR, and annualized returns for each factor on real market data from CSI300 and SP500 (2020–2025).
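The core metrics reduce to cross-sectional correlations between factor values and next-period returns, averaged over dates. A minimal sketch (function names are ours, not the FFO API):

```python
import numpy as np
from scipy.stats import spearmanr

def daily_ic(factor: np.ndarray, fwd_ret: np.ndarray, rank: bool = False) -> np.ndarray:
    """Per-date cross-sectional IC between factor values and forward returns.

    factor, fwd_ret: arrays of shape (n_dates, n_stocks).
    rank=True gives RankIC (Spearman); otherwise Pearson IC.
    """
    ics = []
    for f, r in zip(factor, fwd_ret):
        if rank:
            ics.append(spearmanr(f, r)[0])       # Spearman rank correlation
        else:
            ics.append(np.corrcoef(f, r)[0, 1])  # Pearson correlation
    return np.array(ics)

def icir(ic_series: np.ndarray) -> float:
    """ICIR: mean of the daily IC series divided by its standard deviation."""
    return float(ic_series.mean() / ic_series.std(ddof=1))
```

In practice the FFO engine computes these on Qlib-aligned panels with NaN handling for suspended stocks; the sketch assumes clean, fully aligned arrays.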

Iterative Feedback

Provides structured backtesting feedback — including metric values, error traces, and comparative rankings — that LLMs use for closed-loop factor refinement.

Multi-Market Support

Evaluates factors on both CSI300 (China A-share) and SP500 (US equities) to measure cross-market generalization and robustness.

Cost Tracking

Records per-call token usage and API costs throughout the evaluation pipeline, enabling cost-aware comparison across models and search strategies.

Standardized Interface

Exposes a unified API compatible with all three searching paradigms (CoE, ToT, EA), enabling plug-and-play swapping of LLM backends and search strategies.

FFO is publicly available as part of the AlphaBench codebase. It can be used independently as a factor backtesting and evaluation utility for quantitative research.
03 Evaluation Suite for Alpha Factors

This task asks: can LLMs act as zero-shot factor quality judges without executing any backtesting? We evaluate two levels of judgment — coarse-grained ranking and fine-grained scoring — and construct a ground-truth labeled dataset via Qlib backtesting to enable rigorous quantitative evaluation.

Ranking Task

Given a pool of K candidate alpha factors (formulas + metadata), the LLM must select and rank the top factors by predicted IC/RankIC performance. We measure NDCG@K and Precision@K against ground-truth backtesting rankings.

Pool size: 10–50 candidates per query
Metric: NDCG@3, NDCG@5, Precision@K
Baseline: Random ranking, IC-sorted oracle
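The ranking metrics above are standard; a minimal sketch of how a predicted ranking is scored against ground-truth backtesting ICs (function names are ours):

```python
import numpy as np

def ndcg_at_k(true_scores, pred_order, k):
    """NDCG@k: pred_order lists candidate indices as ranked by the LLM;
    true_scores are the ground-truth ICs from backtesting."""
    true_scores = np.asarray(true_scores, dtype=float)
    gains = true_scores[np.asarray(pred_order[:k])]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))   # positions 1..k
    dcg = float((gains * discounts).sum())
    ideal = np.sort(true_scores)[::-1][:k]           # best possible ordering
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

def precision_at_k(true_scores, pred_order, k):
    """Fraction of the LLM's top-k that appear in the true top-k."""
    true_topk = set(np.argsort(true_scores)[::-1][:k].tolist())
    return len(true_topk & set(pred_order[:k])) / k
```

Note this linear-gain NDCG assumes non-negative ICs; negative-IC candidates would need a shifted or clipped gain.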
Scoring Task

The LLM receives a single alpha factor formula and must predict quantitative quality scores across multiple dimensions. We evaluate prediction error (MAE, RMSE) against Qlib-computed ground truth values.

Dimensions: IC, RankIC, Win Rate, Skewness, Signal Type
Metric: MAE, RMSE per dimension
Finding: LLMs struggle most with quantitative score prediction
Data Construction Pipeline
1. Factor Pool Construction

Seed factors from Alpha158 + LLM-generated factors from the generation task. Deduplicated and syntax-validated via FFO.

2. Backtesting & Labeling

Each factor is executed on CSI300 and SP500 (2020–2025) via Qlib to produce ground truth IC, RankIC, and auxiliary metrics.

3. Query Construction

Ranking queries (factor pools with GT rankings) and scoring queries (factor + score label pairs) are structured into 800 evaluation instances.

04 Searching Agents

We provide three standardized baseline searching agents that integrate LLMs with the FFO execution engine for iterative alpha factor discovery. Each agent represents a distinct algorithmic paradigm, enabling direct comparison of search efficiency, quality, and cost.

Chain-of-Experience (CoE)
Sequential Refinement

Starting from a seed factor, the LLM iteratively refines the formula based on structured backtesting feedback from the previous round. Each iteration's IC/RankIC result, along with the formula and error trace, is appended to a growing context chain that guides subsequent generations.

Chain-of-Experience diagram
Strength: Simple, low-cost, interpretable refinement trajectory
Limitation: Prone to local optima; no parallel exploration
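The CoE loop can be sketched as follows; `llm_refine` and `backtest_ic` are hypothetical stand-ins for the LLM call and the FFO engine, not the AlphaBench API.

```python
# Sketch of the Chain-of-Experience loop with stubbed LLM and backtester.
def chain_of_experience(seed_formula, llm_refine, backtest_ic, rounds=10):
    """Sequentially refine a factor, feeding the full history back each round."""
    history = []                          # (formula, ic) pairs: the context chain
    best_formula, best_ic = seed_formula, backtest_ic(seed_formula)
    history.append((best_formula, best_ic))
    for _ in range(rounds):
        candidate = llm_refine(history)   # LLM sees all past formulas + feedback
        ic = backtest_ic(candidate)
        history.append((candidate, ic))
        if ic > best_ic:                  # keep the best-so-far, not the last
            best_formula, best_ic = candidate, ic
    return best_formula, best_ic
```

Because the chain only ever extends one trajectory, a bad early refinement biases everything downstream, which is the local-optimum failure noted above.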
Tree-of-Thought (ToT)
Branching Exploration

Multiple factor variants are generated at each tree node. Branches are evaluated via FFO and poor-performing ones are pruned while promising branches are expanded further. This enables broader coverage of the factor space compared to linear chains.

Tree-of-Thought diagram
Strength: Broader search coverage; avoids single-path collapse
Limitation: Higher token cost; tree management overhead
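Generate-evaluate-prune maps naturally onto beam search over formulas. A sketch under stubbed operators (`llm_branch` and `backtest_ic` are hypothetical, not the AlphaBench API):

```python
# Sketch of Tree-of-Thought search as beam search over factor formulas.
def tree_of_thought(seed, llm_branch, backtest_ic, depth=3, branch=3, beam=2):
    frontier = [(backtest_ic(seed), seed)]
    best = frontier[0]
    for _ in range(depth):
        children = []
        for _, formula in frontier:
            for child in llm_branch(formula, branch):   # expand each node
                children.append((backtest_ic(child), child))
        children.sort(key=lambda x: x[0], reverse=True)
        frontier = children[:beam]                      # prune weak branches
        if frontier and frontier[0][0] > best[0]:
            best = frontier[0]
    return best[1], best[0]
```

The token-cost overhead is visible in the sketch: each level issues `beam * branch` LLM calls and backtests, versus one per round for CoE.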
Evolutionary Algorithm (EA)
Population-Based Optimization

A population of alpha factor candidates evolves through two LLM-driven genetic operators: Mutation (modifying a single factor's formula structure) and Crossover (combining sub-expressions from two parent factors). Selection pressure based on IC/RankIC fitness drives the population toward higher-performing formulas across generations. Our experiments show this paradigm achieves the best search efficiency.

Evolutionary Algorithm diagram
Strength: Best IC improvement; diverse population; escapes local optima
Finding: Outperforms CoE and ToT on both CSI300 and SP500
All three agents share a unified searching platform with a consistent evaluation harness, enabling reproducible benchmarking of LLM searching capabilities across models and markets. The platform supports configurable search budgets (rounds, tokens, parallelism) and outputs standardized result logs for analysis.

Takeaway


Model Performance Comparison

Comprehensive evaluation across all tasks reveals distinct patterns in LLM capabilities.

Generation & Evaluation — measuring reliability, stability, accuracy, and ranking precision
Searching Performance — IC improvement, cost efficiency, and exploration diversity
Key Insight — mid-sized commercial models offer the best cost–performance tradeoff

What LLMs Do Well

  • Reliable factor generation (high syntactic validity)
  • Effective exploration in searching tasks
  • Strong performance as generators

Where LLMs Fail

  • Poor zero-shot factor evaluation
  • Weak semantic grounding without execution context
  • Limited gains from Chain-of-Thought in complex tasks

Practical Insights

  • Mid-sized commercial models offer best cost–performance tradeoff
  • Evolutionary search outperforms single-path refinement
  • Vanilla prompts often outperform complex CoT designs

How to Cite

AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining
(ICLR 2026)
BibTeX
@inproceedings{luo2026alphabench,
  title={AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining},
  author={Haochen Luo and Ho Tin Ko and Jiandong Chen and David Sun and Yuan Zhang and Chen Liu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=d97Q8r7ZKZ}
}