ICLR 2026 Conference Paper

AlphaBench

The First Comprehensive Evaluation Framework for
Formulaic Factor Mining by Large Language Models

A unified benchmark and toolchain for systematically evaluating LLMs in formulaic alpha factor mining, covering generation, evaluation, and searching tasks with executable financial DSLs.

  • 13+ LLMs Evaluated
  • 3 Core Tasks
  • 1857 Instructions
  • 3 Search Methods
  • Qlib Backtesting Engine

Overview

[Figure: LLM workflow for alpha factor generation and refinement]

LLM-based Alpha Factor Mining Pipeline

This workflow illustrates how LLMs interact with task prompts and the Qlib backtesting engine to generate and refine formulaic alpha factors.

  • Prompt Engineering: Task-specific prompts guide LLMs to generate factor expressions
  • Iterative Feedback: Performance metrics (IC, RankIC) guide factor improvement, as sketched below
  • Multi-Phase Workflow: Generation → Evaluation → Searching for optimal factors
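
For concreteness, here is a minimal sketch (not the paper's code) of how such IC/RankIC feedback can be computed: IC is the mean cross-sectional Pearson correlation between factor values and next-period returns, and RankIC replaces Pearson with Spearman. The column names are illustrative assumptions.

import pandas as pd
from scipy.stats import pearsonr, spearmanr

def daily_ic(panel: pd.DataFrame, method: str = "pearson") -> pd.Series:
    """Per-date cross-sectional correlation between factor values and
    next-period returns. `panel` is assumed to have a (date, instrument)
    MultiIndex and columns 'factor' and 'fwd_return' (illustrative names)."""
    corr = spearmanr if method == "spearman" else pearsonr
    return panel.groupby(level="date").apply(
        lambda g: corr(g["factor"], g["fwd_return"])[0]
    )

# IC: mean daily Pearson correlation; RankIC: mean daily Spearman correlation.
# ic, rank_ic = daily_ic(panel).mean(), daily_ic(panel, "spearman").mean()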

Core Problem

Recent work shows that LLMs can generate alpha factors, but their true capabilities remain unclear:

  • Can LLMs reliably generate valid, executable factor formulas?
  • Can they evaluate factor quality without backtesting?
  • Can they efficiently search the combinatorial factor space?
There is no standardized benchmark to answer these questions.

Our Solution

AlphaBench is the first systematic benchmark designed to evaluate LLMs across the entire formulaic alpha mining workflow, using:

  • Executable factor expressions
  • Real financial data
  • Task-aligned quantitative metrics
AlphaBench decomposes alpha mining into generation, evaluation, and searching, allowing fine-grained analysis of model behavior, robustness, cost, and failure modes.

Key Features

End-to-End Coverage

Covers the full lifecycle of alpha factors: generation → evaluation → searching

Searching Baselines

Standardized searching baselines for LLM-driven factor discovery, including chain-based refinement, tree-based exploration, and evolutionary algorithms

Task-Specific Metrics

Reliability, stability, accuracy, ranking precision, scoring error, search improvement, and cost

Multiple Search Paradigms

Supports Chain-of-Experience, Tree-of-Thought, and Evolutionary Algorithms as comparable baselines

Qlib-based Evaluation

Built on Qlib backtesting framework, enabling iterative factor search and automated evaluation with real market data

Cost-Aware Analysis

Explicit token usage and efficiency evaluation for real-world deployment

Framework

Generation & Evaluation Tasks

AlphaBench evaluates LLMs through structured tasks that mirror real-world factor mining workflows.

  • Text2Alpha: Translate financial concepts into executable formulas
  • Directional Mining: Generate diverse factors under semantic directions
  • Ranking & Scoring: Judge factor quality without full backtesting
Task 1

Factor Generation

LLMs transform natural language descriptions into candidate alpha factors

Subtasks

Text2Alpha

From abstract financial concepts (e.g., momentum, mean reversion) to executable formulas

Directional Mining

Generate multiple diverse factors under a given semantic direction (e.g., volatility, trend)

[Diagram: natural-language concept "momentum" → LLM → alpha factor, e.g., Rank($close) / Mean($close, 20)]

Evaluation Metrics

  • Reliability (syntactic validity)
  • Stability (output consistency)
  • Accuracy (semantic alignment)
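
As an illustration, the first two metrics could be computed along these lines (a sketch under assumed definitions; `is_valid` stands in for AlphaBench's actual DSL validator):

from collections import Counter

def reliability(outputs: list[str], is_valid) -> float:
    """Fraction of generated expressions that pass the DSL check,
    e.g. 'Rank($close) / Mean($close, 20)'. `is_valid` is a
    placeholder for the benchmark's real parser/executor."""
    return sum(is_valid(o) for o in outputs) / len(outputs)

def stability(outputs: list[str]) -> float:
    """Share of repeated generations for one instruction that match
    the most frequent output (one plausible consistency measure,
    not necessarily the paper's exact definition)."""
    return Counter(outputs).most_common(1)[0][1] / len(outputs)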
Task 2

Factor Evaluation

LLMs act as intelligent evaluators or "judges", predicting factor quality without full backtesting

Subtasks

Ranking

Select top-K factors from a candidate pool

Scoring

Predict factor quality (signal type, performance, robustness, win rate, skewness)

[Diagram: factor pool → LLM judge → ranked top-K list and numeric quality scores]
Zero-shot factor evaluation remains the weakest capability of current LLMs
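
Two plausible metrics for this task, sketched below: top-K overlap between the LLM's ranking and the backtest ranking, and absolute error between predicted and realized scores (function and argument names are illustrative, not the paper's interface):

def topk_precision(llm_rank: list[str], backtest_rank: list[str], k: int = 10) -> float:
    """Overlap between the LLM's top-K factors and the top-K by realized IC."""
    return len(set(llm_rank[:k]) & set(backtest_rank[:k])) / k

def scoring_mae(predicted: dict[str, float], realized: dict[str, float]) -> float:
    """Mean absolute error between LLM-predicted and backtested factor scores."""
    return sum(abs(predicted[f] - realized[f]) for f in predicted) / len(predicted)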
Task 3

Factor Searching

Evaluate whether LLMs can iteratively improve factors under different search paradigms

Supported Search Methods

Chain-of-Experience (CoE)
Tree-of-Thought (ToT)
Evolutionary Algorithms (Mutation & Crossover)

Metrics

  • Search quality (performance improvement)
  • Search cost (tokens, rounds)
  • Reliability and diversity

Search Paradigms Illustrated

Chain-of-Experience (CoE)

Sequential refinement: Starting from a seed factor, the LLM iteratively improves the formula based on execution feedback.
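
A sequential refinement loop might look like this sketch, where `llm_refine` and `backtest_ic` are hypothetical stand-ins for the LLM call and the Qlib backtest:

def chain_of_experience(seed: str, llm_refine, backtest_ic, rounds: int = 10) -> str:
    """Refine one factor formula round by round, keeping the best so far.
    Each round shows the LLM the full history of (formula, IC) pairs."""
    best, best_ic = seed, backtest_ic(seed)
    history = [(best, best_ic)]
    for _ in range(rounds):
        candidate = llm_refine(best, history)  # LLM proposes an improved formula
        ic = backtest_ic(candidate)
        history.append((candidate, ic))
        if ic > best_ic:
            best, best_ic = candidate, ic
    return best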

Tree-of-Thought (ToT)

Branching exploration: Multiple factor variants are generated at each node, with poor branches pruned based on performance.

Evolutionary Algorithm (EA)

Population-based optimization: Factors evolve through mutation and crossover. Selection pressure based on IC/RankIC fitness drives the population toward higher-performing formulas. In AlphaBench's experiments, this paradigm achieves the best search efficiency of the three.
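
A minimal population loop under these assumptions; `mutate`, `crossover`, and `backtest_ic` are hypothetical LLM-backed operators, and in practice fitness values would be cached rather than recomputed:

import random

def evolutionary_search(seed_pool, mutate, crossover, backtest_ic,
                        generations: int = 10, pop_size: int = 20):
    """Evolve a population of factor formulas with backtest IC as fitness."""
    population = list(seed_pool)[:pop_size]
    for _ in range(generations):
        ranked = sorted(population, key=backtest_ic, reverse=True)
        parents = ranked[: pop_size // 2]            # selection pressure
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = crossover(a, b)                  # LLM recombines two formulas
            if random.random() < 0.3:                # illustrative mutation rate
                child = mutate(child)                # LLM perturbs the formula
            children.append(child)
        population = parents + children
    return max(population, key=backtest_ic)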

Experimental Setup

Markets

  • CSI300 (China A-share market)
  • SP500 (US market)

Initial Factor Pool

  • Alpha158 (Qlib)

Time Span

  • 2020 – 2025

Backtesting Engine

  • Qlib
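
For orientation, the CSI300 side of this setup maps to Qlib roughly as follows. This is a sketch against Qlib's public API with illustrative paths and dates, not necessarily the benchmark's exact configuration:

import qlib
from qlib.config import REG_CN
from qlib.contrib.data.handler import Alpha158

# Illustrative: China A-share data with Alpha158 as the initial factor pool.
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data", region=REG_CN)

handler = Alpha158(
    instruments="csi300",
    start_time="2020-01-01",
    end_time="2025-12-31",
)
features = handler.fetch(col_set="feature")  # Alpha158 factor matrix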

Key Findings

Model Performance Comparison

[Figure: radar charts comparing model performance across all tasks]

Comprehensive evaluation across all tasks reveals distinct patterns in LLM capabilities.

  • Generation & Evaluation: Measuring reliability, stability, accuracy, and ranking precision
  • Searching Performance: IC improvement, cost efficiency, exploration diversity
  • Key Insight: Mid-sized commercial models offer the best cost–performance tradeoff

What LLMs Do Well

  • Reliable factor generation (high syntactic validity)
  • Effective exploration in searching tasks
  • Strong performance as generators

Where LLMs Fail

  • Poor zero-shot factor evaluation
  • Weak semantic grounding without execution context
  • Limited gains from Chain-of-Thought in complex tasks

Practical Insights

  • Mid-sized commercial models offer the best cost–performance tradeoff
  • Evolutionary search outperforms single-path refinement
  • Vanilla prompts often outperform complex CoT designs

Beyond Quant Finance

AlphaBench is a portable evaluation template for any executable DSL reasoning task, including:

  • Symbolic regression
  • Feature engineering
  • Constraint solving
  • Planning and optimization DSLs
  • A/B testing metric design

Scalability

Large instruction sets
Multiple models
Multiple markets
Token-level cost analysis

Designed for research benchmarking and real-world system design.

How to Cite

AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining
(ICLR 2026)
BibTeX
@inproceedings{luo2026alphabench,
  title={AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining},
  author={Haochen Luo and Ho Tin Ko and Jiandong Chen and David Sun and Yuan Zhang and Chen Liu},
  booktitle={International Conference on Learning Representations},
  year={2026}
}