ICLR 2026 Conference Paper

AlphaBench

The First Comprehensive Evaluation Framework for
Formulaic Factor Mining by Large Language Models

A unified benchmark and toolchain for systematically evaluating LLMs in formulaic alpha factor mining, covering generation, evaluation, and searching tasks with executable financial DSLs.

  • 13+ LLMs Evaluated
  • 3 Core Tasks
  • 1857 Instructions
  • 3 Search Methods
  • Qlib Backtesting Engine

Overview

[Figure: LLM workflow for alpha factor generation and refinement]

LLM-based Alpha Factor Mining Pipeline

This workflow illustrates how LLMs interact with task prompts and the Qlib backtesting engine to generate and refine formulaic alpha factors.

  • Prompt Engineering: Task-specific prompts guide LLMs to generate factor expressions
  • Iterative Feedback: Performance metrics (IC, RankIC) guide factor improvement, as sketched below
  • Multi-Phase Workflow: Generation → Evaluation → Searching for optimal factors
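
For concreteness, here is a minimal sketch (not the paper's code) of how such IC/RankIC feedback can be computed: IC is the mean cross-sectional Pearson correlation between factor values and next-period returns, and RankIC replaces Pearson with Spearman. The column names are illustrative assumptions.

import pandas as pd
from scipy.stats import pearsonr, spearmanr

def daily_ic(panel: pd.DataFrame, method: str = "pearson") -> pd.Series:
    """Per-date cross-sectional correlation between factor values and
    next-period returns. `panel` is assumed to have a (date, instrument)
    MultiIndex and columns 'factor' and 'fwd_return' (illustrative names)."""
    corr = spearmanr if method == "spearman" else pearsonr
    return panel.groupby(level="date").apply(
        lambda g: corr(g["factor"], g["fwd_return"])[0]
    )

# IC: mean daily Pearson correlation; RankIC: mean daily Spearman correlation.
# ic, rank_ic = daily_ic(panel).mean(), daily_ic(panel, "spearman").mean()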

Core Problem

Recent work shows that LLMs can generate alpha factors, but their true capabilities remain unclear:

  • Can LLMs reliably generate valid, executable factor formulas?
  • Can they evaluate factor quality without backtesting?
  • Can they efficiently search the combinatorial factor space?
There is no standardized benchmark to answer these questions.

Our Solution

AlphaBench is the first systematic benchmark designed to evaluate LLMs across the entire formulaic alpha mining workflow, using:

  • Executable factor expressions
  • Real financial data
  • Task-aligned quantitative metrics
AlphaBench decomposes alpha mining into generation, evaluation, and searching, allowing fine-grained analysis of model behavior, robustness, cost, and failure modes.

Key Features

End-to-End Coverage

Covers the full lifecycle of alpha factors: generation → evaluation → searching

Searching Baselines

Standardized searching baselines for LLM-driven factor discovery, including chain-based refinement, tree-based exploration, and evolutionary algorithms

Task-Specific Metrics

Reliability, stability, accuracy, ranking precision, scoring error, search improvement, and cost

Multiple Search Paradigms

Supports Chain-of-Experience, Tree-of-Thought, and Evolutionary Algorithms as comparable baselines

Qlib-based Evaluation

Built on Qlib backtesting framework, enabling iterative factor search and automated evaluation with real market data

Cost-Aware Analysis

Explicit token usage and efficiency evaluation for real-world deployment

Framework

Generation & Evaluation Tasks

AlphaBench evaluates LLMs through structured tasks that mirror real-world factor mining workflows.

  • Text2Alpha: Translate financial concepts into executable formulas
  • Directional Mining: Generate diverse factors under semantic directions
  • Ranking & Scoring: Judge factor quality without full backtesting
Task 1

Factor Generation

LLMs transform natural language descriptions into candidate alpha factors

Subtasks

Text2Alpha

From abstract financial concepts (e.g., momentum, mean reversion) to executable formulas

Directional Mining

Generate multiple diverse factors under a given semantic direction (e.g., volatility, trend)

[Diagram: natural-language concept "momentum" → LLM → alpha factor, e.g., Rank($close) / Mean($close, 20)]

Evaluation Metrics

  • Reliability (syntactic validity)
  • Stability (output consistency)
  • Accuracy (semantic alignment)
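
As an illustration, the first two metrics could be computed along these lines (a sketch under assumed definitions; `is_valid` stands in for AlphaBench's actual DSL validator):

from collections import Counter

def reliability(outputs: list[str], is_valid) -> float:
    """Fraction of generated expressions that pass the DSL check,
    e.g. 'Rank($close) / Mean($close, 20)'. `is_valid` is a
    placeholder for the benchmark's real parser/executor."""
    return sum(is_valid(o) for o in outputs) / len(outputs)

def stability(outputs: list[str]) -> float:
    """Share of repeated generations for one instruction that match
    the most frequent output (one plausible consistency measure,
    not necessarily the paper's exact definition)."""
    return Counter(outputs).most_common(1)[0][1] / len(outputs)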
Task 2

Factor Evaluation

LLMs act as intelligent evaluators or "judges", predicting factor quality without full backtesting

Subtasks

Ranking

Select top-K factors from a candidate pool

Scoring

Predict factor quality (signal type, performance, robustness, win rate, skewness)

[Diagram: factor pool → LLM judge → ranked top-K list and numeric quality scores]
Zero-shot factor evaluation remains the weakest capability of current LLMs
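
Two plausible metrics for this task, sketched below: top-K overlap between the LLM's ranking and the backtest ranking, and absolute error between predicted and realized scores (function and argument names are illustrative, not the paper's interface):

def topk_precision(llm_rank: list[str], backtest_rank: list[str], k: int = 10) -> float:
    """Overlap between the LLM's top-K factors and the top-K by realized IC."""
    return len(set(llm_rank[:k]) & set(backtest_rank[:k])) / k

def scoring_mae(predicted: dict[str, float], realized: dict[str, float]) -> float:
    """Mean absolute error between LLM-predicted and backtested factor scores."""
    return sum(abs(predicted[f] - realized[f]) for f in predicted) / len(predicted)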
Task 3

Factor Searching

Evaluate whether LLMs can iteratively improve factors under different search paradigms

Supported Search Methods

Chain-of-Experience (CoE)
Tree-of-Thought (ToT)
Evolutionary Algorithms (Mutation & Crossover)

Metrics

  • Search quality (performance improvement)
  • Search cost (tokens, rounds)
  • Reliability and diversity

Search Paradigms Illustrated

Chain-of-Experience (CoE)

Sequential refinement: Starting from a seed factor, the LLM iteratively improves the formula based on execution feedback.
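
A sequential refinement loop might look like this sketch, where `llm_refine` and `backtest_ic` are hypothetical stand-ins for the LLM call and the Qlib backtest:

def chain_of_experience(seed: str, llm_refine, backtest_ic, rounds: int = 10) -> str:
    """Refine one factor formula round by round, keeping the best so far.
    Each round shows the LLM the full history of (formula, IC) pairs."""
    best, best_ic = seed, backtest_ic(seed)
    history = [(best, best_ic)]
    for _ in range(rounds):
        candidate = llm_refine(best, history)  # LLM proposes an improved formula
        ic = backtest_ic(candidate)
        history.append((candidate, ic))
        if ic > best_ic:
            best, best_ic = candidate, ic
    return best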

Tree-of-Thought (ToT)

Branching exploration: Multiple factor variants are generated at each node, with poor branches pruned based on performance.

Evolutionary Algorithm (EA)

Population-based optimization: Factors evolve through mutation and crossover. Selection pressure based on IC/RankIC fitness drives the population toward higher-performing formulas. In AlphaBench's experiments, this paradigm achieves the best search efficiency of the three.
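
A minimal population loop under these assumptions; `mutate`, `crossover`, and `backtest_ic` are hypothetical LLM-backed operators, and in practice fitness values would be cached rather than recomputed:

import random

def evolutionary_search(seed_pool, mutate, crossover, backtest_ic,
                        generations: int = 10, pop_size: int = 20):
    """Evolve a population of factor formulas with backtest IC as fitness."""
    population = list(seed_pool)[:pop_size]
    for _ in range(generations):
        ranked = sorted(population, key=backtest_ic, reverse=True)
        parents = ranked[: pop_size // 2]            # selection pressure
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = crossover(a, b)                  # LLM recombines two formulas
            if random.random() < 0.3:                # illustrative mutation rate
                child = mutate(child)                # LLM perturbs the formula
            children.append(child)
        population = parents + children
    return max(population, key=backtest_ic)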

Experimental Setup

Markets

  • CSI300 (China A-share market)
  • SP500 (US market)

Initial Factor Pool

  • Alpha158 (Qlib)

Time Span

  • 2020 – 2025

Backtesting Engine

  • Qlib
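
For orientation, the CSI300 side of this setup maps to Qlib roughly as follows. This is a sketch against Qlib's public API with illustrative paths and dates, not necessarily the benchmark's exact configuration:

import qlib
from qlib.config import REG_CN
from qlib.contrib.data.handler import Alpha158

# Illustrative: China A-share data with Alpha158 as the initial factor pool.
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data", region=REG_CN)

handler = Alpha158(
    instruments="csi300",
    start_time="2020-01-01",
    end_time="2025-12-31",
)
features = handler.fetch(col_set="feature")  # Alpha158 factor matrix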

Key Findings

Model Performance Comparison

[Figure: radar charts comparing model performance across all tasks]

Comprehensive evaluation across all tasks reveals distinct patterns in LLM capabilities.

  • Generation & Evaluation: Measuring reliability, stability, accuracy, and ranking precision
  • Searching Performance: IC improvement, cost efficiency, exploration diversity
  • Key Insight: Mid-sized commercial models offer the best cost–performance tradeoff

What LLMs Do Well

  • Reliable factor generation (high syntactic validity)
  • Effective exploration in searching tasks
  • Strong performance as generators

Where LLMs Fail

  • Poor zero-shot factor evaluation
  • Weak semantic grounding without execution context
  • Limited gains from Chain-of-Thought in complex tasks

Practical Insights

  • Mid-sized commercial models offer the best cost–performance tradeoff
  • Evolutionary search outperforms single-path refinement
  • Vanilla prompts often outperform complex CoT designs

Beyond Quant Finance

AlphaBench is a portable evaluation template for any executable DSL reasoning task, including:

  • Symbolic regression
  • Feature engineering
  • Constraint solving
  • Planning and optimization DSLs
  • A/B testing metric design

Scalability

Large instruction sets
Multiple models
Multiple markets
Token-level cost analysis

Designed for research benchmarking and real-world system design.

How to Cite

AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining
(ICLR 2026)
BibTeX
@inproceedings{luo2026alphabench,
  title={AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining},
  author={Haochen Luo and Ho Tin Ko and Jiandong Chen and David Sun and Yuan Zhang and Chen Liu},
  booktitle={International Conference on Learning Representations},
  year={2026}
}