The First Comprehensive Evaluation Framework for
Formulaic Factor Mining by Large Language Models
A unified benchmark and toolchain for systematically evaluating LLMs in formulaic alpha factor mining, covering generation, evaluation, and searching tasks with executable financial DSLs.
Direct generation, quality evaluation, and iterative searching — covering the full alpha mining workflow
End-to-end toolchain with 1,857 instructions, FFO execution engine, and Qlib-based backtesting
Three standardized search paradigms: Chain-of-Experience, Tree-of-Thought, and Evolutionary Algorithms
Alpha factors are mathematical expressions derived from market data (price, volume, fundamentals) that predict future stock returns — the core building block of quantitative investment strategies. Mining effective alpha factors has traditionally required deep domain expertise and exhaustive manual search.
Large Language Models offer a promising paradigm shift: they can encode financial knowledge, generate factor candidates from natural language descriptions, and iteratively refine formulas with execution feedback. AlphaBench is the first systematic benchmark to rigorously evaluate LLMs across the entire formulaic alpha mining workflow — from generation to evaluation to iterative searching.
Figure: Overall benchmark — AlphaBench evaluates LLMs across the full formulaic alpha mining workflow using real financial data and task-aligned quantitative metrics.
Figure: Task description — AlphaBench decomposes alpha mining into three core tasks: factor generation, factor evaluation, and iterative factor searching.
Can LLMs translate financial concepts into executable alpha factor formulas? We probe two scenarios: (1) Text2Alpha — converting a textual description (e.g., "5-day momentum adjusted for volatility") into a syntactically valid formula; (2) Directional Mining — generating a diverse set of factors under a given semantic direction (e.g., "volatility-based"). We measure reliability, output stability, and semantic alignment.
Running a full backtesting simulation is expensive. We ask: can LLMs act as zero-shot judges and predict factor quality without executing the formula? This includes ranking a candidate pool to select top-K factors, and scoring each factor on multiple quality dimensions — IC, RankIC, robustness, win rate, and skewness. Our findings reveal this remains the weakest capability of current LLMs.
Different search algorithms make different demands on LLMs. We evaluate LLMs as components within three paradigms: Chain-of-Experience (sequential refinement), Tree-of-Thought (branching exploration), and Evolutionary Algorithms (population-based mutation & crossover). We provide standardized baseline implementations and compare search quality, cost efficiency, and factor diversity across all three approaches.
The generation instruction set probes whether LLMs can translate financial knowledge into executable formulaic alpha factor expressions. We design two complementary subtasks to cover different generation scenarios — from concept-grounded translation to open-ended exploration.
The model receives a natural language description of a financial concept and must produce a syntactically valid, executable alpha formula. Inputs range from simple directional ideas (e.g., "price momentum") to compound multi-operator expressions. Ground truth formulas are curated from expert-designed Alpha158 factors with aligned textual annotations.
Given a semantic direction (e.g., "volatility-based factors"), the model must generate a diverse set of K valid alpha formulas that collectively explore that concept space. This tests the model's ability to avoid degenerate repetition and produce functionally distinct candidates under a single thematic prompt.
(-1 * Rank( (close / Mean(close, 5) - 1) / Std(close, 20) ))
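For illustration, the example formula above can be sketched in pandas over a wide price frame (dates × stocks). Here Mean/Std are read as rolling time-series operators and Rank as a cross-sectional percentile rank; these semantics are an assumption for illustration and may differ from the exact DSL operator definitions.

```python
import numpy as np
import pandas as pd

def example_factor(close: pd.DataFrame) -> pd.DataFrame:
    # (close / Mean(close, 5) - 1): deviation from the 5-day rolling mean
    deviation = close / close.rolling(5).mean() - 1
    # ... / Std(close, 20): scale by 20-day rolling volatility
    scaled = deviation / close.rolling(20).std()
    # Rank(...): cross-sectional percentile rank per day, then negate the signal
    return -1 * scaled.rank(axis=1, pct=True)

# toy demo: random-walk prices for 3 stocks over 30 days
rng = np.random.default_rng(0)
close = pd.DataFrame(
    100 + rng.normal(0, 1, size=(30, 3)).cumsum(axis=0),
    columns=["A", "B", "C"],
)
factor = example_factor(close)
```

The first 19 rows are NaN because the 20-day rolling window is not yet filled; from then on each row holds negated cross-sectional ranks in (-1, 0].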
FFO is our core factor execution and evaluation engine, built on top of the Qlib backtesting framework. It serves as the closed-loop feedback mechanism that bridges LLM outputs with real market performance signals.
Automatically parses and checks whether a generated formula is syntactically valid and executable within the Qlib DSL. Returns structured error messages for repair loops.
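As a toy illustration of this check, the sketch below whitelists operators and data fields and returns a structured verdict an LLM repair loop could consume. The operator list, field names, and `check_formula` itself are assumptions for illustration, not FFO's actual API or the full Qlib grammar.

```python
import re

# Illustrative whitelists -- assumptions, not FFO's real grammar
ALLOWED_OPS = {"Rank", "Mean", "Std", "Sum", "Ref", "Delta", "Corr", "Max", "Min"}
ALLOWED_FIELDS = {"open", "high", "low", "close", "volume", "vwap"}

def check_formula(expr: str) -> dict:
    """Return a structured verdict with machine-readable error messages."""
    errors = []
    if expr.count("(") != expr.count(")"):
        errors.append("unbalanced parentheses")
    for name in re.findall(r"[A-Za-z_]\w*", expr):
        if name not in ALLOWED_OPS and name.lower() not in ALLOWED_FIELDS:
            errors.append(f"unknown identifier: {name}")
    return {"valid": not errors, "errors": errors}
```

A valid formula like `Rank(Mean(close, 5))` passes, while `Rank(Foo(close, 5)` yields both an unbalanced-parentheses error and an unknown-identifier error that can be fed back to the model.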
Computes IC (Information Coefficient), RankIC, ICIR, and annualized returns for each factor on real market data from CSI300 and SP500 (2020–2025).
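The core metrics are standard: daily IC is the cross-sectional Pearson correlation between factor values and next-period returns, RankIC the Spearman analogue, and ICIR the mean daily IC over its standard deviation. A minimal pandas sketch (variable names are illustrative, not FFO's API):

```python
import numpy as np
import pandas as pd

def daily_ic(factor: pd.DataFrame, fwd_ret: pd.DataFrame) -> pd.Series:
    # Pearson correlation across stocks, one value per date
    return factor.corrwith(fwd_ret, axis=1)

def daily_rank_ic(factor: pd.DataFrame, fwd_ret: pd.DataFrame) -> pd.Series:
    # Spearman = Pearson on cross-sectional ranks
    return factor.rank(axis=1).corrwith(fwd_ret.rank(axis=1), axis=1)

# toy data: factor values perfectly inversely ordered with forward returns
dates = pd.date_range("2024-01-01", periods=4)
factor = pd.DataFrame([[3, 2, 1]] * 4, index=dates, columns=list("ABC"))
fwd_ret = pd.DataFrame([[0.01, 0.02, 0.03]] * 4, index=dates, columns=list("ABC"))

ic = daily_ic(factor, fwd_ret).mean()            # ≈ -1.0
rank_ic = daily_rank_ic(factor, fwd_ret).mean()  # ≈ -1.0
```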
Provides structured backtesting feedback — including metric values, error traces, and comparative rankings — that LLMs use for closed-loop factor refinement.
Evaluates factors on both CSI300 (China A-share) and SP500 (US equities) to measure cross-market generalization and robustness.
Records per-call token usage and API costs throughout the evaluation pipeline, enabling cost-aware comparison across models and search strategies.
Exposes a unified API compatible with all three searching paradigms (CoE, ToT, EA), enabling plug-and-play swapping of LLM backends and search strategies.
This task asks: can LLMs act as zero-shot factor quality judges without executing any backtesting? We evaluate two levels of judgment — coarse-grained ranking and fine-grained scoring — and construct a ground-truth labeled dataset via Qlib backtesting to enable rigorous quantitative evaluation.
Given a pool of K candidate alpha factors (formulas + metadata), the LLM must select and rank the top factors by predicted IC/RankIC performance. We measure NDCG@K and Precision@K against ground-truth backtesting rankings.
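The two metrics follow their standard definitions; the sketch below uses binary relevance (in or out of the ground-truth top-K) for NDCG. The exact relevance weighting used by the benchmark may differ, and the factor names are made up for the demo.

```python
import math

def precision_at_k(pred: list, truth_topk: set, k: int) -> float:
    # fraction of the predicted top-k that appears in the ground-truth top-K
    return len(set(pred[:k]) & truth_topk) / k

def ndcg_at_k(pred: list, truth_topk: set, k: int) -> float:
    # DCG with binary gains, normalized by the ideal ordering's DCG
    dcg = sum(1 / math.log2(i + 2) for i, f in enumerate(pred[:k]) if f in truth_topk)
    idcg = sum(1 / math.log2(i + 2) for i in range(min(k, len(truth_topk))))
    return dcg / idcg

truth = {"f1", "f2", "f3"}             # ground-truth top-3 by backtested IC
pred = ["f1", "f9", "f2", "f7", "f3"]  # LLM's predicted ranking

p = precision_at_k(pred, truth, 3)  # 2 of top-3 correct → 0.666...
n = ndcg_at_k(pred, truth, 3)
```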
The LLM receives a single alpha factor formula and must predict quantitative quality scores across multiple dimensions. We evaluate prediction error (MAE, RMSE) against Qlib-computed ground truth values.
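The error metrics are the standard ones, computed per quality dimension between the LLM's predicted scores and the Qlib-computed labels. A minimal sketch with toy predicted-vs-backtested IC values (the numbers are made up for illustration):

```python
import math

def mae(pred: list, true: list) -> float:
    # mean absolute error between predicted and ground-truth scores
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def rmse(pred: list, true: list) -> float:
    # root mean squared error; penalizes large misses more heavily
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

# toy: LLM-predicted vs backtested IC for four factors
pred = [0.05, 0.02, -0.01, 0.03]
true = [0.04, 0.01, 0.00, 0.06]
```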
Seed factors from Alpha158 + LLM-generated factors from the generation task. Deduplicated and syntax-validated via FFO.
Each factor is executed on CSI300 and SP500 (2020–2025) via Qlib to produce ground truth IC, RankIC, and auxiliary metrics.
Ranking queries (factor pools with GT rankings) and scoring queries (factor + score label pairs) are structured into 800 evaluation instances.
We provide three standardized baseline search agents that integrate LLMs with the FFO execution engine for iterative alpha factor discovery. Each agent represents a distinct algorithmic paradigm, enabling direct comparison of search efficiency, quality, and cost.
Starting from a seed factor, the LLM iteratively refines the formula based on structured backtesting feedback from the previous round. Each iteration's IC/RankIC result, along with the formula and error trace, is appended to a growing context chain that guides subsequent generations.
Multiple factor variants are generated at each tree node. Branches are evaluated via FFO; poor-performing branches are pruned, while promising ones are expanded further. This enables broader coverage of the factor space than a linear chain.
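One simple instantiation of this prune-and-expand loop is a beam search, sketched below. `expand` stands in for the LLM generating variants at a node and `evaluate` for FFO's score; both names are illustrative assumptions, not the benchmark's API.

```python
def tree_search(seed: str, expand, evaluate, beam: int = 2, depth: int = 3) -> str:
    """Expand a frontier of formulas, keeping only the top `beam` branches."""
    frontier = [seed]
    best = (evaluate(seed), seed)
    for _ in range(depth):
        # LLM proposes variants at every surviving node
        candidates = [c for node in frontier for c in expand(node)]
        scored = sorted(((evaluate(c), c) for c in candidates), reverse=True)
        frontier = [c for _, c in scored[:beam]]   # prune poor-performing branches
        if scored and scored[0] > best:
            best = scored[0]
    return best[1]
```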
A population of alpha factor candidates evolves through two LLM-driven genetic operators: Mutation (modifying a single factor's formula structure) and Crossover (combining sub-expressions from two parent factors). Selection pressure based on IC/RankIC fitness drives the population toward higher-performing formulas across generations. Our experiments show this paradigm achieves the best search efficiency.
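The population loop can be sketched as follows. Here `fitness`, `mutate`, and `crossover` are stand-ins for the FFO score and the two LLM-driven operators; all names are illustrative assumptions rather than the benchmark's implementation.

```python
import random

def evolve(population, fitness, mutate, crossover, generations: int = 5, rng=None):
    """Evolve a population of formulas under fitness-based selection pressure."""
    rng = rng or random.Random(0)
    for _ in range(generations):
        # selection: keep the fitter half of the population as parents
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: max(2, len(scored) // 2)]
        # variation: LLM-driven crossover of two parents, then mutation
        children = []
        while len(children) < len(population) - len(parents):
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = parents + children
    return max(population, key=fitness)
```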
Comprehensive evaluation across all tasks reveals distinct patterns in LLM capabilities.
@inproceedings{
luo2026alphabench,
title={AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining},
author={Haochen Luo and Ho Tin Ko and Jiandong Chen and David Sun and Yuan Zhang and Chen Liu},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=d97Q8r7ZKZ}
}