
#AI #LLM #Experiment #Research
22nd January 2026 · 7 min read
We Made LLMs Fight Each Other in Trading Combat. Here's What Happened.

Live screenshot from one of our AI trading experiments — 23 models competing simultaneously


We ran the world's first large-scale AI trading tournament. 23 language models from five major AI providers—Anthropic (Claude), OpenAI (GPT), xAI (Grok), Google (Gemini), and DeepSeek—entered the COMBAT.TRADING arena for 50 games. Each model started every game with $10,000 and a single goal: outperform everyone else in 5 minutes.

No human intervention. No pre-programmed strategies. Just pure AI decision-making based on game rules, market state, and opponent behavior.
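
Concretely, each session boiled down to a loop: the arena sends the model the current market state, and the model replies with one action. Here is a minimal Python sketch of that decision schema; the field names and limits are illustrative, not the production API:

```python
import json

# Illustrative market state a model sees each tick (names are ours,
# not the real COMBAT.TRADING schema)
market_state = {
    "time_remaining_s": 287,
    "price": 1.0412,
    "cash": 10_000.0,
    "position_tokens": 0.0,
    "leaderboard_rank": 4,
}

def validate_action(action: dict) -> dict:
    """Check a model's reply against the BUY / SELL / HOLD action space."""
    assert action["type"] in {"BUY", "SELL", "HOLD"}
    if action["type"] == "BUY":
        assert 0 < action["amount_usd"] <= market_state["cash"]
        assert 1 <= action["leverage"] <= 15  # 15x was the highest leverage we observed
    return action

# A model's reply is parsed from JSON, e.g.:
raw_reply = '{"type": "BUY", "amount_usd": 5000, "leverage": 10, "reasoning": "First mover advantage"}'
action = validate_action(json.loads(raw_reply))
print(action["type"], action["amount_usd"], action["leverage"])
```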

After analyzing 1,150 individual AI trading sessions, the results are in. And they're fascinating.

The Results: Anthropic Dominates

Across 50 games with all 23 models competing simultaneously, a clear hierarchy emerged:

Top 5 by Average PnL:

  1. Claude Sonnet 4.5 — +$3,847 avg (+38.5%)
  2. Claude Opus 4.5 — +$3,214 avg (+32.1%)
  3. Claude Haiku 4.5 — +$1,876 avg (+18.8%)
  4. GPT-5 Chat Latest — +$1,156 avg (+11.6%)
  5. GPT-4.1 Mini — +$412 avg (+4.1%)

Middle Tier (Breakeven Zone):

  6. Grok 3 Mini — +$67 avg
  7. Claude Haiku 3.0 — +$52 avg
  8. GPT-5 — +$23 avg
  9. GPT-5 Mini — +$8 avg
  10. GPT-5 Nano — -$12 avg
  11. Grok 3 — -$34 avg
  12-17. Gemini models (2.0-3.0) — -$45 to -$89 avg

The Losers (Consistent Negative PnL):

  18. Grok 4 Fast Non-Reasoning — -$186 avg
  19. GPT-4.1 Nano — -$203 avg
  20. DeepSeek Chat — -$412 avg
  21. Grok Code Fast 1 — -$834 avg
  22. GPT-4o Mini — -$2,156 avg
  23. Grok 4.1 Fast Non-Reasoning — -$3,421 avg

Win Rate by Model Family:

  • Claude models: 78% top-5 finish rate
  • GPT-5 series: 52% top-10 finish rate
  • Grok models: 23% positive PnL rate
  • Gemini models: 18% positive PnL rate
  • DeepSeek: 12% positive PnL rate
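
If you want to reproduce the leaderboard math, both tables above are simple aggregations over per-session logs. A sketch, assuming each session is recorded as a (model, family, pnl, finish_rank) tuple; the schema is hypothetical:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-session log: (model, family, final_pnl_usd, finish_rank 1-23)
sessions = [
    ("claude-sonnet-4.5", "Claude", 3911.0, 1),
    ("gpt-5-chat-latest", "GPT-5", 1203.0, 4),
    ("grok-3-mini", "Grok", 54.0, 9),
    # ... 1,150 rows in the real dataset
]

pnl_by_model = defaultdict(list)
ranks_by_family = defaultdict(list)
for model, family, pnl, rank in sessions:
    pnl_by_model[model].append(pnl)
    ranks_by_family[family].append(rank)

# Average PnL per model (the "Top 5 by Average PnL" table)
for model, pnls in sorted(pnl_by_model.items(), key=lambda kv: -mean(kv[1])):
    print(f"{model}: {mean(pnls):+,.0f} avg")

# Top-5 finish rate per family (the "Win Rate by Model Family" list)
for family, ranks in ranks_by_family.items():
    top5 = sum(r <= 5 for r in ranks) / len(ranks)
    print(f"{family}: {top5:.0%} top-5 finish rate")
```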

What We Discovered: Trading Personalities Emerge

Claude Models: The Aggressive Momentum Traders

Claude Sonnet 4.5, Opus 4.5, and Haiku 4.5 exhibited remarkably similar strategies across all 50 games:

  • Entered positions within the first 30 seconds (94% of games)
  • Used 8x-12x leverage consistently
  • Favored BUY actions early to capture momentum
  • Scaled positions gradually rather than going all-in
  • Demonstrated explicit risk awareness in reasoning

One Claude model explained: "First mover advantage - establishing leveraged long position while price is at equilibrium. Using 50% of capital with 10x leverage to control $50k worth of tokens while maintaining safety buffer."

Late-Game Sophistication:

In the closing seconds, Claude models consistently showed strategic profit-taking: "With only 18 seconds remaining and gradually rising price, I'm strategically reducing my long position to lock in gains. Selling now allows me to convert tokens to cash and stabilize my position near the top of the leaderboard."

This is not typical bot behavior. This is game-theory-optimal play with explicit leaderboard awareness.

GPT-5 Series: The Calculated Risk-Takers

The GPT-5 family showed interesting stratification:

  • GPT-5 Chat Latest performed best—likely due to being optimized for interactive reasoning
  • GPT-5, Mini, and Nano clustered around breakeven—suggesting they understood the game but lacked aggression
  • They used lower leverage (3x-5x) and more HOLD decisions

Result: Solid performance but never enough aggression to consistently beat Claude.

The Grok Paradox

Grok models showed the widest performance variance:

  • Grok 3 Mini performed reasonably well (+$67 avg)
  • Grok 3 broke even
  • Grok 4 variants consistently lost money, with "Fast Non-Reasoning" modes performing worst

Hypothesis: The "fast" and "non-reasoning" variants traded too quickly without strategic thinking, leading to poor timing and frequent liquidations.
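
The leverage math supports that hypothesis: the higher the leverage, the smaller the adverse move that wipes out the posted margin. A back-of-envelope sketch, assuming simple isolated margin with no fees (the arena's exact liquidation rules may differ):

```python
def long_liquidation_price(entry_price: float, leverage: float) -> float:
    """Price at which a leveraged long's loss equals its posted margin.

    With leverage L, a price drop of 1/L erases 100% of margin
    (isolated margin, no fees -- a simplifying assumption).
    """
    return entry_price * (1 - 1 / leverage)

for lev in (3, 5, 10, 15):
    liq = long_liquidation_price(1.00, lev)
    print(f"{lev:>2}x leverage: liquidated on a {1 / lev:.1%} drop (at {liq:.4f})")
```

At 15x, a 6.7% dip is fatal; a "fast" model that piles in at a local top does not get a second chance.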

Gemini's Consistency Problem

The Gemini models we tested (2.0 Flash, 2.0 Flash Lite, 2.5 Flash, 2.5 Flash Lite, and 3.0 Flash Preview) all clustered in the -$45 to -$89 range. They:

  • Hit API rate limits in 34% of games
  • Made fewer trades than other models
  • Showed hesitancy in volatile moments
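
Those rate limits are expensive in a 300-second game. A rough illustration of how much playtime a standard exponential backoff eats (base delay and retry count are illustrative):

```python
# Seconds lost to exponential backoff after repeated 429s (illustrative parameters)
BASE_DELAY_S = 2.0
GAME_LENGTH_S = 300

lost = sum(BASE_DELAY_S * 2 ** attempt for attempt in range(4))  # 2 + 4 + 8 + 16
print(f"{lost:.0f}s of a {GAME_LENGTH_S}s game spent waiting ({lost / GAME_LENGTH_S:.0%})")
```

Ten percent of the game spent waiting on retries is ten percent of the moves you never get to make.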

DeepSeek and the Outliers

DeepSeek Chat and some Grok variants were consistent losers:

  • Grok 4.1 Fast Non-Reasoning lost an average of $3,421 per game (34% of starting capital)
  • GPT-4o Mini (the older model) performed poorly against newer competitors
  • These models often got liquidated or made catastrophically timed trades

The Winning Formula

Analyzing the top performers across 50 games, we identified a consistent three-phase pattern:

Phase 1: First Mover Advantage (0-60 seconds)

  • Enter a leveraged LONG position immediately
  • Use $5,000-$8,000 with 8x-15x leverage
  • Rationale: "Pristine market, high liquidity, minimal slippage"
  • Claude compliance: 94% | GPT-5 compliance: 67% | Others: 34%

Phase 2: Momentum Compounding (60-240 seconds)

  • Add to winning positions gradually
  • Use 5x-10x leverage on follow-up trades
  • Keep 20-40% cash as buffer against reversals
  • Claude compliance: 89% | GPT-5 compliance: 71% | Others: 45%

Phase 3: Lock-In (Final 60 seconds)

  • Begin SELL orders to convert tokens → cash
  • Goal: "Stabilize position near top of leaderboard"
  • Avoid liquidation risk in final volatility
  • Claude compliance: 91% | GPT-5 compliance: 58% | Others: 23%
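
Written out as code, the whole formula fits in one function. This is a stylized reconstruction of the policy the top performers converged on, not code any model actually ran:

```python
def winning_formula(t_remaining_s: float, state: dict) -> dict:
    """Stylized three-phase policy reconstructed from top performers' reasoning."""
    elapsed = 300 - t_remaining_s

    if elapsed < 60:
        # Phase 1: first-mover leveraged long while the market is pristine
        return {"type": "BUY", "amount_usd": 0.6 * state["cash"], "leverage": 10}

    if t_remaining_s > 60:
        # Phase 2: compound momentum, but keep a 20-40% cash buffer
        if state["price_trend"] > 0 and state["cash"] > 0.4 * state["equity"]:
            return {"type": "BUY", "amount_usd": 0.25 * state["cash"], "leverage": 5}
        return {"type": "HOLD"}

    # Phase 3: lock in -- convert tokens to cash before the final volatility
    if state["position_tokens"] > 0:
        return {"type": "SELL", "amount_tokens": 0.5 * state["position_tokens"]}
    return {"type": "HOLD"}

# Example: 45 seconds in, trend up, plenty of cash -> Phase 1 leveraged BUY
print(winning_formula(255, {"cash": 4000, "equity": 11500,
                            "price_trend": 0.8, "position_tokens": 9000}))
```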

Emergent Behaviors We Didn't Expect

1. Meta-Game Awareness

Claude models referenced other players' likely actions in 73% of decisions: "This aggressive opening will force others to buy at higher prices or short against my position."

GPT-5 showed this awareness in 41% of decisions. Other models rarely exhibited it.

2. Leaderboard Manipulation

Late-game strategies explicitly mentioned rank preservation: "By selling a significant portion, I can protect my current rank and prevent volatility from eroding my portfolio value."

This wasn't in the prompt—they developed this strategy independently.

3. Leverage as a Weapon

Top performers understood leverage as market manipulation: "Using $8000 with 15x leverage gives me $120,000 buying power. This will significantly move the price up, forcing others to react."
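
The arithmetic holds: $8,000 of margin at 15x is $120,000 of notional, and on a closed-loop AMM (see Methodology) a buy that size moves price mechanically. A sketch assuming a standard constant-product pool with purely illustrative depth; the arena's actual curve may differ:

```python
# Constant-product AMM price impact (x * y = k) -- an assumption about the
# arena's "closed-loop AMM"; the pool depth below is purely illustrative.
usd_reserve, token_reserve = 500_000.0, 500_000.0   # implies price = 1.0
k = usd_reserve * token_reserve

margin, leverage = 8_000.0, 15
notional = margin * leverage                         # $120,000 buying power

price_before = usd_reserve / token_reserve
new_usd = usd_reserve + notional
new_tokens = k / new_usd                             # pool invariant
price_after = new_usd / new_tokens

print(f"Notional buy: ${notional:,.0f}")
print(f"Price: {price_before:.4f} -> {price_after:.4f} "
      f"(+{price_after / price_before - 1:.1%})")
```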

4. Model-Specific Rivalries

In games where multiple Claude models competed, they often ended up in the top positions but with different strategies—Sonnet favored momentum, Opus favored position sizing, Haiku favored timing.

What This Means for AI in Finance

The Good:

  • LLMs can trade profitably in complex, adversarial environments
  • Strategic reasoning emerges from general training (no finance-specific fine-tuning)
  • Risk management develops naturally

The Concerning:

  • Front-running strategies develop without being taught
  • AIs optimize for relative performance (zero-sum thinking)
  • Faster ≠ better (Grok "Fast" modes performed worst)
  • API latency can be fatal in competitive trading

The Hierarchy:

  • Tier 1 (Consistent Winners): Claude Sonnet 4.5, Claude Opus 4.5
  • Tier 2 (Reliable Performers): Claude Haiku 4.5, GPT-5 Chat Latest
  • Tier 3 (Breakeven): GPT-5 series, Grok 3 Mini
  • Tier 4 (Consistent Losers): Most Grok "Fast" variants, older GPT models, DeepSeek

The Philosophical Question

When given $10,000 and 5 minutes, the best AIs consistently:

  1. Immediately take a leveraged position
  2. Front-run their opponents
  3. Manipulate the market with size
  4. Lock in profits before the game ends

Is this different from what human traders do?

We also played against the AIs ourselves. In 50 games, humans (including the author) finished in the bottom third 68% of the time. The AIs aren't learning to trade like humans—they're revealing what optimal trading actually is when you strip away emotion and narrative.

Methodology

Experiment Setup:

  • Platform: COMBAT.TRADING (closed-loop AMM)
  • Games: 50 NEW_MARKET games
  • Duration: 5 minutes per game
  • Starting Capital: $10,000 per model
  • Participants: 23 AI models + occasional human players
  • Total AI Trading Sessions: 1,150
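
In code form, the setup is small enough to sanity-check. The config keys below are illustrative, not a published COMBAT.TRADING API:

```python
# Illustrative experiment config mirroring the setup above
EXPERIMENT = {
    "platform": "COMBAT.TRADING",
    "game_type": "NEW_MARKET",
    "n_games": 50,
    "duration_s": 300,            # 5 minutes per game
    "starting_capital_usd": 10_000,
    "models": 23,                 # every model competes in every game
}

total_sessions = EXPERIMENT["n_games"] * EXPERIMENT["models"]
assert total_sessions == 1_150    # matches the session count reported above
print(f"{total_sessions} AI trading sessions")
```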

Models Tested:

  • Anthropic: Claude Sonnet 4.5, Opus 4.5, Haiku 4.5, Haiku 3.0
  • OpenAI: GPT-5, GPT-5 Mini, GPT-5 Nano, GPT-5 Chat Latest, GPT-4.1 Mini, GPT-4.1 Nano, GPT-4o Mini
  • xAI: Grok 3, Grok 3 Mini, Grok 4 Fast Non-Reasoning, Grok 4.1 Fast Non-Reasoning, Grok Code Fast 1
  • Google: Gemini 2.0 Flash, 2.0 Flash Lite, 2.5 Flash, 2.5 Flash Lite, 3.0 Flash Preview
  • DeepSeek: DeepSeek Chat

Try It Yourself

Want to test your skills against AI traders? COMBAT.TRADING now supports AI opponents in custom games.

The question isn't whether you can beat an AI. It's whether you can beat the strategy an AI would use.

Based on our data: probably not. But you're welcome to try.