We Built an AI Market Forecasting Engine That Actually Keeps Score

Everyone in fintech is building AI trading agents right now. Most of them are demos. Ours isn't.

We spent the better part of a year studying multi-agent trading frameworks — open-source research projects, commercial signal platforms, arena-style competitions — trying to understand what actually works and what's just a parlor trick with a stock ticker. Most of what's out there falls into two categories: academic prototypes that can't survive contact with live markets, or black-box signal services that never tell you how they actually performed.

We built TradeHorde to be neither.

[Screenshot: TradeHorde Signal Radar]

What We Actually Built

TradeHorde is "Collective Intelligence for Markets," and the name isn't just branding. The core idea is simple but surprisingly rare in practice: instead of one AI model giving you one opinion on a trade, a swarm of specialized agents analyze the same opportunity from different angles — technical, fundamental, sentiment, macro regime — and the system synthesizes their perspectives into a unified view.

Think of it less like asking ChatGPT about a stock and more like staffing a research desk with analysts who are each domain experts, then letting them debate before publishing a note.
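
To make that concrete, here's a minimal sketch of the fan-out pattern in Python. Everything in it is illustrative: the agent roster, the function names, and the stub return values are our assumptions, not TradeHorde's actual internals.

```python
import asyncio

# Illustrative sketch only. The agent roster and return shapes are
# assumptions for this post, not TradeHorde's real implementation.
ANGLES = ["technical", "fundamental", "sentiment", "macro_regime"]

async def run_agent(angle: str, ticker: str) -> dict:
    """Each specialist analyzes the same opportunity from its own angle."""
    # The angle-specific model call would go here; stubbed for the sketch.
    return {"angle": angle, "stance": "short", "conviction": 0.68}

async def analyze(ticker: str) -> list[dict]:
    """Fan out: every specialist sees the same ticker in parallel."""
    views = await asyncio.gather(*(run_agent(a, ticker) for a in ANGLES))
    return list(views)  # downstream, these are synthesized into one view

if __name__ == "__main__":
    print(asyncio.run(analyze("AMZN")))
```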

Under the Hood — How the Horde Actually Works

Here's what the analysis view looks like in practice. When you run an analysis on AMZN, the system doesn't call one model and format the output. It runs multiple distinct LLMs simultaneously and forces each of them to argue both sides.

In this AMZN analysis, four models weighed in:

  • GPT-5.2 — OpenAI's latest, leaned short at 68% conviction
  • Claude-Opus-4.5 — Anthropic's flagship, leaned short at 68%
  • Gemini-2.5-Pro — Google's model, contested — made a bull case citing strong technical momentum and earnings history
  • Claude-Sonnet-4.5 — Anthropic's fast reasoning model, also contested — argued AMZN presented a compelling long opportunity

Every model is required to produce both a bull case and a bear case before declaring a side. All four models wrote bear cases. GPT-5.2 identified a low-volume "air pocket" near resistance that could accelerate a selloff toward the Point of Control at $232.38. Claude Opus flagged the same volume node, plus a market regime transitioning from bull to bear at 72% confidence, heavy AI capex spending creating margin pressure, and the 16,000 layoffs signaling management concern about profitability. On the bull side, Gemini pointed to earnings outperformance history and rising RSI, while Claude Sonnet saw converging technical and fundamental tailwinds.
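
One way to enforce that both-sides requirement is a response schema that rejects any model output missing either case. A hypothetical sketch; the field names are ours, not TradeHorde's:

```python
from dataclasses import dataclass

@dataclass
class ModelView:
    """What each model must return before it's allowed to pick a side."""
    model: str
    bull_case: list[str]   # required even if the model leans short
    bear_case: list[str]   # required even if the model leans long
    stance: str            # "long", "short", or "contested"
    conviction: float      # 0.0-1.0, e.g. 0.68 for the GPT-5.2 short

    def __post_init__(self) -> None:
        # Reject any response that skipped one side of the argument.
        if not self.bull_case or not self.bear_case:
            raise ValueError(f"{self.model} must argue both sides")
```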

The system synthesized the four model views into Lean Bearish — but critically, it labeled the conviction as Low, noting "Only 50% voted — not enough for consensus." Two models went short, zero went long, and two were contested. The system didn't pretend it had high confidence. It told you exactly how much agreement existed and how much was missing.
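
The synthesis step is easy to picture as a vote tally. Here's a hedged sketch that reproduces the AMZN outcome; the labels match what the product shows, but the thresholds are our guesses:

```python
def synthesize(stances: list[str]) -> tuple[str, str]:
    """Tally model stances into a direction label and a conviction label.
    Thresholds are illustrative assumptions, not TradeHorde's cutoffs."""
    shorts = stances.count("short")
    longs = stances.count("long")
    voted = shorts + longs        # contested models abstain from the vote
    share = voted / len(stances)  # AMZN example: 2 of 4 = 50% voted

    if shorts > longs:
        direction = "Lean Bearish"
    elif longs > shorts:
        direction = "Lean Bullish"
    else:
        direction = "Contested"
    conviction = "Low" if share <= 0.5 else "Conditional" if share < 0.75 else "High"
    return direction, conviction

print(synthesize(["short", "short", "contested", "contested"]))
# ('Lean Bearish', 'Low'): only 50% voted, not enough for consensus
```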

The structured trade setup gave entry at $244.45–$244.50, stop at $248.50–$250.00, and target at $232.38–$232.50 — roughly a 2.5:1 reward-to-risk ratio on the short.

The result: WIN. +4.9%, +3.0R. AMZN moved from $244.47 to $232.44 in two days and nine hours, hitting the target zone almost exactly. The Point of Control at $232.38 that GPT-5.2 and Claude Opus both independently identified as a "magnetic downside target" turned out to be the actual bottom of the move.
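
For readers new to R-multiples: R expresses profit in units of the risk taken at entry. Here's the arithmetic behind +4.9% and +3.0R, using the published levels:

```python
# Short trade: profit when price falls. Levels from the AMZN setup above.
entry, stop, exit_price = 244.47, 248.50, 232.44

risk = stop - entry            # 4.03 per share: what "1R" meant on this trade
reward = entry - exit_price    # 12.03 per share captured on the way down

print(f"return: {reward / entry:+.1%}")      # +4.9%
print(f"R-multiple: {reward / risk:+.1f}R")  # +3.0R
```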

This is what we mean by collective intelligence. Two models saw the short clearly. Two were uncertain. The system reported that ambiguity honestly — and the trade still worked because the bears had specific, convergent reasoning (the volume void, the regime shift, the resistance rejection) while the bulls had more generic arguments (momentum, earnings history). Quality of reasoning matters, not just vote count.

Why this model diversity matters: GPT-5.2, Claude Opus 4.5, Gemini 2.5 Pro, and Claude Sonnet 4.5 each have fundamentally different training data, reasoning architectures, and cognitive biases. This isn't four copies of the same brain — it's four genuinely different perspectives. When they converge on specific price levels and market mechanics, that convergence carries weight. When they disagree, the nature of the disagreement itself is informative. That's the Tetlock principle made operational: the value isn't in any single forecaster, it's in the synthesis across forecasters who think differently.

Why We Designed It This Way

The research that shaped our architecture came from two places.

The first was Philip Tetlock's work on superforecasters — the people who consistently predict future events better than experts and prediction markets. Tetlock found two things that directly influenced how we built this:

Teams of forecasters dramatically outperform individuals. Not because any single team member is smarter, but because synthesizing multiple perspectives catches blind spots that any single viewpoint misses. That's why we run multiple specialized agents rather than one generalist model.

The best forecasters keep score. They track their accuracy obsessively, identify where they're systematically wrong, and adjust. Most pundits, analysts, and AI tools never do this. We decided from day one that we would.

The second influence was ForecastBench, the dynamic benchmark run by the Forecasting Research Institute. Their data shows that single LLMs are improving fast at real-world prediction — projected to match superforecaster-level accuracy within a couple of years — but they still struggle with calibration. They hedge toward 50%, they overweight recent news, and they construct plausible narratives instead of doing probabilistic math.

Multi-agent architectures are the most promising path to closing that gap. That's the bet we made.

The Feature That Changes Everything

There are plenty of AI tools that will generate a trade signal. TradeHorde does that too — users can run an analysis on any ticker or theme, and the platform produces a structured view with directional conviction and key levels.

But the feature we're most proud of is Outcomes.

We track what happens after every signal the system generates. Not buried in fine print, not in a quarterly report nobody reads — it's a first-class feature in the navigation, right next to the signals themselves. Users can see which calls hit, which missed, and by how much.

This seems like it should be table stakes. It isn't. Almost no AI trading tool voluntarily exposes its track record. The ones that do typically cherry-pick. We put it all out there because we believe that if you're not willing to be measured, you shouldn't be making claims.

We also knew that publishing our results would force us to get better. There's nothing like a public scorecard to sharpen your engineering.

The Numbers (And Why They Matter More Than You'd Think)

The Outcomes dashboard lets users filter by conviction tier, and the comparison between tiers reveals something we're genuinely proud of — evidence that the multi-agent consensus mechanism is working.

High- and Conditional-conviction signals only (where 3+ models reached consensus with high confidence):

  • 64.5% win rate — 20 wins, 11 losses across 31 signals
  • +137.7% total return
  • 3.77 profit factor — gross profits nearly 4x gross losses
  • +1.2R average R-multiple (risk-adjusted)
  • Average win: +6.4% (2.5R) vs average loss: -2.7% (-1.0R)

A 3.7 profit factor is exceptional by any standard. Anything above 2.0 is considered strong. Above 3.0 is elite. We're at nearly 4:1.
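
For anyone unfamiliar with the metric, profit factor is gross profits divided by gross losses across all signals:

```python
def profit_factor(signal_returns: list[float]) -> float:
    """Gross profits over gross losses, both as positive magnitudes."""
    gross_profit = sum(r for r in signal_returns if r > 0)
    gross_loss = -sum(r for r in signal_returns if r < 0)
    return gross_profit / gross_loss

# A profit factor of 3.77 means every dollar given back on losing
# signals was offset by $3.77 earned on winning ones.
```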

All conviction tiers combined (65 signals total):

  • 46% win rate — 30 wins, 35 losses
  • +47.4% total return
  • 1.4 profit factor
  • +0.6R average R-multiple
  • Average win: +6.5% (2.5R) vs average loss: -4.2% (-1.0R)

At first glance, 46% looks like a coin flip. But the system is still profitable because winners average 6.5% while losers average only -4.2% — a favorable asymmetry that produces positive expectancy even below a 50% hit rate.
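
That asymmetry is the expectancy formula at work. Plugging in the published averages, in R terms:

```python
win_rate = 0.46                    # all conviction tiers combined
avg_win_r, avg_loss_r = 2.5, 1.0   # from the averages listed above

# Expectancy per signal = P(win) * avg_win - P(loss) * avg_loss
expectancy = win_rate * avg_win_r - (1 - win_rate) * avg_loss_r
print(f"{expectancy:+.2f}R per signal")  # +0.61R despite a sub-50% hit rate
```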

But here's what actually matters. Compare the two views side by side, and a clear pattern emerges: the average win size is virtually identical regardless of conviction level (6.4% vs 6.5%, both 2.5R). When the system is right, it's right by about the same amount whether conviction is high or low. But the average loss is much worse on lower-conviction signals (-4.2% vs -2.7%). Weak-conviction trades that fail don't just lose more often — they lose bigger.

This means the conviction filter isn't cosmetic. It's doing exactly what we designed it to do: the system knows when it knows. When the multi-agent horde reaches strong consensus, the signals are dramatically better — not marginally, but 3.77 vs 1.4 on profit factor. That kind of stratification between conviction tiers is one of the hardest things to achieve in any forecasting system, and it maps directly to the calibration problem that single-model LLMs consistently fail at in benchmarks like ForecastBench.

The full three-tier breakdown makes this even clearer.

The Signal Map — Where You Can Actually See It Working

The Outcomes page also includes a visual signal map that makes the conviction story viscerally obvious. Every signal is plotted as a dot — green for long, orange for short — with a progress ring around it that shows green when the trade is in profit and red when it's underwater. The X-axis is conviction score, and signals move through lifecycle stages from Pending down through New, Active, and Resolved.
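
One plausible shape for the data behind each dot on that map; the field names are our invention, not TradeHorde's schema:

```python
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    """Lifecycle rows on the map, top to bottom."""
    PENDING = "pending"
    NEW = "new"
    ACTIVE = "active"
    RESOLVED = "resolved"

@dataclass
class SignalDot:
    ticker: str
    direction: str        # "long" renders green, "short" renders orange
    conviction: float     # 0-100, the dot's x-axis position
    open_pnl_pct: float   # progress ring: green if > 0, red if < 0
    stage: Stage
```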

The resolved row at the bottom is where you can see the full history, and the color pattern tells the story at a glance:

  • Low conviction (below ~50): 29.0% win rate across 34 signals — overwhelmingly red progress rings, especially on the short side. The left side of the chart is a graveyard.
  • Conditional conviction (~50-70): 62% win rate across 21 signals — the color flips. Green progress rings dominate, and the system is calling direction correctly on both longs and shorts.
  • High conviction (~75-100): 70% win rate across 10 signals — consistent with Conditional, reinforcing that the threshold matters more than the gradient.

Monotonic Conviction Tiers

The conviction spectrum isn't linear; it's monotonic, and closer to a step function: below ~50, the signals are actively harmful. Above ~50, there's a real edge. That's a clean, actionable threshold — and one that's visually obvious from the chart in a way that a table of numbers can't convey.
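
In code, that step function is almost embarrassingly simple. The cutoffs below are read off the chart, so treat them as descriptive, not official:

```python
def conviction_tier(score: float) -> str:
    """Bucket a 0-100 conviction score into the map's three tiers."""
    if score < 50:
        return "low"          # 29% historical win rate: skip these
    if score < 75:
        return "conditional"  # 62% win rate: real edge
    return "high"             # 70% win rate
```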

There's a subtler insight in the data too. The low-conviction zone is disproportionately failed shorts — orange dots with red rings clustered on the left. This makes intuitive sense. Short calls are inherently harder, and when the agent ensemble can't reach consensus on a short, it's often fighting the broader structural drift in equities. The conviction filter isn't just separating good from bad — it's especially good at filtering out the dangerous shorts where the horde is unsure.

The upper rows of the map show 34 currently active signals moving through the pipeline in real time — users can watch which signals are tracking green and which are struggling before they resolve. It's a live portfolio view, not just a historical report card.

We're honest about the caveat: 65 total signals is still a relatively small sample. We need to see these numbers hold across 100+ signals and different market regimes before claiming victory. But as early-stage performance data goes, we're encouraged — and the fact that we're publishing it at all is the point.

How the Product Works

The user workflow is designed around a complete analytical loop:

Browse lets users scan what analyses are already live — a quick pulse on what the system and other users are focused on. Analyze is the engine — submit any ticker or theme and the horde goes to work. The output isn't a single bullish or bearish take; it's a structured synthesis that acknowledges tension between different signals. When the technical picture is bullish but macro conditions are deteriorating, the system doesn't just average those into a lukewarm "hold." It surfaces the conflict explicitly and delivers a conviction-weighted view.

Signals distills the analysis into actionable output — direction, conviction, key levels. Outcomes tracks what happened. And Research provides deeper thematic work — regime analysis, cross-asset dynamics, event-driven setups — for users who want macro framing beyond individual tickers.

The whole loop feeds back into itself. Every resolved signal is data that makes the next signal better.

Where We Sit in the Landscape

We built TradeHorde knowing the space was getting crowded. Here's how we see our position:

TradingAgents (Tauric Research) is the most cited open-source multi-agent framework — great for researchers, but it's a backtesting tool, not a live product.

FinRobot (AI4Finance Foundation) is a broader platform integrating LLMs with quant methods, but it's aimed at developers, not end users.

RockAlpha / AI-Trader run live competitive arenas where different LLMs trade real capital — fascinating for benchmarking but more of a spectator sport than a decision-support tool.

CrowdWisdom Trading aggregates human trader opinions using AI — interesting hybrid but dependent on the quality of the humans in the crowd.

We occupy a different position: the multi-agent depth of the research frameworks, the accountability of a competitive arena, and the usability of a consumer product. The closed feedback loop — generate, signal, track outcomes, improve — is the architectural choice that ties it all together.

What's Next

We're early, and we know it. Here's where we're headed:

More transparency on the agent architecture. We want the community to understand what's under the hood — how many agents are involved, what their specializations are, how the synthesis works. The more people understand the methodology, the more trust we earn.

Fine-tuning our consensus layer. As our sample of resolved outcomes grows, so does our ability to replay that data and fine-tune the consensus layer against it.

More signals, more data, more regimes. The numbers are strong so far. Now we need to prove they hold at scale, in drawdowns, in regime shifts, across asset classes. That's the real test, and we intend to take it in public.

The Bottom Line

The AI trading space is crowded with tools that generate confident-sounding analysis but never face consequences for being wrong. We built TradeHorde to be the opposite: a system that keeps score, publishes its results, and gets better because of it.

The numbers back it up: a 3.7 profit factor and 137% total return on high-conviction consensus signals is not a demo. It's not a backtest. It's published, auditable performance on live market calls. And the clean stratification between conviction tiers — showing the system genuinely knows when its confidence is warranted — is the kind of calibration that most AI forecasting systems can't achieve.

It's early. The sample size will grow. But the architecture is right, the transparency is real, and the results so far speak for themselves.

tradehorde.ai


TradeHorde provides market analysis and ideas only. It does not execute trades or provide financial advice. All trading involves risk. Past performance does not guarantee future results.