LLMs Are Closing the Gap on Human Superforecasters
We opened our AI forecasting benchmark to external submissions. Here’s what happened.
(Post written by Houtan Bastani, Simas Kučinskas, and Matt Reynolds)
In October, we opened our AI forecasting benchmark, ForecastBench, to external submissions. The challenge: beat superforecasters using any tools available, from scaffolding to fine-tuning. Several teams responded, including xAI, Cassi, Lightning Rod, and Mantic. We thank all of them for participating in this challenging benchmark.
The result? External submissions now hold #2 and #3 on our leaderboard, outperforming all our baseline LLM configurations. Superforecasters still hold #1.
Here are the current standings (lower Brier scores are better):
Two external submissions—xAI’s Grok 4.20 (Preview) and Cassi’s ensemble_2_crowdadj—are tied at #2, both ahead of our own runs of GPT-5, o3, and Claude Sonnet 4.5. Superforecasters are #1, leading state-of-the-art LLMs by 0.017 Brier points.
How large is the gap between LLMs and superforecasters? The current 0.017-point gap represents about one year of LLM progress. For comparison, Claude 3.5 Sonnet (released October 2024) achieved a Brier score of 0.117; a year later, Grok 4.20 (Preview, run in October 2025) achieved 0.102—an improvement of 0.015 points. If this trend continues, we extrapolate LLM-superforecaster parity in November 2026 (95% CI: January 2026 – November 2027). Dataset question parity is estimated for June 2026, and market question parity for August 2026.
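The extrapolation above can be sketched as a simple linear projection. The numbers are taken from the post; note that the confidence interval we report comes from a fuller statistical model, not from this point estimate.

```python
def months_to_parity(current_gap: float, improvement_per_year: float) -> float:
    """Months until the LLM-superforecaster gap closes, assuming a linear trend."""
    return 12 * current_gap / improvement_per_year

gap = 0.017           # superforecasters vs. best LLM, October 2025
rate = 0.117 - 0.102  # Brier improvement from Claude 3.5 Sonnet (Oct 2024)
                      # to Grok 4.20 Preview (Oct 2025)

print(round(months_to_parity(gap, rate), 1))  # → 13.6, i.e. roughly November 2026
```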
We use difficulty-adjusted Brier scores to compare forecasters even when they predicted on non-overlapping question sets. For instance, our initial superforecaster sample and the latest models on ForecastBench have zero overlapping questions. For details, see our technical report.
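For readers unfamiliar with the metric, the unadjusted Brier score is just the mean squared error between probability forecasts and binary outcomes; the difficulty adjustment itself is described in the technical report. A minimal illustration:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.
    0 is perfect; always forecasting 50% scores 0.25."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# A confident, mostly correct forecaster scores near 0:
print(round(brier_score([0.9, 0.1, 0.8], [1, 0, 1]), 3))  # → 0.02
```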
To learn more, we asked the two top-performing external teams to share how they approached ForecastBench.
xAI took a minimal approach: give the model the question, provide access to standard Grok tools (X search, web search, Python REPL), generate eight forecasts, and average them. Notably, the team used an early preview version of Grok 4.20 and expects the full 4.20 release to perform even better.
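The final averaging step is straightforward. A sketch, with placeholder numbers (only "generate eight forecasts and average them" comes from the team's description; the sampled values below are illustrative):

```python
def average_forecast(samples: list[float]) -> float:
    """Average several independent probability forecasts for one question."""
    return sum(samples) / len(samples)

# e.g., eight independent runs of the same question through the model:
runs = [0.62, 0.58, 0.65, 0.60, 0.55, 0.63, 0.59, 0.61]
final = average_forecast(runs)  # the mean of the eight samples
```

Averaging independent samples smooths out run-to-run variance in the model's reasoning, which is why even this minimal scaffold helps.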
In contrast, Cassi used a multi-stage pipeline built around retrieval and ensembling. The system generates sub-questions and search queries, retrieves up-to-date context via Tavily, and filters for relevance and recency. It then generates forecasts using an ensemble of models (o3 and GPT-5 in this run), with another LLM analyzing the reasoning to produce a final forecast.
For questions drawn from prediction platforms, Cassi adds an additional step: comparing the final forecast to the market price and, if they disagree significantly, giving an LLM a chance to review and adjust. This “crowd adjustment” helps: Cassi’s ensemble_2_crowdadj outperforms its ensemble_2 by nearly 0.01 Brier points. Cassi notes they do not use post-forecast calibration, finding that current LLMs already produce broadly well-calibrated forecasts. For more on Cassi’s work, see the company’s website and Substack.
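The triggering logic for that crowd-adjustment step can be sketched as a simple disagreement check. The threshold below is a hypothetical placeholder; Cassi does not publish the value they use, and the actual review is done by an LLM rather than a rule.

```python
DISAGREEMENT_THRESHOLD = 0.15  # hypothetical; the post gives no specific value

def needs_crowd_review(model_forecast: float, market_price: float,
                       threshold: float = DISAGREEMENT_THRESHOLD) -> bool:
    """Flag forecasts that diverge significantly from the market price,
    so an LLM can review and possibly adjust them."""
    return abs(model_forecast - market_price) > threshold

print(needs_crowd_review(0.70, 0.45))  # True: large gap, send back for review
print(needs_crowd_review(0.70, 0.65))  # False: close enough to keep as-is
```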
More models are joining the leaderboard soon. We add models 50 days after their first forecast to ensure enough questions have resolved for stable rankings. In late February, GPT-5.2 and Gemini 3 Flash will appear on the leaderboard; GLM-4.7 will follow in late March.
Think you can do better? The Tournament leaderboard is open for external submissions. See our wiki for instructions. Explore the leaderboards at forecastbench.org and download full datasets at forecastbench.org/datasets.

AI models generalize well when trained on large amounts of data. Text-generating models in particular need enormous corpora to capture the full range of contexts and generate accurate responses. That text must first be processed and then fed into a deep neural network (e.g., a transformer with billions of parameters), and training such a model requires substantial hardware resources and electrical power.
Beyond training cost, challenges such as inference latency, model size, hallucinations, and deployment constraints make optimization difficult.