AI Beat Human Forecasters at Startups, Hybrid Mix Made It Worse

Ishan Crawford, 4 hours ago

Google’s Gemini 2.5 Pro correctly ranked nearly four out of five startup pairs by future fundraising success. The 346 human managers it competed against managed three out of five at their best; many landed barely above a coin flip. Both numbers come from a prospective tournament run by researchers at the University of Michigan, NYU and Indiana University, published in late May on arXiv as The Strategic Foresight of LLMs.

The result that should unsettle every chief data officer sits one line deeper in the abstract. Combining AI forecasts with human judgement, the architecture most companies use for high-stakes decisions, made the predictions worse than the AI working alone.

What the 870 Comparisons Showed

The tournament asked frontier large language models (LLMs, software that generates text from massive training corpora) and human evaluators to rank 30 US-based crowdfunding ventures by how much money each would raise. All 30 projects launched after the AI systems’ training cutoffs, so no model had memorised the answer. The forecasts went in before any campaign closed.

Gemini 2.5 Pro produced a rank correlation of 0.74 with the eventual fundraising tallies. Several other frontier models cleared 0.60. The 346 managers recruited through Prolific landed between 0.04 and 0.45. Three MBA-trained investors working under monitored conditions did not beat the strongest software.

Then the researchers blended the two. They pooled human rankings with model rankings the way a wisdom-of-the-crowd workflow would, expecting the usual lift. They got the opposite. Felipe Csaszar, the Alexander M. Nick Professor of Strategy at Michigan Ross, calls the finding the augmentation trap, and it is the part of the paper most likely to outlast the headline number.

Image: AI forecasting tournament study shows Gemini outperforming human startup investors.

How a Live Kickstarter Tournament Was Built

Csaszar, with co-authors Aticus Peterson and Daniel Wilde, designed the experiment to close every escape hatch a sceptical reader might reach for. Past LLM forecasting tests have leaked: a model trained on data through January cannot be fairly tested on an event from the previous December, because the outcome may already sit in its training data. So the team pulled 30 live Kickstarter campaigns in the technology category, each launched after the most recent model in the field had finished training.

Every venture was scored by software and by humans before its campaign deadline. The eventual money raised gave the team a hard ground truth. From there, 870 pairwise comparisons were run through the models, producing a complete ranking of which projects would outperform which.
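The article does not spell out how the pairwise answers were turned into a ranking. As a rough illustration only, the sketch below counts wins across every ordered pair, a simple win-count rule that is an assumption here and not necessarily the authors' method. It also shows where a figure like 870 can come from: 30 ventures yield 30 × 29 = 870 ordered pairs, which would fit a design that presents each pair in both orders (again an inference, not something the write-up confirms). The venture names and the stand-in forecaster are hypothetical.

```python
from itertools import permutations

def rank_from_pairwise(items, prefers):
    """Rank items by number of pairwise wins (a simple win-count rule).

    prefers(a, b) returns True when the forecaster picks a over b.
    """
    wins = {item: 0 for item in items}
    n_comparisons = 0
    for a, b in permutations(items, 2):  # every ordered pair (a, b) with a != b
        n_comparisons += 1
        if prefers(a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    # Most wins first; ties keep their original order because sorted() is stable
    ranking = sorted(items, key=lambda item: wins[item], reverse=True)
    return ranking, n_comparisons

# Hypothetical stand-in forecaster: always prefers the venture with the higher latent score
ventures = [f"venture_{i:02d}" for i in range(30)]
latent_score = {name: i for i, name in enumerate(ventures)}

ranking, n = rank_from_pairwise(ventures, lambda a, b: latent_score[a] > latent_score[b])
print(n)           # 870
print(ranking[0])  # venture_29
```

A real forecaster would not be perfectly consistent across pairs, which is exactly why a win count (or a similar aggregation) is needed to collapse the answers into one ordering.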

The human side had two arms. The first was 346 experienced managers recruited via Prolific, each handling a slice of the same pairwise task. The second was three MBA-trained investors working under closer supervision, the kind of evaluator a venture firm might actually employ for early diligence.

Methodology choices worth flagging:

Prospective design. Predictions went in before outcomes were known, ruling out hindsight and training-set leakage.
Pairwise scoring. Evaluators were asked which of two projects would raise more, sidestepping the calibration problem of demanding absolute dollar forecasts.
Mixed model line-up. Frontier proprietary systems were tested alongside open-weight contenders, not just one company’s release.
Separate human cohorts. Crowdsourced managers and MBA-trained investors were tested separately, so the comparison was not just AI versus amateurs.

Where Each Model Landed

Forecaster	Rank Correlation	Pairs Ranked Correctly
Gemini 2.5 Pro (Google)	0.74	About 4 in 5
Other frontier models	Above 0.60	Roughly 3.5 in 5
Strongest human cohort	0.45	About 3 in 5
Weakest human cohort	0.04	Barely above chance
AI plus human blend	Below the top AI score	Worse than AI alone
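For readers who want to see how the two columns relate, here is a minimal sketch of both measures. The article does not say which rank correlation the paper reports, so Spearman's coefficient is an assumption here, and the five-venture dataset is invented purely for illustration. Tied values are not handled.

```python
from itertools import combinations

def ranks(values):
    """Return 1-based ranks (1 = largest value); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(predicted, actual):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(predicted)
    d_squared = sum((p - a) ** 2 for p, a in zip(ranks(predicted), ranks(actual)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

def pairwise_accuracy(predicted, actual):
    """Share of unordered pairs that the forecast orders the same way as the outcome."""
    pairs = list(combinations(range(len(actual)), 2))
    correct = sum(
        (predicted[i] - predicted[j]) * (actual[i] - actual[j]) > 0
        for i, j in pairs
    )
    return correct / len(pairs)

# Hypothetical toy data: forecast scores for five ventures and the dollars each raised
forecast = [0.9, 0.7, 0.8, 0.2, 0.4]
raised = [50_000, 42_000, 30_000, 8_000, 5_000]

print(spearman(forecast, raised))           # 0.8
print(pairwise_accuracy(forecast, raised))  # 0.8
```

The two numbers coincide in this toy case by accident. In general they measure different things and do not convert one-to-one, so the table's two columns are best read side by side, not derived from each other.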

The spread inside the human results is as instructive as the gap to AI. Prolific evaluators scattered widely; a few clustered near the trained investors, most sat closer to chance. That dispersion is what statisticians call noise, and it is precisely the input that hurts a hybrid pipeline when a human vote is averaged into a clean machine signal.

The human range is itself wide: the strongest human cohort reached 0.45 while the weakest fell to 0.04. The leading model exceeded 0.70, with the next tier above 0.60. Said differently, even mid-pack frontier models outscore the best human cohort, and the leading model laps the rest entirely.

Why Mixing Humans In Made It Worse

The augmentation trap was not a bug discovered halfway through analysis. It was a test the team ran on purpose, expecting to find the opposite. Decades of wisdom-of-the-crowds work suggests that aggregating independent guesses, even mediocre ones, beats any single source. The premise broke when one source was already extracting more signal than the others could add.

“In this case, the wisdom-of-the-crowd logic doesn’t produce an improvement in accuracy. If you include a human in the mix, performance decreases.”

That was Csaszar, speaking when the working paper was released. The mechanism, as the authors describe it, is the variation Daniel Kahneman, Olivier Sibony and Cass Sunstein named noise in their 2021 book by that title. Noise is what makes a credit officer lenient on Tuesday and strict on Friday for the same file. When that scatter is averaged into an AI prediction that has already extracted the signal cleanly, the average does not steady the answer. It pulls it sideways.
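A toy simulation makes the mechanism concrete. It is a sketch under loud assumptions, not a reconstruction of the paper's analysis: each forecaster sees the true quality plus independent Gaussian noise, the blend is a plain equal-weight average, and accuracy is measured with a Pearson correlation on scores instead of the paper's rank correlation. All noise levels are invented. Under those assumptions the blend's noise variance is (σ_ai² + σ_human²) / 4, so averaging only beats the AI alone when σ_human² is less than 3 × σ_ai².

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def simulate(ai_noise, human_noise, n_items=30, n_trials=500, seed=0):
    """Average correlation with the truth for AI, human and an equal-weight blend."""
    rng = random.Random(seed)
    totals = {"ai": 0.0, "human": 0.0, "blend": 0.0}
    for _ in range(n_trials):
        truth = [rng.gauss(0, 1) for _ in range(n_items)]
        ai = [t + rng.gauss(0, ai_noise) for t in truth]        # truth plus small noise
        human = [t + rng.gauss(0, human_noise) for t in truth]  # truth plus human noise
        blend = [(a + h) / 2 for a, h in zip(ai, human)]        # naive 50/50 average
        totals["ai"] += pearson(ai, truth)
        totals["human"] += pearson(human, truth)
        totals["blend"] += pearson(blend, truth)
    return {name: total / n_trials for name, total in totals.items()}

# Hypothetical noise levels: a clean model next to a much noisier human
lopsided = simulate(ai_noise=0.7, human_noise=3.0)
# Hypothetical noise levels: two equally noisy sources, the classic crowd setting
balanced = simulate(ai_noise=1.0, human_noise=1.0)

print("lopsided:", {k: round(v, 2) for k, v in lopsided.items()})
print("balanced:", {k: round(v, 2) for k, v in balanced.items()})
```

In the lopsided run the blend should land between the two sources and below the AI alone, while in the balanced run the blend should beat both. That contrast is the whole argument in miniature: crowd averaging pays off among peers and costs accuracy when one voice is far cleaner than the other.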

The hybrid result is bad news for one of the most quietly assumed architectures in enterprise AI. The default deployment pattern across venture capital screening, lending, hiring software and content moderation has been simple: let the model rank, let a human sign off. The assumption is that the human catches errors the model misses. The Kickstarter tournament suggests the human is more likely to introduce errors of their own.

There is a narrower way to read it. The task was a forecasting one, and the human contribution was not deep domain expertise but cohort variation. A senior partner at Sequoia might add more than a Prolific recruit. The paper does not test that. But it does suggest the burden of proof has shifted, and a hybrid pipeline now needs to show its lift instead of assuming it.

How AI Escapes Bounded Rationality

The team’s framing borrows from Herbert Simon, the Nobel economist who argued in the 1950s that humans make decisions inside the limits of their time, memory and information processing. Simon called the limit bounded rationality, and the term has shaped half a century of behavioural economics. The Michigan paper’s contribution is to point out which boundary AI no longer respects.

“No human has read as much as ChatGPT; no human has as much time to think about each project,” the lead author said when the results were released. A frontier model can hold years of crowdfunding patterns in working context, run thousands of comparisons in the time a manager evaluates one, and never tire of the 1,247th pairwise question.

The authors link forecasting performance to scores on reasoning benchmarks such as Humanity’s Last Exam, the hard test built by the Center for AI Safety and Scale AI to resist memorisation. Models that do well on advanced reasoning, the data suggests, also do well on strategic foresight. If predicting startup success simply tracked pattern-matching against past Kickstarters, smaller domain-specific tools would win. The fact that broad reasoning ability predicts foresight is the part of the paper that scales beyond crowdfunding to anything else a strategist might do.

The Verdict for Human-in-the-Loop Systems

For three years, human-in-the-loop has been the safety phrase regulators reach for and vendors print on slide 14 of every enterprise deck. The OECD AI Principles flag it. The EU AI Act bakes it into its high-risk categories. The US National Institute of Standards and Technology AI Risk Management Framework names it as a control. The Michigan paper does not invalidate that framing for ethics or accountability. It complicates it for accuracy.

A few numbers worth pinning to the wall:

0.74 rank correlation, the ceiling a single frontier model reached without human input.
0.04 rank correlation, the floor a human cohort hit, almost indistinguishable from random.
30 ventures produced the entire dataset, small enough that confidence intervals will matter for any replication attempt.
Zero, the measurable accuracy humans added when their forecasts were averaged with the model’s; the blend scored below the AI alone.
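To get a feel for how much uncertainty 30 ventures leave around a correlation of 0.74, one back-of-envelope option is the Fisher z-transformation. This is a sketch, not the paper's own interval: the transformation is designed for Pearson correlations, and applying it unchanged to a rank correlation is an approximation that, if anything, makes the interval slightly too narrow. The function name is hypothetical.

```python
import math

def fisher_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r from n observations.

    Uses the Fisher z-transformation with standard error 1 / sqrt(n - 3),
    which is the Pearson-correlation form (an approximation for rank correlations).
    """
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

low, high = fisher_interval(0.74, 30)
print(round(low, 2), round(high, 2))  # 0.52 0.87
```

Under this approximation the headline 0.74 is compatible with anything from roughly 0.52 to 0.87, which is why a larger replication would sharpen the claim.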

Business adoption is climbing regardless, largely on the assumption that a human still checks the output; recent adoption data from Scotland shows 96 percent of AI-using firms reporting higher productivity. The practical reading for a venture firm or corporate strategy team is uncomfortable. If the deal-screening tool already runs a frontier model, the analyst’s mark-up may not be improving the score. It may be dragging it down by an amount the firm has never measured because nobody runs the counterfactual.

That does not mean firing the analyst. The analyst does work the model cannot: sourcing the deal, building the relationship, sitting in the room when the founders argue. But the part where an analyst overrides the model’s ranking with personal conviction now needs an audit trail. For regulated sectors, the trap is sharper. Credit underwriting, insurance pricing and clinical triage have all moved toward AI scoring with human sign-off, and the Michigan result suggests that sign-off may be introducing noise regulators have not yet thought to police.

What the Paper Stops Short of Claiming

The authors were careful when asked whether this counts as a “Deep Blue moment” for business strategy, a reference to the IBM chess milestone of 1997. They declined the comparison. The tournament, as described in the write-up published by Michigan Ross, tested forecasting in one narrow domain, not the full breadth of strategic decisions a firm makes. Pricing, market entry, M&A negotiation and talent calls all involve context Kickstarter campaigns do not.

The dataset is also modest. The pairwise comparisons clear statistical significance for the rank-ordering result; they do not settle every downstream debate the result will trigger. A replication on a different platform, with different stakes, and with senior investors rather than recruited managers would tighten the claim considerably.

What the paper does say plainly is that AI has crossed a line in one corner of strategic work it was not credited with before. If hybrid pipelines reliably underperform the AI alone on this task, every firm that has built a human-in-the-loop workflow for prediction faces a measurement question it has been able to avoid. If the firms that run the counterfactual openly in the next twelve months find the same pattern, the augmentation trap moves from a working-paper curiosity to a redesign mandate. If they find that genuine domain expertise rescues the hybrid, the architecture survives in a narrower form. Either way, the assumption that adding a human always helps does not survive contact with the data the way it did before this paper.