How AI Benchmarks Became the New Geopolitical Battleground

Ishan Crawford

The United States poured $285.9 billion into private AI investment in 2025, 23 times China’s $12.4 billion, according to Stanford’s 2026 AI Index. Almost every test that frontier models are judged on, from MMLU to LMArena Elo to SWE-bench, was built, hosted, and refreshed inside the same handful of US universities and frontier labs that the models came from.

That overlap is no longer a curiosity for machine-learning researchers. Benchmark scores now drive procurement decisions inside finance ministries, threshold rules inside the EU AI Act, and risk language inside national AI strategies from New Delhi to Seoul. The tests have quietly become policy infrastructure, and most of the countries depending on them had no seat at the table when they were written.

Why a Test Score Now Reads as National Policy

A benchmark used to be a research artifact. A number in a paper, a leaderboard slot, a marketing line in a product launch. That description no longer captures what these tests do.

Generative AI evaluation has folded into three policy stacks at once. Regulators use benchmark families to define what counts as a high-risk model under laws like the EU AI Act. Procurement officers cite leaderboard positions when picking which model to embed in tax, health, or border systems. And national strategy documents, including the Trump administration’s Winning the Race AI Action Plan, treat the construction of an evaluation ecosystem as a strategic priority on par with chips and compute.

The Action Plan, released in July 2025, instructs the National Institute of Standards and Technology (NIST, the federal measurement agency) to build domain-specific productivity benchmarks for AI in healthcare, energy, and agriculture. The work is channelled through the Center for AI Standards and Innovation (CAISI), which until June 2025 was the US AI Safety Institute. The framing is explicit: whoever sets the measurement defines the race.

China’s posture mirrors this. State guidance pushes domestic firms to embed Chinese language, data, and value alignment into open-weight models, and benchmark suites like SuperCLUE refresh their question banks every two months to stop foreign models from gaming the leaderboard. The exported product is the same in both cases: a yardstick that travels with the model.


The Two Benchmark Stacks Competing for Global Default

The visible part of the contest is the model leaderboard. The structural part is the benchmark stack itself, and there are now two contending defaults that smaller markets are being asked to pick between.

The US stack is English-first, heavily academic in origin, and weighted toward general knowledge, code, math, and helpfulness ratings. MMLU, GPQA, SWE-bench, HumanEval, and LMArena Elo form the spine. The Chinese stack covers similar capability surfaces but ships in Mandarin, embeds local context (legal, regulatory, cultural references), and refreshes its question banks more aggressively to limit contamination. Both are open-source in the sense that the questions are public. Neither offers full reproducibility of the training pipelines behind the models it evaluates.

Attribute	US-led stack	China-led stack
Anchor benchmarks	MMLU, SWE-bench, GPQA, LMArena, HumanEval	SuperCLUE, C-Eval, CMMLU, SuperCLUE-Fin
Primary language	English	Mandarin Chinese
Refresh cadence	Annual or ad-hoc per benchmark	Question bank 100% refreshed every 2 months (SuperCLUE)
Cultural priors	Western liberal-democratic norms, English idiom	State-aligned content rules, Chinese legal context
Top-of-leaderboard firms	OpenAI, Anthropic, Google, Meta	Alibaba (Qwen), Zhipu (GLM), DeepSeek, Baichuan
Regulatory adoption	EU AI Act (via COMPL-AI), CAISI, India AI Safety Institute	Cyberspace Administration of China safety filings

Stanford’s 2026 index notes the top two frontier models are now separated by just 0.7 percentage points on the headline aggregate, and the gap between the top-ranked and the tenth-ranked model has compressed from 11.9 points to 5.4 in twelve months. When the headline scores are that close, the choice of which test you run becomes the entire story.
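
To see why a sub-one-point gap makes test selection decisive, consider a minimal Python sketch. The two models, their scores, and the weightings are all hypothetical, invented for illustration rather than drawn from the Stanford index. The point is only that two defensible weightings of the same results can crown different leaders.

```python
# Hypothetical scores (percent) for two made-up models on three benchmarks.
# None of these numbers come from a real leaderboard.
SCORES = {
    "model_a": {"knowledge": 88.0, "code": 71.0, "math": 80.0},
    "model_b": {"knowledge": 86.5, "code": 74.0, "math": 79.5},
}


def aggregate(scores, weights):
    """Weighted average of per-benchmark scores."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def leader(weights):
    """Return the model with the highest weighted aggregate."""
    return max(SCORES, key=lambda model: aggregate(SCORES[model], weights))


# Two illustrative weightings: one favours general knowledge, one favours code.
knowledge_heavy = {"knowledge": 3, "code": 1, "math": 1}
code_heavy = {"knowledge": 1, "code": 3, "math": 1}

print(leader(knowledge_heavy))  # model_a
print(leader(code_heavy))       # model_b
```

Under these assumed numbers the aggregates differ by less than 1.5 points either way, yet the ranking flips depending on which capability the evaluator chooses to emphasise.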

When the Leaderboard Gets Gamed

The Llama 4 Maverick episode is the clearest recent illustration that benchmark scores can be engineered for leaderboard performance and still leave the shipping product behind. In April 2025, Meta submitted a build labelled “Llama-4-Maverick-03-26-Experimental” to LMArena, where it landed at number two. Developers downloading the public release noticed something off.

The publicly available Llama-4-Maverick-17B-128E-Instruct, the version users could actually run, fell to 32nd place on the same board once it was scored. Analysis of the submitted version showed it produced longer, more emoji-heavy, more conversational responses, the precise stylistic profile that human raters on LMArena tend to prefer. Meta acknowledged the build was an “experimental chat version” optimised for conversationality and denied training on test sets. LMArena tightened its policies to require providers to submit production builds.
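
The mechanism is easier to see with the arithmetic behind a pairwise rating. The sketch below uses the textbook Elo update, not LMArena’s actual scoring pipeline, which this article does not describe. The starting ratings, the K-factor of 32, and the 70 percent win rate are assumptions chosen for illustration. It shows how a build that wins more head-to-head votes, for whatever reason raters prefer it, pulls ahead of a rival that started level.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, a_won, k=32.0):
    """Return both ratings after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - exp_a)
    # Whatever A gains, B loses, so the rating pool stays constant.
    return rating_a + delta, rating_b - delta


# Two hypothetical builds start level. Assume raters pick the chattier
# build in 7 of every 10 votes (an illustrative figure, not a measurement).
chatty, plain = 1500.0, 1500.0
votes = [True] * 7 + [False] * 3
for _ in range(20):
    for chatty_won in votes:
        chatty, plain = update(chatty, plain, chatty_won)

print(round(chatty), round(plain))
```

Nothing in the update asks why a response won. A stylistic edge and a capability edge look identical to the rating.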

That episode sits inside a wider pattern documented in the academic literature. Surveys of benchmark contamination, the phenomenon where a model’s training corpus accidentally or deliberately contains the questions it will later be tested on, suggest the problem now affects most popular evaluation sets at some rate. A separate Stanford finding sharpens the structural worry: nearly 90% of notable AI models released in 2024 came from industry labs, up from 60% in 2023, which means the firms building the models, the firms scoring the highest on the benchmarks, and the firms most able to optimise for the benchmarks are converging.
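
One common family of contamination checks looks for long word sequences shared between a benchmark question and the training corpus. The sketch below is a simplified illustration of that idea under stated assumptions: the function names, the 8-word window, and the sample strings are all hypothetical, and real audits run over far larger corpora with more careful normalisation.

```python
import re


def ngrams(text, n):
    """Set of lowercase word n-grams in text, ignoring punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(question, corpus_doc, n=8):
    """Share of the question's n-grams that also appear in corpus_doc."""
    q = ngrams(question, n)
    if not q:
        # Question shorter than the window: nothing to compare.
        return 0.0
    return len(q & ngrams(corpus_doc, n)) / len(q)


# Hypothetical benchmark item and two hypothetical training documents.
question = "Which gas makes up the largest share of the Earth's atmosphere by volume?"
leaked = ("Quiz dump: which gas makes up the largest share of the "
          "Earth's atmosphere by volume? Answer: nitrogen.")
clean = "Nitrogen is the most abundant gas in the air we breathe."

print(overlap_ratio(question, leaked))  # 1.0
print(overlap_ratio(question, clean))   # 0.0
```

A check like this only catches verbatim or near-verbatim leakage. Paraphrased or translated copies of a question would pass unnoticed, which is one reason contamination rates are hard to pin down.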

According to the 2025 AI Index, 63.2 percent of highlighted benchmarks are used by only a single model builder, meaning most benchmark mentions in launch materials are not shared standards but bespoke marketing instruments.

The Countries Testing Models Against Other People’s Tests

For governments outside the US and China, almost every meaningful frontier evaluation today is run on benchmarks designed somewhere else. That is the practical face of what researchers are starting to call evaluation sovereignty: the question of who defines the tests through which AI systems are judged in your jurisdiction, in your languages, against your social and security risks.

The dependency is not theoretical. Research on dominant large language models has repeatedly found that they reflect Western or English-speaking cultural priors, and India-specific studies have documented caste and religious stereotypes that conventional Western fairness benchmarks fail to capture. A model that scores cleanly on a US bias suite can still misbehave on a deployment in Tamil, Yoruba, or Bahasa Indonesia, and the buyer often has no domestic test set sophisticated enough to catch it before procurement.

Where the Gap Is Widest

Three structural gaps recur across mid-sized and emerging-market AI strategies. Languages and dialects below the top twenty are thinly represented in evaluation suites. Legal and constitutional norms specific to a country (data residency, hate-speech thresholds, electoral content rules) are absent from generic safety benchmarks. And sectoral risks that matter locally (financial-inclusion fraud patterns in Africa, agricultural advisory accuracy in South Asia, language-of-record requirements in Latin American courts) are rarely covered at all.

Why It Persists

Frontier labs keep training data, model weights, and full training pipelines closed for their most capable systems, even when the benchmarks are public. External researchers and regulators are left relying on company disclosures, voluntary red-team reports, and third-party audits rather than full technical inspection. That asymmetry preserves the labs’ commercial advantage and locks in the original benchmark designers as the de facto referees of capability claims worldwide.

The Regulatory Layer Catching Up

Three institutional moves over the past eighteen months have started to compete with the leaderboard-as-policy default. They are uneven, mostly underfunded, and individually narrow, but together they sketch what a more pluralistic evaluation regime could look like.

The AI Safety Institute network, agreed at the May 2024 AI Seoul Summit, links national bodies in the United States, United Kingdom, Japan, France, Germany, Italy, Singapore, South Korea, Australia, Canada, and the European Union, with a mandate to coordinate pre-deployment testing of frontier models and publish shared evaluation methodologies.
COMPL-AI, launched on October 16, 2024, by ETH Zurich, Bulgaria’s INSAIT institute, and LatticeFlow AI, translates the EU AI Act’s regulatory language into 27 technical benchmarks and ran the first compliance-style evaluations of OpenAI, Meta, Google, Anthropic, and Alibaba foundation models. The suite is open source and can be re-pointed at future regulations beyond the AI Act.
India’s AI Governance Guidelines, published in November 2025 ahead of the AI Impact Summit 2026, set out seven principles (trust, people-first governance, innovation over restraint, fairness and equity, accountability, understandability by design, and safety, resilience and sustainability) and create three new bodies, including an India AI Safety Institute. The guidelines explicitly anchor safety testing in domestic harms, including those tied to caste, religion, and language minorities.

France’s INESIA roadmap for 2026 to 2027 follows the same arc, bringing Inria, the cybersecurity agency ANSSI, the metrology lab LNE, and the digital regulator PEReN into a single evaluation architecture. CAISI in the US plans to publish a harmonised benchmark catalogue and co-lead ISO working groups on AI evaluation in 2026. None of these efforts produces a finished alternative stack. Each is a deliberate attempt to put a domestic referee between the model and the deployment.

What Evaluation Sovereignty Would Require

The current critique of benchmarks inside the research community is sharper than the policy response. Safety benchmarks compress fairness, risk, and discrimination into fixed metrics that cannot, on their own, guarantee long-term safety. Most dominant suites are static, while real human-AI interaction is dynamic, context-dependent, and shaped by socio-cultural factors a multiple-choice test cannot model. Benchmark design itself is concentrated among researchers at a small number of elite universities and firms, raising structural questions about whose languages and use cases get encoded as the default.

A practical evaluation-sovereignty programme, in the framing the European Commission’s Joint Research Centre has begun to favour, has four moving parts. Domestic institutions capable of authoring and refreshing local benchmarks. Independent auditors funded outside the model providers’ commercial relationships. Compliance frameworks that translate national law into measurable model tests, the way COMPL-AI did for the EU. And ongoing harmonisation through bodies like the AISI network so that local tests can talk to each other without collapsing into a single Western or single Chinese default.

None of the four is technically out of reach for a mid-sized economy. All four together remain rare. As of May 2026, the share of countries with domestic institutions running production-grade frontier evaluations against local benchmarks remains in the low single digits, and the share with full statutory authority to do so is smaller still.

Frequently Asked Questions

What is an AI benchmark and why does it matter geopolitically?

An AI benchmark is a standardised test suite that measures how well a model performs on tasks such as general knowledge (MMLU), coding (HumanEval), software engineering (SWE-bench), or human-preference ratings (LMArena Elo). It matters geopolitically because benchmark scores now feed directly into national strategies, regulatory thresholds, and government procurement decisions, making the choice of test a form of soft power.

Why are US-developed benchmarks dominant?

Most widely cited benchmarks were authored at US universities or US-headquartered frontier labs that also build the leading models. The 2025 AI Index found that nearly 90% of notable AI models released in 2024 came from industry, and the same firms supply the benchmarks they top, which entrenches an English-language, Western-priors default across global evaluation.

What is evaluation sovereignty?

Evaluation sovereignty is the principle that domestic institutions should be able to design, audit, and validate the benchmarks used to judge AI systems deployed in their jurisdiction, in their languages, against their legal norms and social risks, rather than relying solely on imported leaderboards.

What happened with the Llama 4 Maverick benchmark?

In April 2025 Meta submitted an experimental, conversational build of Llama 4 Maverick to LMArena that ranked second, while the publicly released version of the same model placed 32nd on the same board. The incident pushed LMArena to require providers to submit production builds and became a frequently cited example of leaderboard gaming.

What is COMPL-AI?

COMPL-AI is an open-source compliance evaluation framework launched on October 16, 2024 by ETH Zurich, Bulgaria’s INSAIT institute, and LatticeFlow AI. It translates the EU AI Act into 27 technical benchmarks and produced the first regulatory-style scorecards for OpenAI, Meta, Google, Anthropic, and Alibaba foundation models.

Which countries have a national AI Safety Institute?

The AI Safety Institute network agreed at the May 2024 AI Seoul Summit includes the United States (now CAISI), the United Kingdom, Japan, France, Germany, Italy, Singapore, South Korea, Australia, Canada, and the European Union. India announced an AI Safety Institute as part of its November 2025 governance guidelines.

How can benchmarks be gamed?

Benchmarks can be gamed through training-data contamination (the test questions appear in pre-training data), submission of specially optimised non-public model variants, fine-tuning to favour the response style human raters prefer, and selective reporting of benchmarks where a model performs well while omitting those where it does not.
