Who’s Grading the Graders? The AI Peer Review Scandal That Shook Academic Science
By the time Graham Neubig finished reading his peer review, something felt deeply wrong. The Carnegie Mellon AI researcher had received a 3,000-word evaluation of his latest paper — dense with bullet points, oddly verbose, and bristling with 40 listed weaknesses. The critiques demanded statistical analyses nobody in his field ever asks for. The feedback missed the entire point of his work. And some of the citations the reviewer invoked did not exist.
He was not being evaluated by a lazy graduate student or a distracted professor. He was, in his own words, being “bossed around by low-quality AI.”
That experience was not an anomaly. It was a data point in what would become one of the most alarming academic scandals of recent years.
The Numbers Are Staggering
In November 2025, Pangram Labs — working from a public challenge issued by Neubig — screened all 75,800 peer reviews submitted to the International Conference on Learning Representations (ICLR) 2026, one of the most prestigious gatherings in machine learning. The findings were stark: 21% of peer reviews were flagged as fully AI-generated. More than half of all reviews showed at least some signs of AI involvement. And 199 of the 19,490 submitted manuscripts — about 1% — were themselves found to be entirely machine-written.
One in five peer reviews at a leading AI conference was written entirely by AI. The humans, apparently, had decided that was good enough.
To understand why this matters, it helps to understand what peer review is supposed to do. When a researcher submits a paper to a conference or journal, anonymous experts in the relevant field evaluate the work — checking methodology, identifying errors, assessing significance, and recommending acceptance or rejection. It is the primary mechanism by which science maintains its quality standards. It is, in theory, the part of the process that a machine cannot fake.
ICLR 2026 proved that assumption wrong.
Why This Is Bigger Than One Conference
The ICLR scandal is not an isolated incident. Earlier analyses had estimated that up to 17% of reviews at major AI conferences in 2023 and 2024 involved some form of AI assistance. The ICLR 2026 data represent an escalation: from “some AI help” to one in five reviews written entirely by a language model, with a human barely in the loop — if at all.
The broader implication is uncomfortable: if this was happening at ICLR, what was happening at NeurIPS, ICML, AAAI, and dozens of other conferences operating under the same overload conditions and the same flimsy honor codes? Nobody was looking until Neubig got suspicious about one review of one paper.
Academic peer review was already strained before AI arrived. The number of papers submitted to major conferences has exploded in recent years, while the pool of qualified reviewers has grown far more slowly. Researchers were already doing peer reviews for free, under time pressure, as a kind of professional civic duty. The temptation to offload that duty to a language model was always going to be significant. What the ICLR data revealed is just how many people had quietly given in to it.
The Deeper Problem: Credential Without Capability
There is a parallel here that should not be missed. The AI peer review scandal is a concentrated version of the same dynamic playing out in classrooms across the country — and increasingly showing up in labor market data.
In both cases, AI is being used to satisfy a formal requirement while bypassing the intellectual work that requirement was designed to produce. A student who uses AI to write their paper has a grade. A reviewer who uses AI to evaluate a manuscript has submitted a review. In neither case has the underlying intellectual process actually taken place.
The Trump administration’s new gainful employment rule — which requires college programs to demonstrate financial return on investment or lose access to federal aid — is attempting to measure a related symptom. When nearly a quarter of bachelor’s programs and 43% of master’s programs fail to produce graduates who can justify their debt load, part of what those figures reflect is an education system that has been handing out credentials without ensuring the formation those credentials are supposed to represent.
AI-generated peer reviews are the academic establishment’s version of the same problem. A credential — the peer-reviewed publication — is being issued without the quality-assurance process that credential was supposed to certify. The publication looks legitimate. The science may not be.
What Genuine Expertise Actually Looks Like
The researchers who identified the ICLR problem — Neubig, the Pangram Labs team, the dozens of academics who raised flags on social media — did so because they recognized something was missing. The reviews lacked the texture of genuine expert engagement: the specific knowledge of the field, the ability to identify what was actually novel, the judgment to distinguish a methodological flaw from a deliberate design choice. Those are capabilities that come from years of immersion in a discipline, not from querying a language model.
That distinction matters as much for students as it does for scientists. The labor market that graduates are entering is increasingly capable of detecting the difference between credential and capability — even when a transcript or a CV cannot. Stanford’s research on AI and labor displacement has consistently found that workers whose value came from codified, rule-based knowledge tasks are the ones most at risk. Workers who bring genuine judgment, tacit expertise, and the ability to evaluate and direct AI output are the ones holding their ground.
A peer reviewer who uses AI to write their reviews is, in a precise sense, replacing their own expertise with a tool that cannot replicate it. A student who uses AI to complete their assignments is doing the same thing. In both cases, the credential survives. The capability does not.
The Bottom Line
The ICLR 2026 scandal is not primarily a technology story. It is a story about what happens when the systems designed to certify intellectual quality are systematically hollowed out by the tool whose impact they are supposed to evaluate.
The machines, as one observer noted, have learned to evaluate science. They just have not learned to do it well. And the humans responsible for building and studying artificial intelligence — the researchers best positioned to understand what the technology can and cannot do — turned out to be among the first to decide that was good enough.
The question that follows is not whether AI should be used in academic work. It is whether the humans using it are developing genuine expertise alongside the tool — or simply outsourcing the intellectual work that expertise requires.
That question does not have a technical answer. It has an educational one.