Can AI Spot Fake News? We Tested All 5 Models on the Same Headlines
The most dangerous thing an AI fact-checker can do is pass fake news off as real. That is obvious. The second most dangerous thing is less discussed: flagging real news as fake. The first failure makes misinformation feel trustworthy. The second makes truth feel suspect. Both erode exactly the thing you are trying to protect.
We wanted to know which AI models were actually useful as misinformation filters, which were just confident noise machines, and whether the difference was detectable with a structured test. So we built one.
Twenty headlines. Ten real stories from credible outlets, verified against primary sources. Ten carefully constructed fakes designed to sound plausible: dateline inconsistencies, suspicious statistics, real-sounding but nonexistent studies, recycled narratives dressed in new language. We gave each headline to ChatGPT, Claude, Gemini, Grok, and Perplexity with the same prompt: "Is this headline accurate? Explain your reasoning."
We scored each model on four dimensions: accuracy (did it classify the headline correctly), reasoning quality (did it cite real warning signs or just assert), calibration (did it hedge when uncertain and commit when sure), and failure mode (was it too credulous, too paranoid, or balanced). The results divided the field sharply.
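The accuracy and failure-mode dimensions can be tallied mechanically once every answer has been reduced to a verdict. What follows is a minimal sketch in Python, not the scoring script used for this test. It assumes each headline has a ground-truth label of "real" or "fake" and each model answer has been coded the same way. The function name, the sample data, and the rule that names the failure mode after whichever error type dominates are all illustrative assumptions. As elsewhere in this article, a false positive is a real story called fake and a false negative is a fake story called real.

```python
from collections import Counter

def score_verdicts(truth, verdicts):
    """Tally one model's verdicts against ground truth.

    truth and verdicts are equal-length, non-empty lists of "real" / "fake".
    """
    if not truth or len(truth) != len(verdicts):
        raise ValueError("need equal-length, non-empty inputs")
    counts = Counter()
    for actual, said in zip(truth, verdicts):
        if actual == said:
            counts["correct"] += 1
        elif actual == "real":
            counts["false_positive"] += 1   # real story called fake
        else:
            counts["false_negative"] += 1   # fake story called real
    fp = counts["false_positive"]
    fn = counts["false_negative"]
    # Assumed rule of thumb: the dominant error type names the failure mode.
    if fp > fn:
        mode = "too paranoid"
    elif fn > fp:
        mode = "too credulous"
    else:
        mode = "balanced"
    return {
        "accuracy": counts["correct"] / len(truth),
        "false_positives": fp,
        "false_negatives": fn,
        "failure_mode": mode,
    }

# Hypothetical four-headline example, not data from the test.
truth = ["real", "real", "fake", "fake"]
verdicts = ["real", "fake", "fake", "fake"]
result = score_verdicts(truth, verdicts)
print(result)
```

Reasoning quality and calibration are not covered by this sketch. Both need a human reading of the explanation, which is why they were scored separately.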
How We Built the Test
The real headlines were sourced from Reuters, the Associated Press, BBC News, and The New York Times, covering events from 2022 through early 2026. We deliberately included two stories that sound implausible but are true: Turkey officially changed its international name to Türkiye, and Iceland elected the world's youngest head of government. A good fact-checker should be able to verify unusual-but-real stories rather than reflexively flagging them as suspicious.
The fake headlines were constructed using the most common misinformation techniques: attribution to real institutions with invented findings (a Harvard study showing 73% of Instagram users have clinical anxiety), plausible regulatory announcements with no basis in fact (the EU mandating AI watermarking on all generated content by a specific deadline), statistical claims with no sourcing, and health scare headlines mixing real scientific debate with fabricated conclusions.
One headline was deliberately designed to expose overconfidence: a story claiming a room-temperature superconductor had been scientifically confirmed. This story is historically real: a 2023 paper in Nature made exactly this claim, but it was later retracted after replication failures. Correctly handling this headline required not just pattern recognition but knowledge of how scientific consensus evolves. Most models failed it in one direction or the other.
Model by Model: What Each One Actually Did
ChatGPT correctly classified 16 of 20 headlines for an accuracy rate of 80%. Its reasoning was consistent in structure: it identified the claim, named what would need to be true for it to be accurate, and noted whether the sourcing was verifiable. For straightforward fakes, it performed well. It correctly flagged the invented Harvard anxiety study, noting that no such study appeared in any peer-reviewed database, and caught a fabricated CDC report by pointing to statistical inconsistencies.
Where ChatGPT struggled was at the edges. It called the Iceland election story potentially misleading because the premise seemed unlikely, which is exactly the wrong instinct for a fact-checker. It also passed a fake headline about an EU AI regulation without noting that no such directive existed in the official EU legislative register. Two false positives. Two false negatives. Solid enough to be useful, unreliable enough to need checking.
Claude scored 18 out of 20 and produced the best reasoning of any model in the test. For a fake headline linking a common artificial sweetener to a 40% increase in stroke risk, Claude did something none of the other models did: it separated what is scientifically disputed from what is fabricated. It noted that there is genuine scientific debate about the sweetener in question and that some observational studies have raised concerns, but that the specific statistic in the headline did not correspond to any published study. That distinction matters enormously. The goal of sophisticated misinformation is not to make things up from nothing; it is to exaggerate or distort real uncertainty. Claude caught the distortion while accurately representing the underlying debate.
Claude also handled the superconductor story correctly, noting that the original claim was published in a peer-reviewed journal but had since been retracted following failed replications. It hedged appropriately: "The original headline was technically accurate when published, but the finding has not been independently confirmed and the paper was later retracted. Treat with significant caution." That is exactly the right answer.
Gemini scored 15 out of 20, but its error distribution is what makes it problematic. It generated four false positives, meaning it called four real stories fake. It flagged the Türkiye name-change story as a possible satire piece. It expressed skepticism about Apple reaching a $3 trillion market cap without first checking the date against market data. When an AI fact-checker cries wolf on real stories, it creates a different kind of harm: it trains readers to dismiss corrections. A model that calls too many things fake teaches you to stop listening.
Grok scored 14 out of 20 and had the highest false negative count: five fake headlines it assessed as probably accurate. This is the most dangerous failure mode. Grok was confident and articulate in explaining why the fake headlines sounded plausible. For the invented EU AI watermarking regulation, it cited "momentum in European AI governance" as a reason the headline seemed credible. That reasoning sounds thoughtful. It is also exactly how misinformation survives: by dressing itself in context that is real even when the specific claim is not.
Perplexity scored 17 out of 20. It behaved differently from the other four in one meaningful way: it consistently showed its work. For each headline, it pulled citations and linked to sources, which meant its reasoning was auditable in a way the others were not. When Perplexity said a story checked out, you could follow the source chain. When it said something seemed off, it pointed to what was missing rather than asserting doubt from nothing. Its three misses were minor, and its citation behavior makes it unusually trustworthy as a starting point.
The Scorecard
| Model | Accuracy | False Positives | False Negatives | Reasoning | Calibration |
|---|---|---|---|---|---|
| ChatGPT | 80% (16/20) | 2 | 2 | Structured, solid | B |
| Claude | 90% (18/20) | 1 | 1 | Best in class | A |
| Gemini | 75% (15/20) | 4 | 1 | Inconsistent | C |
| Grok | 70% (14/20) | 1 | 5 | Confident, wrong | D |
| Perplexity | 85% (17/20) | 1 | 2 | Auditable, cited | A- |
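Each row of the scorecard should satisfy a simple identity: correct answers plus false positives plus false negatives equals the 20 headlines tested. A short Python sketch of that check follows. The numbers are copied from the table above, while the dictionary layout and the helper name `check_row` are illustrative.

```python
TOTAL = 20  # headlines in the test

# (correct, false_positives, false_negatives) copied from the scorecard.
scorecard = {
    "ChatGPT": (16, 2, 2),
    "Claude": (18, 1, 1),
    "Gemini": (15, 4, 1),
    "Grok": (14, 1, 5),
    "Perplexity": (17, 1, 2),
}

def check_row(correct, fp, fn, total=TOTAL):
    """Return (is_consistent, accuracy_percent) for one scorecard row."""
    consistent = correct + fp + fn == total
    return consistent, round(100 * correct / total)

report = {model: check_row(*row) for model, row in scorecard.items()}
for model, (ok, pct) in report.items():
    print(f"{model}: {pct}% consistent={ok}")
```

Every row passes, and the computed percentages match the Accuracy column.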
Claude was the clear winner. Not because it scored significantly higher than Perplexity on raw accuracy, but because of how it handled the ambiguous cases. On headlines where the truth was genuinely complicated, it demonstrated something the other models mostly lacked: the ability to distinguish between "this is false" and "this is contested." That distinction is the entire job of a fact-checker.
Grok was the most dangerous model to rely on for this task. Not because it is poorly designed, but because it failed in the direction that is hardest to detect. A paranoid model flags everything and trains you to ignore it. A credulous model tells you confidently that a fabricated statistic is real, and you share it.
The Deeper Finding: What Kind of Skepticism
Raw accuracy is not the right frame for evaluating an AI fact-checker. What you want to know is: what kind of mistakes does this model make, and in which direction does it fail?
Gemini's overcorrection is one failure mode. It expressed skepticism about the Iceland election story, a real event, and in doing so demonstrated the problem with pattern-matching skepticism: it treats the unusual as suspicious, which is a useful heuristic that breaks down exactly when you need it most, because misinformation often hides in the ordinary while truth sometimes looks strange.
Grok's credulity is the other failure mode. It was consistently articulate in explaining why fake stories seemed plausible. This is a subtle but important problem: reasoning quality and accuracy are not the same thing. A model can construct a well-organized argument for why a false thing is true. Fluency is not evidence.
Claude and Perplexity both showed calibrated uncertainty. When they were unsure, they said so. When they had specific reasons for concern, they named them. When a story was verifiable, they pointed to the source chain rather than just asserting confidence. Calibration, knowing what you know and what you do not know, is the underrated dimension in AI fact-checking, and it is the one most evaluation frameworks miss.
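One rough way to put a number on calibration is to compare how sure a model says it is with how often it is right. The Python sketch below assumes each answer comes with a stated confidence between 0 and 1. That is an assumption: the prompt used in this test did not ask for one, so in practice the value would have to be requested explicitly or coded by a reader. The function name and the sample numbers are illustrative, not measurements from the test.

```python
def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus observed accuracy.

    confidences: floats in [0, 1], one per answer.
    correct: booleans, True where the verdict matched ground truth.
    A positive result means the model sounded surer than it was right.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy

# Illustrative numbers only: 20 answers, each stated at 0.95 confidence,
# of which 14 were correct.
confidences = [0.95] * 20
correct = [True] * 14 + [False] * 6
gap = overconfidence_gap(confidences, correct)
print(f"overconfidence gap: {gap:.2f}")
```

A gap near zero suggests the model's confidence tracks its hit rate. A large positive gap is the pattern to worry about, because the reader sees the confidence and not the score.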
Why One Model Is Not Enough
In the test, the most reliable signal was disagreement between models. When Claude and Perplexity flagged a headline and Grok waved it through, that divergence was almost always a sign that the headline deserved closer scrutiny. When four models agreed a story checked out and one expressed skepticism, the specific objection was worth reading carefully even if the consensus was ultimately correct.
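The disagreement signal is easy to automate once each model's answer has been reduced to a verdict. The sketch below is a minimal Python illustration, assuming verdicts have already been coded as "real" or "fake". The function names and the example verdicts are hypothetical and do not describe how any particular tool works.

```python
def find_outliers(verdicts):
    """Split one headline's verdicts into the majority view and dissenters.

    verdicts maps model name -> "real" or "fake".
    Returns (majority_verdict, dissenters); majority_verdict is None on a tie,
    in which case every model is listed.
    """
    real = sorted(m for m, v in verdicts.items() if v == "real")
    fake = sorted(m for m, v in verdicts.items() if v == "fake")
    if len(real) == len(fake):
        return None, sorted(verdicts)
    if len(real) > len(fake):
        return "real", fake
    return "fake", real

def needs_review(verdicts):
    """Treat any disagreement at all as a cue to verify by hand."""
    return len(set(verdicts.values())) > 1

# Hypothetical verdicts for a single headline, not results from the test.
example = {
    "ChatGPT": "fake",
    "Claude": "fake",
    "Gemini": "fake",
    "Grok": "real",
    "Perplexity": "fake",
}
majority, dissenters = find_outliers(example)
print(majority, dissenters, needs_review(example))
```

The sketch deliberately does not treat the majority as the answer. It only surfaces who disagreed, so the dissenting explanation can be read before anything is shared.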
This is the same principle behind editorial fact-checking in newsrooms: a second reader is valuable not because they are always right but because disagreement creates a pause. The pause is where the error gets caught.
Talkory surfaces exactly this friction. When you run a headline or a story claim through all five models at once, you see the consensus and you see the outliers in a single view. One model flagging something that four others pass is not proof of misinformation, but it is a signal to verify before you share. For journalists, communications teams, researchers, and anyone whose work depends on not amplifying false information, that signal is the most practical value Talkory offers in this context: not a definitive verdict, but a credibility second opinion.
The four models that passed the fake EU AI regulation story all had slightly different reasoning. None of them flagged the specific absence in the EU legislative register. A fifth model that noticed the gap would have surfaced the disagreement immediately. That is exactly the friction that should make you pause before you publish, share, or act.
Our Verdict
Claude performed best overall, driven not by raw score but by the quality and intellectual honesty of its reasoning on contested cases. Perplexity was the most auditable and closest behind. ChatGPT was competent and consistent. Gemini was too quick to call real things suspicious. Grok was too quick to call suspicious things real.
No single model is reliable enough to use as your only filter. The most dangerous outcome in this test was not a model scoring 70%; it was a model scoring 70% while sounding 95% confident. Confidence without accuracy is the failure mode that actually spreads misinformation, because readers cannot see the score behind the assertion.
The practical lesson: when a story matters, run it through more than one model. Look for disagreement. Treat consensus as a starting point and outliers as questions worth asking. That is what rigorous fact-checking has always looked like. AI makes it faster. It does not make it unnecessary.