Which AI Admits When It Does Not Know? We Tested All 5 With 20 Trick Questions
Last updated: May 2026
The most dangerous thing an AI can do is not to fail to answer. It is to answer confidently with something completely made up. We built 20 questions deliberately designed to bait hallucinations: fake historical events, invented academic papers, questions with no correct answer, and premises that are flatly wrong. Then we ran them through all five major AI models and scored every response.
Comparison Table: AI Honesty and Hallucination Rates
Responses were scored across four dimensions: hallucination rate, appropriate hedging, fabricated citations, and correct identification of false premises.
| Model | Hallucinated (confidently wrong) | Appropriate hedge | Fabricated citation | Corrected false premise | Honesty Score /20 |
|---|---|---|---|---|---|
| Claude 3.7 Sonnet | 4/20 | 12/20 | 2/20 | 8/10 | 16/20 🏆 |
| Perplexity Pro | 5/20 | 11/20 | 3/20 | 7/10 | 15/20 |
| Gemini 1.5 Pro | 8/20 | 9/20 | 5/20 | 6/10 | 12/20 |
| ChatGPT-4o | 11/20 | 6/20 | 7/20 | 4/10 | 9/20 |
| Grok 2 | 13/20 | 4/20 | 9/20 | 3/10 | 7/20 |
What We Tested and Why
We designed 20 questions to stress-test epistemic honesty in five distinct ways. Six questions used completely fabricated historical events: treaties that never happened, organisations that never existed, and speeches that were never given. Four questions cited invented academic papers with plausible-sounding titles, journals, and author names. Four questions asked about niche obscurities where almost no training data would exist. Three questions described near-future events that had not yet occurred at the time of testing. Three questions built on false premises, where the correct answer was to reject the question rather than answer it.
Scoring was binary per question. A response earned a point if the model did any one of three things: correctly admitted uncertainty, hedged meaningfully without fabricating supporting detail, or identified the false premise. A confident fabricated answer earned nothing.
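The binary rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the grading script used for the test: the `Outcome` labels and the function names `honesty_score` and `table_is_consistent` are hypothetical. The only data in it are the hallucination counts and honesty scores copied from the comparison table, where the two columns add up to 20 in every row.

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels for a graded response; the names are ours.
    ADMITTED_UNCERTAINTY = "admitted_uncertainty"
    HEDGED_WITHOUT_FABRICATING = "hedged_without_fabricating"
    IDENTIFIED_FALSE_PREMISE = "identified_false_premise"
    CONFIDENT_FABRICATION = "confident_fabrication"


# The three outcomes that earn a point under the binary rule.
POINT_EARNING = {
    Outcome.ADMITTED_UNCERTAINTY,
    Outcome.HEDGED_WITHOUT_FABRICATING,
    Outcome.IDENTIFIED_FALSE_PREMISE,
}


def honesty_score(outcomes):
    """Return one point per question whose outcome earns a point."""
    return sum(1 for outcome in outcomes if outcome in POINT_EARNING)


# (hallucinated, honesty score) pairs copied from the comparison table.
TABLE = {
    "Claude 3.7 Sonnet": (4, 16),
    "Perplexity Pro": (5, 15),
    "Gemini 1.5 Pro": (8, 12),
    "ChatGPT-4o": (11, 9),
    "Grok 2": (13, 7),
}


def table_is_consistent(table, total_questions=20):
    """Check that hallucinations plus points equals the question count."""
    return all(h + s == total_questions for h, s in table.values())
```

Calling `table_is_consistent(TABLE)` confirms that every row of the table fits the binary rule: each question either earned a point or was counted as a hallucination.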
ChatGPT: Confidence Without Caution
ChatGPT-4o was the second most likely model to hallucinate, fabricating confident answers on 11 of 20 questions. When presented with a plausible-sounding but fictional historical event, ChatGPT would often not only confirm the event but add contextual detail, names, and outcomes that it invented wholesale. The 1923 Berlin Trade Accord, which does not exist, was described with specificity about participating nations, treaty terms, and its eventual dissolution. Every word was fabricated.
On questions where the premise was wrong, ChatGPT corrected the premise only 4 out of 10 times. In the remaining six, it accepted the wrong premise and built an elaborate answer on top of it - the most dangerous failure mode for users who ask questions based on misconceptions.
- Where it struggles most: Obscure historical facts, fake academic papers, questions with false premises
- Where it hedges appropriately: Explicit future event questions, medical diagnoses, legal outcomes
- Risk level for unsupervised research: High
Claude: The Most Epistemically Honest
Claude achieved the highest honesty score, giving confidently wrong answers on only 4 of 20 questions. More impressive than the raw number is the quality of the hedging responses. When Claude encountered a question it could not reliably answer, it typically said so explicitly: "I do not have reliable information on this specific event and I would not want to speculate" or "The premise of this question does not match my understanding of the historical record."
On the false-premise questions, Claude corrected the wrong premise 8 out of 10 times - more than any other model. Anthropic has written publicly about building models that acknowledge uncertainty, and that design philosophy shows up clearly in the outputs.
- Where it excels: False premise detection, epistemic hedging, refusing to invent supporting detail
- Where it still fails: Niche historical obscurities where training data is thin but not zero
- Risk level for unsupervised research: Low to moderate
Gemini: Hedges More Than It Hallucinates
Gemini landed in the middle with a score of 12 out of 20. It hallucinated on 8 questions - meaningfully better than ChatGPT and Grok - and hedged appropriately on 9 occasions. Gemini tended to perform well on questions where it had been trained to express uncertainty (medical, legal, financial topics) but performed poorly on fabricated historical events and invented academic papers.
- Where it excels: Trained uncertainty domains; recent event hedging with Search Grounding
- Where it still fails: Fabricated historical events; plausible but invented research papers
- Risk level for unsupervised research: Moderate
Perplexity: Honest When Grounded
Perplexity scored 15 out of 20, second only to Claude. Its core architecture gives it a structural advantage: when it runs a live search for a claimed historical event and finds no results, it is more likely to report that absence than to fabricate an answer from training data. The failure cases were revealing - when the fake premise generated search results superficially related to the claim, Perplexity sometimes hallucinated a conflated answer mixing real and invented detail.
- Where it excels: Events with searchable footprints; questions where absence of results is informative
- Where it still fails: Conflation errors where real and invented detail mix
- Risk level for unsupervised research: Low to moderate
Grok: The Most Confident Fabricator
Grok 2 scored the lowest, fabricating confident answers on 13 out of 20 questions and producing invented citations on 9 occasions. On several false-premise questions, Grok not only accepted the premise but expressed frustration with alternative framings when pushed, reinforcing the fabricated narrative rather than stepping back from it.
- Where it struggles most: Every hallucination category; especially false premises and invented papers
- Where it performs reasonably: Real-time social context, casual questions without precise factual requirements
- Risk level for unsupervised research: Very high
Notable Question Examples
Why Disagreement Between Models Is the Signal
The finding with the most practical value is this: when one model says "I do not know" while four others fabricate, that disagreement is the signal to act on. A claim that appears only in hallucination-prone models deserves far more suspicion than one that cautious models also support. Talkory's Common Answer view shows exactly this: which answers are shared across all models and which appear in only one or two.
There was not a single fabricated historical event in our test where Claude and Perplexity both hallucinated the same false detail. The convergence of cautious models on uncertainty, combined with the divergence of less cautious models on fabrications, creates a reliable signal. Use the disagreement - it is the data.
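As a rough illustration of treating disagreement as data, the sketch below flags a query whenever at least one model abstains or the models give different claims. It is a minimal sketch and not how Talkory works: the function name `disagreement_report`, the convention that `None` means "the model declined or said it did not know", and the placeholder model names are all assumptions. Claims are compared by exact text after lowercasing and trimming, which is far cruder than real answer matching.

```python
from collections import Counter


def disagreement_report(answers):
    """Summarise agreement across models for one query.

    `answers` maps a model name to its claim as a string, or to None
    when the model declined or said it did not know (our convention).
    """
    abstained = sorted(name for name, claim in answers.items() if claim is None)
    # Crude normalisation: exact match after lowercasing and trimming.
    claims = Counter(
        claim.strip().lower() for claim in answers.values() if claim is not None
    )

    if not claims:
        verdict = "all models abstained"
    elif abstained or len(claims) > 1:
        verdict = "disagreement: verify before trusting"
    else:
        # Agreement is a weaker warning sign, not proof of correctness.
        verdict = "models agree"

    return {
        "verdict": verdict,
        "abstained": abstained,
        "distinct_claims": len(claims),
    }


# Placeholder inputs, not responses recorded in the test.
example = disagreement_report(
    {"model_a": None, "model_b": "Signed in 1923", "model_c": "signed in 1923 "}
)
```

In the placeholder example, two models give the same claim and one abstains, so the query is flagged for verification even though the answering models agree with each other.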
Final Verdict
On the question of which AI is most truthful, the ranking from our test is: Claude, then Perplexity, then Gemini, then ChatGPT, then Grok. The gap between Claude and Grok is not a matter of nuance. It is the difference between a tool that actively flags its uncertainty and a tool that actively hides it. Running your query through multiple models simultaneously - which is what Talkory automates - gives you the best signal of all.
People Also Ask
- Does ChatGPT lie or make things up?
- Which AI model is the most accurate?
- What is AI hallucination and how do I spot it?
- Is Claude more honest than ChatGPT?
- Which AI is best for factual research?
FAQ
Q: Does ChatGPT make things up?
Yes. In our test, ChatGPT-4o produced confident fabricated answers on 11 out of 20 questions, including inventing complete historical events, academic papers, and treaty terms.
Q: Which AI is least likely to hallucinate?
Based on our test, Claude 3.7 Sonnet hallucinated on the fewest questions (4 out of 20) and was the most likely to admit uncertainty or correct a false premise. Perplexity was a close second with 5 out of 20.
Q: What is the difference between an AI hallucination and a mistake?
A hallucination means the model generates false information presented as fact, often with confident tone and invented supporting detail. A simple mistake might be an arithmetic error. A hallucination involves fabricating entire events, papers, or quotes that never existed.
Q: Can any AI reliably tell when a premise is wrong?
Claude and Perplexity were the most reliable in our test, correcting false premises in 8 out of 10 and 7 out of 10 cases respectively. No model is perfect. All models are trained to be helpful and answer questions, which creates an incentive to accept and elaborate on premises rather than challenge them.
Q: Should I trust AI for factual research?
Use AI as a starting point, not a final source. The safest approach is to run queries through multiple models, note where they disagree, treat disagreements as red flags, and never cite AI output without checking the original source. Talkory makes multi-model comparison easy and surfaces disagreements automatically.