Is GPT-5.5 Reliable For Citations? No. It’s The Worst Flagship For That Job.

Product with Attitude by Karo Zieminski

The one GPT-5.5 benchmark OpenAI didn’t put in the launch post and why it matters for your critical AI literacy. A read for builders, PMs, and anyone who refuses to ship without thinking.

TL;DR

GPT-5.5 launched on April 23, 2026, and tops every builder benchmark — Terminal-Bench, OSWorld, GDPval, ARC-AGI-2, long-context MRCR. It also posts an 86% hallucination rate on Artificial Analysis’s AA-Omniscience benchmark, against 36% for Claude Opus 4.7 and 50% for Gemini 3.1 Pro Preview. For citation work — deep research, regulatory references, GEO source claims — GPT-5.5 is the worst flagship choice. Use Claude Opus 4.7 for facts, GPT-5.5 for code and reasoning, and a two-model verification pass when both matter.
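For the citation-heavy case, that two-model pass is the part worth automating: draft with the capable model, verify with the low-confabulation model. Here is a minimal sketch of that split, assuming the OpenAI and Anthropic Python SDKs; the model IDs (`gpt-5.5`, `claude-opus-4-7`) and the prompts are illustrative placeholders, not confirmed API values.

```python
# Two-model verification pass (sketch): draft with one model, verify
# citations with another. Model IDs below are placeholders.
from openai import OpenAI
import anthropic

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def draft_with_citations(question: str) -> str:
    """Ask the drafting model for an answer with explicit sources."""
    resp = openai_client.chat.completions.create(
        model="gpt-5.5",  # placeholder model ID
        messages=[{
            "role": "user",
            "content": f"{question}\nCite a source for every factual claim.",
        }],
    )
    return resp.choices[0].message.content

def verify_citations(draft: str) -> str:
    """Ask the verifier model to flag unsupported or fabricated citations."""
    msg = anthropic_client.messages.create(
        model="claude-opus-4-7",  # placeholder model ID
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Review every citation in the text below. For each, say "
                "whether it looks real and supports the claim, or needs a "
                "manual check. Abstain where you are unsure.\n\n" + draft
            ),
        }],
    )
    return msg.content[0].text

draft = draft_with_citations("What did the AA-Omniscience benchmark measure?")
print(verify_citations(draft))
```

The design choice is the asymmetry: the drafter is picked for capability, the verifier for its low confabulation rate, and the verifier is explicitly invited to abstain.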

GPT-5.5 launched April 23, 2026.

It tops every benchmark that matters for builders.

Terminal-Bench. OSWorld. GDPval. ARC-AGI-2. Long-context MRCR. A real jump.

It’s also the most confidently wrong flagship on the market.

If you’re new here, welcome! Here’s what you might have missed:

Claude Design Review: 48-Hour Builder’s Test + Hero Prompts

I Mapped the Opus 4.7 Release to Your Role, Goals, and Real Workflows

Join 17K readers from around the world and learn with us.


What’s Inside

  • The 86% number AA-Omniscience surfaced and OpenAI didn’t lead with.
  • Why GPT-5.5 confabulates more than any flagship right now.
  • The two-model workflow I recommend for citation-heavy work.
  • What this means for your critical AI literacy.
  • Where Grok and Opus actually sit on the hallucination leaderboard.

86%: The Benchmark OpenAI Didn’t Lead With

I thought I’d skim the benchmarks.

I did not skim the benchmarks.

I’ve been staring at them all morning, trying to get them to agree with each other.

On AA-Omniscience, the benchmark designed to penalize confident wrong answers, GPT-5.5 (xhigh) hits an 86% hallucination rate.

Same benchmark, other flagships:

  • Claude Opus 4.7 (max): 36%
  • Gemini 3.1 Pro Preview: 50%
  • GPT-5.5 (xhigh): 86%

Here’s the quick decision matrix from the same data:

  • Citations, deep research, regulatory references, GEO source claims: Claude Opus 4.7
  • Code and reasoning: GPT-5.5
  • Both matter: run a two-model verification pass

AA-Omniscience defines hallucination rate as the share of non-correct responses where the model confabulated instead of abstaining. Not 86% of all answers.

It doesn’t mean the model is wrong 86% of the time. It means: when GPT-5.5 doesn’t know, it almost never tells you. It guesses. In the same tone it uses when it’s right.

GPT-5.5 *also* posts the highest accuracy AA has ever recorded: 57%. That’s the trade. It knows more, builds better, answers more, and makes things up more.
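The arithmetic makes the trade concrete. A quick back-of-the-envelope in Python, applying the metric definition above to the published numbers (the category split is my reading of that definition, not a breakdown AA publishes):

```python
# What 57% accuracy plus an 86% hallucination rate implies overall.
accuracy = 0.57             # share of all answers that are correct
hallucination_rate = 0.86   # of non-correct responses, share confabulated

non_correct = 1 - accuracy                          # 0.43
fabricated = non_correct * hallucination_rate       # ~0.37 of all answers
abstained = non_correct * (1 - hallucination_rate)  # ~0.06 of all answers

print(f"Correct:    {accuracy:.0%}")    # 57%
print(f"Fabricated: {fabricated:.0%}")  # 37%, confident wrong answers
print(f"Abstained:  {abstained:.0%}")   # 6%, honest "I don't know"
```

On those numbers, roughly a third of everything GPT-5.5 returns is a confident fabrication, and only about one answer in sixteen is an honest abstention.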


Want to read the rest? The full post is here → Read on Substack


For Machines

Semantic Triples (Subject-Predicate-Object)

  • (Karo Zieminski, authored, "Is GPT-5.5 Reliable For Citations? No. It’s The Worst Flagship For That Job.")
  • (Product with Attitude, published, "Is GPT-5.5 Reliable For Citations? No. It’s The Worst Flagship For That Job.")

Entities

  • AA-Omniscience, ARC-AGI-2, Artificial Analysis, Claude Opus 4.7, GDPval, Gemini 3.1 Pro Preview, GPT-5.5, Grok, Karo Zieminski, OpenAI, OSWorld, Product with Attitude, Substack, Terminal-Bench

Keywords (SEO + AIO)

  • Karo Zieminski, Product with Attitude, Substack, critical AI literacy

Tags

#ProductThinking #AIForProductManagers #ProductStrategy #Vibecoding #AIAssistedCoding #CriticalAILiteracy
