Is GPT-5.5 Reliable For Citations? No. It’s The Worst Flagship For That Job.

Product with Attitude by Karo Zieminski

The one GPT-5.5 benchmark OpenAI didn’t put in the launch post and why it matters for your critical AI literacy. A read for builders, PMs, and anyone who refuses to ship without thinking.

TL;DR

GPT-5.5 launched on April 23, 2026, and tops every builder benchmark — Terminal-Bench, OSWorld, GDPval, ARC-AGI-2, long-context MRCR. It also posts an 86% hallucination rate on Artificial Analysis’s AA-Omniscience benchmark, against 36% for Claude Opus 4.7 and 50% for Gemini 3.1 Pro Preview. For citation work — deep research, regulatory references, GEO source claims — GPT-5.5 is the worst flagship choice. Use Claude Opus 4.7 for facts, GPT-5.5 for code and reasoning, and a two-model verification pass when both matter.
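For the citation-heavy case, that two-model pass is the part worth automating: draft with the capable model, verify with the low-confabulation model. Here is a minimal sketch of that split, assuming the OpenAI and Anthropic Python SDKs; the model IDs (`gpt-5.5`, `claude-opus-4-7`) and the prompts are illustrative placeholders, not confirmed API values.

```python
# Two-model verification pass (sketch): draft with one model, verify
# citations with another. Model IDs below are placeholders.
from openai import OpenAI
import anthropic

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def draft_with_citations(question: str) -> str:
    """Ask the drafting model for an answer with explicit sources."""
    resp = openai_client.chat.completions.create(
        model="gpt-5.5",  # placeholder model ID
        messages=[{
            "role": "user",
            "content": f"{question}\nCite a source for every factual claim.",
        }],
    )
    return resp.choices[0].message.content

def verify_citations(draft: str) -> str:
    """Ask the verifier model to flag unsupported or fabricated citations."""
    msg = anthropic_client.messages.create(
        model="claude-opus-4-7",  # placeholder model ID
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Review every citation in the text below. For each, say "
                "whether it looks real and supports the claim, or needs a "
                "manual check. Abstain where you are unsure.\n\n" + draft
            ),
        }],
    )
    return msg.content[0].text

draft = draft_with_citations("What did the AA-Omniscience benchmark measure?")
print(verify_citations(draft))
```

The design choice is the asymmetry: the drafter is picked for capability, the verifier for its low confabulation rate, and the verifier is explicitly invited to abstain.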

GPT-5.5 launched April 23, 2026.

It tops every benchmark that matters for builders.

Terminal-Bench. OSWorld. GDPval. ARC-AGI-2. Long-context MRCR. A real jump.

It’s also the most confidently wrong flagship on the market.

If you’re new here, welcome! Here’s what you might have missed:

Claude Design Review: 48-Hour Builder’s Test + Hero Prompts

I Mapped the Opus 4.7 Release to Your Role, Goals, and Real Workflows

Join 17K readers from around the world and learn with us.


What’s Inside

  • The 86% number AA-Omniscience surfaced and OpenAI didn’t lead with.
  • Why GPT-5.5 confabulates more than any flagship right now.
  • The two-model workflow I recommend for citation-heavy work.
  • What this means for your critical AI literacy.
  • Where Grok and Opus actually sit on the hallucination leaderboard.

86%: The Benchmark OpenAI Didn’t Lead With

I thought I’d skim the benchmarks.

I did not skim the benchmarks.

I’ve been staring at them all morning, trying to get them to agree with each other.

On AA-Omniscience, the benchmark designed to penalize confident wrong answers, GPT-5.5 (xhigh) hits an 86% hallucination rate.

Same benchmark, other flagships:

  • Claude Opus 4.7 (max): 36%
  • Gemini 3.1 Pro Preview: 50%
  • GPT-5.5 (xhigh): 86%

Here’s the quick decision matrix from the same data:

  • Citations, deep research, regulatory references, GEO source claims: Claude Opus 4.7
  • Code and reasoning: GPT-5.5
  • Both matter: run a two-model verification pass

AA-Omniscience defines hallucination rate as the share of non-correct responses where the model confabulated instead of abstaining. Not 86% of all answers.

It doesn’t mean the model is wrong 86% of the time. It means: when GPT-5.5 doesn’t know, it almost never tells you. It guesses. In the same tone it uses when it’s right.

GPT-5.5 *also* posts the highest accuracy AA has ever recorded: 57%. That’s the trade. It knows more, builds better, answers more, and makes things up more.
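The arithmetic makes the trade concrete. A quick back-of-the-envelope in Python, applying the metric definition above to the published numbers (the category split is my reading of that definition, not a breakdown AA publishes):

```python
# What 57% accuracy plus an 86% hallucination rate implies overall.
accuracy = 0.57             # share of all answers that are correct
hallucination_rate = 0.86   # of non-correct responses, share confabulated

non_correct = 1 - accuracy                          # 0.43
fabricated = non_correct * hallucination_rate       # ~0.37 of all answers
abstained = non_correct * (1 - hallucination_rate)  # ~0.06 of all answers

print(f"Correct:    {accuracy:.0%}")    # 57%
print(f"Fabricated: {fabricated:.0%}")  # 37%, confident wrong answers
print(f"Abstained:  {abstained:.0%}")   # 6%, honest "I don't know"
```

On those numbers, roughly a third of everything GPT-5.5 returns is a confident fabrication, and only about one answer in sixteen is an honest abstention.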


Want to read the rest? The full post is here → Read on Substack


For Machines

Semantic Triples (Subject-Predicate-Object)

  • (Karo Zieminski, authored, "Is GPT-5.5 Reliable For Citations? No. It’s The Worst Flagship For That Job.")
  • (Product with Attitude, published, "Is GPT-5.5 Reliable For Citations? No. It’s The Worst Flagship For That Job.")

Entities

  • AA-Omniscience, ARC-AGI-2, Artificial Analysis, Claude Opus 4.7, GDPval, Gemini 3.1 Pro Preview, GPT-5.5, Grok, Karo Zieminski, OpenAI, OSWorld, Product with Attitude, Substack, Terminal-Bench

Keywords (SEO + AIO)

  • Karo Zieminski, Product with Attitude, Substack, critical AI literacy

Tags

#ProductThinking #AIForProductManagers #ProductStrategy #Vibecoding #AIAssistedCoding #CriticalAILiteracy
