
Everything You Need To Know About GPT-5.2 In 10 Minutes | In The Loop Episode 43


Published by

Jack Houghton
Anna Kocsis

Published on

December 18, 2025
December 23, 2025

Read time

5 min read

Category

Podcast

Last week we predicted OpenAI would release a new language model codenamed "GPT Garlic." That prediction came true: GPT-5.2 is here, along with an improved image-generation model built on top of it.

This is the final episode of 2025. Instead of hype, let's break down what 5.2 actually means for you as an everyday user and what you should be looking out for.

This is In The Loop with Jack Houghton. I hope you enjoy the show. 

OpenAI's GPT-5.2 release

Earlier in December, Sam Altman issued a Code Red memo, which last week's episode discussed in detail. We covered the real problems OpenAI faces over the next 6–24 months. As predicted, they released GPT Garlic—now called GPT-5.2.

To cut to the chase: they've matched many of Gemini's most significant model updates in benchmark performance and image generation. Whether the press or market feels that yet, I can't say. But from a technical perspective, they've gone beyond what I expected.

GPT-5.2 comes in three versions

  1. Instant is their super-fast model with no thinking mode—just rapid responses. 
  2. Thinking is their standard model with reasoning built in. It pauses, works through the problem, creates a plan, and then responds. 
  3. Pro Extended Thinking is the new tier. Matt Schumer, who had access since November 25th, reported it was thinking for over an hour on some hard problems. Pro is only available on the $200 monthly subscription and frustratingly isn't available via API yet.

There's also a new Reasoning Effort setting within Thinking mode—Standard, High, or Extra High. The higher you set it, the longer it thinks and theoretically the better the output.
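A rough sketch of what selecting a reasoning-effort level might look like programmatically. The model identifier "gpt-5.2" and the mapping from the ChatGPT UI labels to API values are assumptions for illustration only; check OpenAI's API reference for the actual names.

```python
# Hypothetical sketch: picking a reasoning-effort level for a request.
# Model name and effort values are assumptions, not confirmed API details.

# Map the UI labels mentioned above to plausible API-side values.
EFFORT_LEVELS = {"Standard": "medium", "High": "high", "Extra High": "xhigh"}

def build_request(prompt: str, effort: str = "Standard") -> dict:
    """Assemble a chat-completion payload with a reasoning-effort hint."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"Unknown effort level: {effort}")
    return {
        "model": "gpt-5.2",                      # assumed model identifier
        "reasoning_effort": EFFORT_LEVELS[effort],
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("Summarise this contract.", effort="High")
print(payload["reasoning_effort"])  # high
```

The trade-off is the one described above: higher effort means longer waits, so it only pays off on genuinely hard problems.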

From a technical perspective, the context window is now 400,000 tokens compared to GPT-5.1's 128,000. That's a substantial uplift. For those wondering what context windows are: it's the amount of information you can give the model before it gets poor, stupid, and annoying, or just says "Sorry, limit reached, move to a new chat."
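To make the jump concrete, here is a minimal sketch of budgeting a document against the two context windows. The 4-characters-per-token ratio is a crude rule of thumb for English text, not an exact tokenizer; use a real tokenizer library for accurate counts.

```python
# Rough context-window budgeting, assuming ~4 characters per token.

CONTEXT_WINDOW = 400_000   # GPT-5.2, per the episode
OLD_WINDOW = 128_000       # GPT-5.1

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: roughly 4 characters per English token."""
    return max(1, len(text) // 4)

def fits(text: str, window: int = CONTEXT_WINDOW, reserve: int = 8_000) -> bool:
    """Check the text fits, leaving 'reserve' headroom for the response."""
    return estimate_tokens(text) + reserve <= window

doc = "word " * 100_000  # ~500,000 characters, ~125,000 tokens
print(fits(doc, OLD_WINDOW))  # False: would overflow GPT-5.1
print(fits(doc))              # True: fits comfortably in 400k
```

In other words, a document that would have forced "move to a new chat" under 5.1 now fits with room to spare.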

There's an auto setting that's supposed to be better at choosing between Instant and Extended Thinking. I'd recommend ignoring it. From what I've read and from my own limited testing (I recently cancelled my ChatGPT license), the model often thinks for only a couple of seconds and produces a poor or wrong answer. You'll need Thinking for most professional work.

How GPT-5.2 is better: what has been improved

With every new model release, of course, the most important questions are what has actually improved and what users say about it.

Spreadsheets and documents

Let's talk spreadsheets and presentations—areas where OpenAI clearly put marketing energy and made real improvements.

Simon Willison said this is the first time ChatGPT has created spreadsheets and presentations that are actually presentable. The YouTube reviewer Skill Leap gave it a web link and asked for a full slideshow. It took 28 minutes, but the output was really impressive—good layouts, information pulled correctly, professional-looking slides. His words: "shockingly good compared to 5.1."

Another tester fed 10,000 rows of spreadsheet data into it and told it to create a PowerPoint. It made an excellent set of slides.

For those doing this work constantly, this is music to your ears. However, there are fantastic tools like Gamma that do much the same thing.

OpenAI's benchmark for this is called GDPval: essentially, well-specified knowledge-work tasks across 44 occupations. They claim 5.2 in Thinking mode beats or ties human experts 70.9% of the time, up from 38.8%.

"Well-specified" is doing a lot of work in that sentence. It means the model gets handed everything up front—super clear instructions, all relevant context, and defined success criteria. Real professional work isn't like that. You often have to figure out what information you need, go find it, make judgment calls, craft good prompts.

That benchmark covers well-specified knowledge work with perfect prompts. Most people aren't giving such well-thought-out, structured prompts. 70.9% doesn't mean GPT-5.2 can suddenly do 71% of a person's job. It means for tasks where everything is perfectly articulated and handed to the model on a plate, it performs at expert level most of the time.

Code generation improvements

As I said last week, they're not trying to win in the code arena. That said, they've made fantastic leaps in coding. Maybe they were working on a better coding model, realized they'd made a better model generally, and released it—because most of their marketing focuses on professional work.

On SWE-bench Pro, which tests software engineering across four programming languages, 5.2 Thinking scored 55.6%, a new state of the art. On the related SWE-bench Verified benchmark, it hit 80%, essentially matching Claude's best model at 80.9%.

Vision and long context

Vision capabilities have improved significantly. On chart understanding from scientific papers, accuracy jumped from 80% to 88%. On user interface understanding—that agent mode of reading your screen and making clicks—it jumped from 64% to 86%. Error rates have been cut in half.

On context windows, there's been massive improvement. With 5.1, accuracy started degrading as the amount of information you gave it grew—around 90% at 8,000 tokens, dropping under 50% at 256,000 tokens. With GPT-5.2, accuracy stays at almost 100% across the entire context window, even when nearly maxed out.

This is one of the first models to achieve near-perfect accuracy on the four-needle challenge: essentially recalling four specific pieces of information scattered across 200,000 words.
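The shape of that test is easy to sketch: plant a handful of facts in a long haystack of filler text, then score whether a response recalls all of them. The needles, filler, and scoring below are illustrative stand-ins; the model call itself is stubbed out, so only the harness logic is shown.

```python
# Minimal sketch of a "four-needle" recall harness (model call stubbed).
import random

NEEDLES = [
    "The access code is 7341.",
    "The meeting is in Oslo.",
    "The budget cap is $90k.",
    "The deadline is March 3.",
]

def build_haystack(filler_sentences: int = 5000, seed: int = 0) -> str:
    """Scatter the four needles at random positions among filler text."""
    rng = random.Random(seed)
    sentences = ["This sentence is uninformative filler."] * filler_sentences
    for needle in NEEDLES:
        sentences.insert(rng.randrange(len(sentences)), needle)
    return " ".join(sentences)

def score_recall(answer: str) -> float:
    """Fraction of the planted facts that appear in the answer."""
    return sum(n in answer for n in NEEDLES) / len(NEEDLES)

haystack = build_haystack()
perfect = " ".join(NEEDLES)  # stand-in for a model that recalls all four
print(score_recall(perfect))                    # 1.0
print(score_recall("The meeting is in Oslo."))  # 0.25
```

Near-perfect accuracy on a test like this is what "usable long context" actually means in practice: the model keeps retrieving specifics no matter where they sit in the window.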

Hallucinations

OpenAI claims they've reduced hallucinations by roughly 30%, from 8.8% in 5.1 to 6.2% in 5.2. Independent benchmarks are more modest, however: Vectara measured an 8.4% hallucination rate for GPT-5.2, which trails DeepSeek at 6.3%.

A massive improvement, but still not a leading model.

What still needs work: speed and writing quality

Speed is still a real problem. Matt Schumer said standard 5.2 Thinking is slow for most questions, even straightforward ones, and that it has changed how he works: quick questions go to Claude Opus, deep reasoning goes to 5.2 Pro. Quite interesting, because it used to be the other way round for me.

For those who do a lot of writing, quality still lags behind Claude.

Dan Shipper's publication Every ran systematic tests and found Claude Opus 4.5 scored 80% on writing quality versus 74% for GPT-5.2. Many testers also noticed big personality changes. Allie Miller, another prominent commentator in the AI space, said a simple question turned into 58 bullet points and numbered lists. Many people have compared 5.2 to a brilliant freelancer who over-formats everything.

As you know, I've said repeatedly that benchmarks aren't the be-all and end-all. Comparing models is getting much harder, and the gains are increasingly incremental. It's important to test this yourself and see whether you like the improvements.


Closing thoughts

OpenAI's messaging on this release has been very focused, which is unusual. Often they've been "we're the best at everything for everyone" with scattered messaging. This time, every executive in every interview and media appearance focused on professional work and economically valuable tasks.

They're clearly not trying to claim AGI breakthroughs. They're trying to win at professional work and going for the enterprise market.

The improvements are real. Structured outputs make this the most capable model OpenAI has ever produced. If your tasks involve slides and spreadsheets, 5.2 is a serious upgrade.

But if you zoom out, you're seeing incremental progress, not massive leaps anymore. Some people still hope for one big flash of inspiration, a single model that conquers it all, but the pattern we're seeing suggests otherwise.

5.2 is a much better tool, but it's not a new era. The fact that OpenAI is marketing better spreadsheets tells us a lot about where we are with AI right now.

That's it for this week and for this year. I hope you found this episode interesting and I look forward to spending 2026 with you. Thank you and see you next year.
