Max thinking is wasted on an empty room

[Image: An empty room and a room full of files and notes, the same figure thinking in each.] Same model, two rooms. The thinking only pays off when there is something in the room to dig through.

This is a technical write-up of something I noticed this week. One new thing on the site, since this is the first post to use it: any jargon is marked, so you can hover for a translation, and there is a full glossary at the foot. The piece itself stays technical. The glossary is just there so no single word stops you.

What I was actually doing

Modern models expose a setting for how hard they think. Turn it up and you are running at max effort: the model spends far more time, and far more tokens, working the problem before it replies, instead of answering on instinct.

I ran the newest model at that top setting all week, against a fully loaded context window. The interesting part was not "it is smarter." It was discovering what that extra thinking actually needs in order to be worth anything.

The short version: hard thinking is wasted unless you have given the model something to think with. Effort is a multiplier, and a multiplier on nothing is nothing.
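As a toy illustration of the multiplier idea, here is a sketch in Python. It is not a measurement of anything: the function name and the 0-to-1 scores are made up purely to show the shape of the claim.

```python
def answer_value(effort: float, room: float) -> float:
    """Toy model: usefulness is effort multiplied by the evidence in the room.

    Both inputs are hypothetical scores between 0 and 1, invented for
    illustration. They do not correspond to any real model setting.
    """
    if not (0.0 <= effort <= 1.0 and 0.0 <= room <= 1.0):
        raise ValueError("effort and room must both be between 0 and 1")
    return effort * room


# Max effort in an empty room is still nothing.
print(answer_value(effort=1.0, room=0.0))  # 0.0
# Modest effort in a furnished room beats it.
print(answer_value(effort=0.5, room=0.8))  # 0.4
```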

1. The room matters more than the model

Here is the bit that surprised me. The jump between the last generation and this one did not feel like a smarter model. It felt like the model got better at using the room I gave it.

By "the room" I mean the context: the briefs, the notes, the earlier decisions, the examples I load in before I ask my question. All of that sits in the model's context window, which is just its short-term memory for the task.

Drop a hard-thinking model into an empty room and you get a slow, confident guess. Drop it into a room you have actually furnished and it does something different. It digs.

[Image: A figure surrounded by labelled drawers of briefs, notes and decisions.] The intelligence is not only in the model. It is in the room you build before you ask.

The mechanism underneath is grounding. With a large window loaded, the extra effort does not go into inventing more. It goes into reading what is already in front of it, cross-referencing it, and grounding the reply in it. The weights are fixed and identical for everyone, so they are not the variable you control. What sits in the room is. This is also why the big context window matters more than it looks. A larger room is not about fitting more text. It is more evidence for the same effort to dig through.

So the takeaway is not "buy a better model." It is "build a better room."

2. The slow part is the feature, not a fault

When you turn thinking up, the wait gets noticeably longer. My instinct was to read that as the tool being slow.

It is not slow. It is busy.

During that pause the model is going back through everything in the room, looking for evidence before it commits to an answer. The wait is the work. If you judge the model by how fast it replies, you are measuring the wrong thing. Judge it by the quality of what it found.

Concretely, the extra effort is spent re-reading the source material, testing claims against it, holding more than one candidate answer at once, and dropping the ones the room does not support. It scales with how much context there is to search. A thin room answers fast because there is nothing in it to read. A deep room answers slowly because it is actually being read. The speed is a readout of how much digging happened, not a measure of the model's intelligence.

[Image: A timeline showing a short pause leading to a guess versus a long pause leading to a researched answer.] The longer pause is the model reading the room for evidence, not stalling.

This matters most for agentic work, where the model runs several steps on its own. A model that checks the evidence before each step is worth far more than a fast one that confidently walks off a cliff.

3. It works backwards from evidence

This is the real shift, and it is the thing I had not seen before.

The newer model goes looking for proof in your room first, then forms its answer around what it found. The older one was more likely to answer first and justify afterwards.

That order matters. Working forwards from a guess gives you something that sounds right. Working backwards from evidence gives you something that is right, or at least something you can check. It is the best I have seen from any model, and it only shows up when there is evidence in the room to find.

In practice the tell is grounding. Ask it something the room can answer and it ties the reply back to the specific material you loaded, rather than to a general impression. The failure mode of the older behaviour was the confident, plausible answer that quietly had nothing behind it. Evidence-first does not delete that risk, but it pushes it down, because the answer and its support are built in the same pass, instead of the support being bolted on after the fact.
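One rough way to test for that tell yourself is to check whether the passages a reply quotes actually exist in what you loaded. The sketch below assumes the reply marks its quotations with straight double quotes, which is my assumption and not something any model guarantees. The function names are made up for illustration, and exact substring matching is a deliberately crude stand-in for a real grounding check.

```python
import re


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not break matches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_quotes(reply: str, room_texts: list[str]) -> list[str]:
    """Return the double-quoted passages in the reply found in none of the room texts."""
    room = [_normalise(t) for t in room_texts]
    quotes = re.findall(r'"([^"]+)"', reply)
    return [q for q in quotes if not any(_normalise(q) in t for t in room)]


# Hypothetical room and reply, invented for this example.
room = ["The launch date moved to March after the Q4 review."]
reply = 'The notes say the "launch date moved to March", driven by "a budget cut".'
print(ungrounded_quotes(reply, room))  # ['a budget cut']
```

An empty list does not prove the answer is right. It only tells you the quoted support is really in the room, which is the part you can check.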

[Image: Two paths, one starting from a guess and one starting from evidence, ending at different answers.] Answer-first sounds right. Evidence-first is right, or at least checkable.

How to build the room

"Build the room" is easy to say and vague to act on, so here is the practical version. The room is everything in the context window that sits there before your question, and its quality caps the answer. The model can only dig through what you put in front of it.

In practice that means loading, ahead of the question: the source material itself, the briefs and constraints that define the task, the decisions you have already made and the reasons behind them, and a few worked examples of the output you actually want. Not a summary of those things. The things themselves. A summary is pre-chewed, and pre-chewed material leaves nothing to dig for.
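Here is a minimal sketch of what that loading could look like in code, assuming you keep each kind of material as plain text. The `Room` class, its field names, the section labels and `build_prompt` are all hypothetical, invented for this example, and nothing here calls a real model API. It only shows the ordering: the things themselves first, the short question last.

```python
from dataclasses import dataclass, field


@dataclass
class Room:
    """Everything loaded ahead of the question. All field names are illustrative."""
    sources: list[str] = field(default_factory=list)    # the source material itself
    briefs: list[str] = field(default_factory=list)     # briefs and constraints
    decisions: list[str] = field(default_factory=list)  # decisions made, with reasons
    examples: list[str] = field(default_factory=list)   # worked examples of the output

    def render(self) -> str:
        """Join the non-empty sections into one labelled block of text."""
        sections = [
            ("SOURCE MATERIAL", self.sources),
            ("BRIEFS AND CONSTRAINTS", self.briefs),
            ("DECISIONS AND REASONS", self.decisions),
            ("WORKED EXAMPLES", self.examples),
        ]
        parts = []
        for title, items in sections:
            if items:
                parts.append(f"## {title}\n" + "\n\n".join(items))
        return "\n\n".join(parts)


def build_prompt(room: Room, question: str) -> str:
    """Put the room first and the (short) question last."""
    context = room.render()
    return f"{context}\n\n## QUESTION\n{question}" if context else question


# Placeholder contents: in real use these would be the full documents.
room = Room(
    sources=["<full text of the spec>"],
    decisions=["We chose option B because of the deadline."],
)
prompt = build_prompt(room, "Which migration path fits?")
```

Note that the fields hold the full documents, not summaries of them. Empty sections are skipped, so an empty room collapses to the bare question, which is exactly the case the rest of this post warns about.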

The discipline is counter-intuitive. You spend more time assembling the room than writing the prompt. The prompt gets shorter. The room gets richer. That trade is the method.

When the dig is not worth it

Max effort is not a default to leave on. It is a cost, paid in time and tokens, and it only pays back when there is a room worth searching.

For a thin context or a simple lookup, turn it down. Otherwise you wait longer for a guess that a lighter setting would have produced just as well. Match the effort to the room. A deep room earns a deep dig. An empty one does not, and paying for one is how max effort gets its bad reputation.
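If you want to make that matching a habit rather than a judgment call each time, a small helper can pick the setting for you. This is a sketch under loud assumptions: the level names and the character thresholds are arbitrary placeholders of mine, not settings from any real model API, and you would tune them against your own results.

```python
def pick_effort(context_chars: int, simple_lookup: bool = False) -> str:
    """Choose a thinking level from how much material is loaded.

    The thresholds and level names are arbitrary placeholders, not
    values taken from any real model API.
    """
    if context_chars < 0:
        raise ValueError("context_chars cannot be negative")
    if simple_lookup or context_chars < 2_000:
        return "low"      # thin room or simple lookup: nothing worth a deep dig
    if context_chars < 50_000:
        return "medium"
    return "max"          # deep room: worth paying for the dig


print(pick_effort(500))                          # low
print(pick_effort(120_000))                      # max
print(pick_effort(120_000, simple_lookup=True))  # low
```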

That is also the honest limit of the finding. The model genuinely improved. But the part you control, and the part that moved most of my results, is the room.

Have a play

The widget below is the whole idea in one toy. Add or remove layers of context, ask the question, and watch how much the model can dig before it answers. An empty room gives you a guess. A full one gives you a researched reply.

[Interactive widget: the evidence dig. Toggle what is in the room, then ask.]

The one rule to take away

Thinking effort and context are a pair. You cannot turn one up and ignore the other.

Max effort on a thin room is an expensive guess. Max effort on a room you have built is a researcher.

Build the room before you ask the question.

The model is the part everyone talks about. The room is the part that actually decides the answer, and it is the part you control.