March 1, 2026

We Tested Reasoning Mode Across 160 Runs. It Improved Nothing.

By Christopher Swenor

I tested reasoning mode across 8 model configurations and 160 dedicated benchmark runs. It improved zero of our four workflows. One provider silently disables reasoning when tools are present. Another doubled response time for no quality gain. At enterprise scale, leaving reasoning on by default is burning real money on nothing.

The pitch for reasoning mode is irresistible: flip a switch, the model thinks harder, you get sharper output. More internal chain-of-thought. More compute. More quality. In practice, across every tool-calling workflow I tested, it delivered more compute and identical quality. Here is what I found, why it happens, and how to decide whether reasoning mode deserves a place in your stack.

Finding 1: OpenAI Silently Ignores Reasoning When Tools Are Present

This was the headline finding, and the one that should make any team using OpenAI with tools stop and check their assumptions.

OpenAI's Chat Completions API accepts a reasoning_effort parameter (low, medium, high) that controls how deeply the model reasons before answering. I tested all three levels across our workflows. The outputs came back looking like photocopies of the base model: same cost, same speed, same quality.

Cost wiggled by a few percent. Speed wiggled by a few percent. Quality did not budge. That is not what a feature looks like when it is working. That is what a parameter looks like when it disappears into the floorboards.

When tool definitions are present in the Chat Completions API request, the reasoning_effort parameter is silently ignored. No error. No warning. No "reduced capability" flag in the response. I confirmed this across 60 runs at all three reasoning levels, and the outputs were statistically indistinguishable from the base model every time.

If you are paying for reasoning on tool-calling workflows through OpenAI's Chat Completions API, you are paying for a feature that does not activate. This is a silent failure mode, which makes it dangerous. Monitoring will not catch it because nothing throws. Cost dashboards will not catch it because spend stays flat (which is itself evidence that the reasoning path never kicked in). The only way to detect it is the boring way: run a controlled comparison.

An important caveat: I tested through the Chat Completions API, which is the endpoint most production systems use. OpenAI's newer Responses API may handle reasoning differently. If your system uses the Responses API, run your own comparison before drawing conclusions.
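If you want to run that comparison yourself, a minimal sketch looks something like this. It uses the official openai Python SDK against the Chat Completions API; the model name, prompt, and tool definition are placeholders for your own workflow. The signal to watch is whether `usage.completion_tokens_details.reasoning_tokens` ever moves when tools are in the request.

```python
import time
from openai import OpenAI

client = OpenAI()

# Placeholder tool and prompt -- swap in one of your real tool-calling workflows.
tools = [{
    "type": "function",
    "function": {
        "name": "search_kb",
        "description": "Search the internal knowledge base",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]
prompt = "What does our refund policy say about annual plans?"

for effort in (None, "low", "medium", "high"):
    extra = {"reasoning_effort": effort} if effort else {}
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="o4-mini",  # assumption: any model that accepts reasoning_effort
        messages=[{"role": "user", "content": prompt}],
        tools=tools,
        **extra,
    )
    elapsed = time.perf_counter() - start
    details = getattr(resp.usage, "completion_tokens_details", None)
    reasoning_tokens = getattr(details, "reasoning_tokens", None) if details else None
    # A flat reasoning token count across every effort level means the
    # reasoning path never activated for this request shape.
    print(f"{effort or 'baseline':>8}: {elapsed:5.1f}s, reasoning_tokens={reasoning_tokens}")
```

If every row comes back with the same latency, the same spend, and a reasoning token count that never budges, you are looking at the same result I got.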

Finding 2: xAI Doubles Latency for Zero Quality Gain

If OpenAI's failure mode is silent, xAI's is impossible to miss. Grok really does spend extra time "thinking." You can watch the latency stack up. What you cannot find is the payoff.

The ugliest result was premium-tier response generation with reasoning enabled: a 2.04x latency penalty. Wall-clock time ballooned from 18 seconds to 37 seconds. Twice the wait. Same answer quality. The model took the scenic route and still arrived at the same place.

On the fast tier, reasoning added 35% to 62% more latency. On the premium tier, it added 43% to 104%. Quality did not move on any configuration.

At enterprise scale, the waste compounds fast. A team of 200 users running 50 response-generation calls per day at a 2x latency penalty burns 10,000 extra wait-seconds daily. That is 2.8 hours of user time wasted every day on reasoning tokens that produce identical output. Over a year, that is over 1,000 hours of productivity lost to a toggle that should have been off.
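Measuring this on your own traffic is cheap. Here is a rough wall-clock harness, again through the OpenAI-compatible SDK; the base URL, model name, and the exact reasoning toggle are assumptions to replace with whatever your provider and tier actually support, and whether a true no-reasoning baseline even exists on a given model is itself provider-specific.

```python
import statistics
import time
from openai import OpenAI

# Assumption: an OpenAI-compatible endpoint (xAI's is https://api.x.ai/v1);
# the model name and reasoning toggle vary by provider and tier.
client = OpenAI(base_url="https://api.x.ai/v1")

PROMPT = "Draft a short reply explaining our refund policy for annual plans."

def median_latency(runs: int = 5, **extra) -> float:
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        client.chat.completions.create(
            model="grok-3-mini",  # placeholder model name
            messages=[{"role": "user", "content": PROMPT}],
            **extra,
        )
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

base = median_latency()
reasoned = median_latency(reasoning_effort="high")  # provider-specific toggle
print(f"baseline {base:.1f}s, reasoning {reasoned:.1f}s, penalty {reasoned / base:.2f}x")
```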

Finding 3: Google Came Closest, But Not Close Enough

Google's Gemini models were the only place where reasoning mode showed a pulse. Even then, the pulse was faint.

On the cheapest tier (Flash Lite), reasoning slightly improved structural compliance: outputs were a bit more likely to follow the expected format and include required sections. On Flash and Pro, the effect ranged from negligible to slight. Cost barely moved. Latency rose 9% to 16%.

If any provider came closest to a case for reasoning on tool-calling workflows, it was Google. But "slightly cleaner structure for a noticeable speed hit" is not much of a sales pitch. If a model keeps drifting from your schema, the better fix is tighter prompting or harder schema enforcement, not paid contemplation.
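For what it is worth, schema enforcement is something you can simply ask for. A minimal sketch with the google-genai SDK and a Pydantic model (the model name and fields here are placeholders) constrains the output shape directly, without paying for reasoning tokens:

```python
from google import genai
from pydantic import BaseModel

class TicketSummary(BaseModel):
    title: str
    severity: str
    key_points: list[str]

client = genai.Client()

# Constrain the response to the schema instead of hoping reasoning mode
# nudges the model toward the right structure.
resp = client.models.generate_content(
    model="gemini-2.5-flash-lite",  # placeholder tier
    contents="Summarize this support ticket: the export job times out after 30 seconds.",
    config={
        "response_mime_type": "application/json",
        "response_schema": TicketSummary,
    },
)
print(resp.parsed)  # parsed into a TicketSummary instance
```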

Why This Happens: Tools Are Already the Reasoning

Once you stare at the numbers long enough, the explanation starts to feel obvious.

Tool-calling workflows already come with reasoning baked into the architecture. A model with tool access has to analyze the request, decide which tools to invoke, form the tool-call arguments, process the results, and then synthesize a final answer. That is not a single leap. It is a chain of explicit decisions. Each tool call is already a reasoning step with real data attached.

Reasoning mode promises step-by-step decomposition before the answer. Tool-calling workflows already do that, except the steps are grounded. The model does not need to imagine what a profile might contain if it can fetch the profile. It does not need to speculate about the knowledge base if it can search it. Adding explicit reasoning tokens on top of that is duplication, not improvement. More tokens. More waiting. Same outcome.
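To make that concrete, here is what a bare tool-calling loop looks like in Python with the OpenAI SDK; the tool, model name, and prompt are made up for illustration. Every iteration is already a discrete, grounded reasoning step, which is why there is so little left for extra chain-of-thought to decompose:

```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_profile",
        "description": "Fetch an account profile by user id",
        "parameters": {
            "type": "object",
            "properties": {"user_id": {"type": "string"}},
            "required": ["user_id"],
        },
    },
}]

def get_profile(user_id: str) -> dict:
    # Stand-in for a real lookup; this is the grounding step.
    return {"user_id": user_id, "plan": "enterprise", "seats": 200}

messages = [{"role": "user", "content": "What plan is user 42 on?"}]

while True:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=messages,
        tools=tools,
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        break  # decision chain is done; msg.content is the grounded answer
    for call in msg.tool_calls:
        # Each tool call is an explicit decision: which tool, with what arguments.
        args = json.loads(call.function.arguments)
        result = get_profile(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })

print(msg.content)
```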

There is a narrow window where reasoning earns its keep: pure mathematical reasoning, complex code generation, or dense text analysis with no tools to ground the work. In those cases the model must solve the entire problem in a single internal pass, and the extra tokens give it room to think. Most production AI systems do not look like that. They use tools, retrieval, schemas, and multi-step pipelines.

Do not enable reasoning mode by default. Treat it like any other expensive performance setting: benchmark it on your actual workload, compare the quality delta against the latency and cost hit, and keep it off unless the evidence is unambiguous. For tool-calling workflows specifically, the default should always be off. The 160 runs I spent on reasoning variants yielded one clear operational decision: turn it off and keep the speed. For an organization with 500 users, that decision reclaims over 2,500 hours of wait time per year.
