Debugging Opus 4.6: Why Claude Code's Reasoning Depth Dropped 67% and What to Do About It

Two configuration changes. Zero announcements. A 67% drop in reasoning depth across 6,852 sessions.

If you’ve been building with Claude Code and noticed degraded output quality since February 2026, this post traces the exact root cause and walks through the fix.

Symptoms

Developers started reporting these patterns in early April:

  • Multi-step reasoning tasks that previously succeeded now fail or produce incomplete results
  • The model skips reading files it should analyze
  • Fabricated API references and non-existent function calls appear in output
  • Simple tasks still work fine, but anything requiring depth breaks down

The pattern is consistent: shallow tasks are unaffected, deep tasks are degraded.

Quantifying the Damage

BridgeBench Data

BridgeBench measures hallucination in code-analysis tasks: 30 tasks, 175 questions, with execution-verified ground truth.

Opus 4.6 moved from #2 (83.3% accuracy, ~17% fabrication) to #10 (68.3% accuracy, 33% fabrication) in the span of weeks.

The full picture:

Model                 Accuracy   Fabrication Rate   Rank
Grok 4.20 Reasoning   91.8%      10.0%              #1
GPT-5.4               86.1%      16.7%              #2
Claude Opus 4.5       72.3%      27.9%              #6
Claude Opus 4.6       68.3%      33.0%              #10

Two things stand out:

  1. Opus 4.6 is less accurate than its predecessor (4.5)
  2. Sonnet 4.6 (72.4% accuracy) — a smaller model — outperforms it

Session-Level Analysis

An AMD executive’s analysis of 6,852 sessions quantified a 67% drop in reasoning depth. Developer Om Patel’s controlled A/B test (same prompt, 4.6 vs 4.5) showed 4.6 failing 5/5 times while 4.5 passed 5/5. His tweet documenting this reached 682K views.
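A/B checks like this are straightforward to script. Below is a minimal canary harness sketch: it runs a command several times and greps the output for a marker string that the prompt asks the model to emit. The `run_canary` helper, the marker convention, and the example `claude -p` invocation are illustrative assumptions, not Om Patel's actual script.

```shell
# Minimal canary harness (sketch). run_canary executes a model command
# N times and reports how often a pass-marker appears in its output.
run_canary() {
  local cmd="$1" marker="$2" runs="${3:-5}" passes=0
  for _ in $(seq "$runs"); do
    # A run "passes" if the marker shows up anywhere in the output.
    if eval "$cmd" 2>/dev/null | grep -q "$marker"; then
      passes=$((passes + 1))
    fi
  done
  echo "${passes}/${runs}"
}

# Example (assumes the claude CLI's -p print mode; prompt is illustrative):
# run_canary 'claude -p "Read main.py, list every function, then print DONE"' DONE 5
```

Running the same harness before and after a suspected provider change gives you a pass-rate delta instead of a gut feeling.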

Root Cause Analysis

Two changes in Anthropic’s defaults compound to create the degradation:

Change 1: Effort Level Default (March 3, 2026)

The effort parameter controls how much reasoning the model applies. It was changed from high to medium.

Under medium, the model applies a cost-saving heuristic: estimate task complexity, allocate proportional effort. The failure mode is systematic underestimation — complex tasks get classified as simple and receive insufficient reasoning.

Change 2: Adaptive Thinking (February 9, 2026)

A new mechanism lets the model dynamically allocate reasoning tokens per conversation turn. Under medium effort, this can result in zero reasoning tokens for certain turns.

The interaction between these changes is the core issue: medium effort + adaptive thinking = the model sometimes literally doesn’t think before responding.

Fix: Three Tiers

Tier 1: Per-Session Override

/effort max

Forces maximum reasoning depth for the current session. Must be re-applied each time.

Tier 2: Permanent Environment Configuration

# Add to .bashrc / .zshrc
export CLAUDE_CODE_EFFORT_LEVEL=max
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1

This persists across sessions and prevents adaptive thinking from zeroing out reasoning tokens.
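To confirm the overrides are actually picked up after reloading your shell, a quick sanity check over the two variables from the snippet above (the `check_claude_env` helper name is mine):

```shell
# Print each override variable from the Tier 2 snippet, flagging any
# that are unset; the last line is "ok" only if both are present.
check_claude_env() {
  status=ok
  for var in CLAUDE_CODE_EFFORT_LEVEL CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING; do
    val="$(printenv "$var")"
    if [ -z "$val" ]; then
      echo "MISSING: $var"
      status=missing
    else
      echo "$var=$val"
    fi
  done
  echo "$status"
}
```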

Tier 3: Model Fallback

Switch to Opus 4.5: claude-opus-4-5-20251101

Trade-off: slower inference, higher token cost, but consistent reasoning quality.
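One way to pin the fallback without touching code is at launch time. The `--model` flag and `ANTHROPIC_MODEL` variable shown here follow Claude Code's CLI conventions, but verify both against the docs for your installed version:

```shell
# Pin the Opus 4.5 fallback (flag and variable names are assumptions
# based on Claude Code's CLI conventions -- check your version's docs).

# Per-invocation:
# claude --model claude-opus-4-5-20251101

# Shell-wide, alongside the Tier 2 exports:
export ANTHROPIC_MODEL=claude-opus-4-5-20251101
echo "pinned model: $ANTHROPIC_MODEL"
```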

Architecture Consideration: Model Routing

This incident highlights a practical problem for teams using LLMs in production: when a provider silently changes model behavior, you need the ability to reroute without code changes.

EvoLink’s unified API gateway provides a single endpoint for 30+ models. The Smart Router (evolink/auto) can route reasoning-heavy tasks to models with lower hallucination rates automatically. When model quality is a moving target, routing flexibility is an architectural requirement.
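As a concrete sketch, a gateway call might look like the following. The endpoint URL, the OpenAI-style chat-completions payload, and the `EVOLINK_API_KEY` variable are all assumptions for illustration; EvoLink's own API docs are authoritative.

```shell
# Hypothetical request to a unified gateway routing via evolink/auto.
# URL, payload shape, and env var name are illustrative assumptions.
REQUEST='{
  "model": "evolink/auto",
  "messages": [{"role": "user", "content": "Trace the cause of this flaky test"}]
}'

# Only fire the request if a key is configured.
if [ -n "${EVOLINK_API_KEY:-}" ]; then
  curl -s https://api.evolink.ai/v1/chat/completions \
    -H "Authorization: Bearer $EVOLINK_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$REQUEST"
fi
```

Because the model field is a routing alias rather than a hard-coded model ID, rerouting after an incident like this one is a gateway-side change, not a code change.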

Timeline

Date           Event
Feb 9, 2026    Adaptive thinking introduced
Mar 3, 2026    Effort default changed: high → medium
Apr 10, 2026   Om Patel’s canary test tweet (682K views)
Apr 14, 2026   BridgeBench confirms #10 ranking, 33% fabrication

What to Watch

  • Whether Anthropic reverts the effort default or refines adaptive thinking
  • BridgeBench trajectory over coming weeks
  • Community-developed canary tests for detecting silent model changes

Sources: BridgeBench (bridgebench.ai/hallucination), @om_patel5 on X, GitHub Issue #42796, Digit.in, pasqualepillitteri.it
