Debugging Opus 4.6: Why Claude Code's Reasoning Depth Dropped 67% and What to Do About It

Two configuration changes. Zero announcements. A 67% drop in reasoning depth across 6,852 sessions.

If you’ve been building with Claude Code and noticed degraded output quality since February 2026, this post traces the exact root cause and walks through the fix.

Symptoms

Developers started reporting these patterns in early April:

  • Multi-step reasoning tasks that previously succeeded now fail or produce incomplete results
  • The model skips reading files it should analyze
  • Fabricated API references and non-existent function calls appear in output
  • Simple tasks still work fine, but anything requiring depth breaks down

The pattern is consistent: shallow tasks are unaffected, deep tasks are degraded.

Quantifying the Damage

BridgeBench Data

BridgeBench measures hallucination in code-analysis tasks: 30 tasks, 175 questions, with execution-verified ground truth.

Opus 4.6 moved from #2 (83.3% accuracy, ~17% fabrication) to #10 (68.3% accuracy, 33% fabrication) in the span of weeks.

The full picture:

Model                 Accuracy   Fabrication Rate   Rank
Grok 4.20 Reasoning   91.8%      10.0%              #1
GPT-5.4               86.1%      16.7%              #2
Claude Opus 4.5       72.3%      27.9%              #6
Claude Opus 4.6       68.3%      33.0%              #10

Two things stand out:

  1. Opus 4.6 is less accurate than its predecessor (4.5)
  2. Sonnet 4.6 (72.4% accuracy) — a smaller model — outperforms it

Session-Level Analysis

An AMD executive’s analysis of 6,852 sessions quantified a 67% drop in reasoning depth. Developer Om Patel’s controlled A/B test (same prompt, 4.6 vs 4.5) showed 4.6 failing 5/5 times while 4.5 passed 5/5. His tweet documenting this reached 682K views.
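A/B checks like this are straightforward to script. Below is a minimal canary harness sketch: it runs a command several times and greps the output for a marker string that the prompt asks the model to emit. The `run_canary` helper, the marker convention, and the example `claude -p` invocation are illustrative assumptions, not Om Patel's actual script.

```shell
# Minimal canary harness (sketch). run_canary executes a model command
# N times and reports how often a pass-marker appears in its output.
run_canary() {
  local cmd="$1" marker="$2" runs="${3:-5}" passes=0
  for _ in $(seq "$runs"); do
    # A run "passes" if the marker shows up anywhere in the output.
    if eval "$cmd" 2>/dev/null | grep -q "$marker"; then
      passes=$((passes + 1))
    fi
  done
  echo "${passes}/${runs}"
}

# Example (assumes the claude CLI's -p print mode; prompt is illustrative):
# run_canary 'claude -p "Read main.py, list every function, then print DONE"' DONE 5
```

Running the same harness before and after a suspected provider change gives you a pass-rate delta instead of a gut feeling.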

Root Cause Analysis

Two changes in Anthropic’s defaults compound to create the degradation:

Change 1: Effort Level Default (March 3, 2026)

The effort parameter controls how much reasoning the model applies. It was changed from high to medium.

Under medium, the model applies a cost-saving heuristic: estimate task complexity, allocate proportional effort. The failure mode is systematic underestimation — complex tasks get classified as simple and receive insufficient reasoning.

Change 2: Adaptive Thinking (February 9, 2026)

A new mechanism lets the model dynamically allocate reasoning tokens per conversation turn. Under medium effort, this can result in zero reasoning tokens for certain turns.

The interaction between these changes is the core issue: medium effort + adaptive thinking = the model sometimes literally doesn’t think before responding.

Fix: Three Tiers

Tier 1: Per-Session Override

/effort max

Forces maximum reasoning depth for the current session. Must be re-applied each time.

Tier 2: Permanent Environment Configuration

# Add to .bashrc / .zshrc
export CLAUDE_CODE_EFFORT_LEVEL=max
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1

This persists across sessions and prevents adaptive thinking from zeroing out reasoning tokens.
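To confirm the overrides are actually picked up after reloading your shell, a quick sanity check over the two variables from the snippet above (the `check_claude_env` helper name is mine):

```shell
# Print each override variable from the Tier 2 snippet, flagging any
# that are unset; the last line is "ok" only if both are present.
check_claude_env() {
  status=ok
  for var in CLAUDE_CODE_EFFORT_LEVEL CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING; do
    val="$(printenv "$var")"
    if [ -z "$val" ]; then
      echo "MISSING: $var"
      status=missing
    else
      echo "$var=$val"
    fi
  done
  echo "$status"
}
```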

Tier 3: Model Fallback

Switch to Opus 4.5: claude-opus-4-5-20251101

Trade-off: slower inference, higher token cost, but consistent reasoning quality.
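One way to pin the fallback without touching code is at launch time. The `--model` flag and `ANTHROPIC_MODEL` variable shown here follow Claude Code's CLI conventions, but verify both against the docs for your installed version:

```shell
# Pin the Opus 4.5 fallback (flag and variable names are assumptions
# based on Claude Code's CLI conventions -- check your version's docs).

# Per-invocation:
# claude --model claude-opus-4-5-20251101

# Shell-wide, alongside the Tier 2 exports:
export ANTHROPIC_MODEL=claude-opus-4-5-20251101
echo "pinned model: $ANTHROPIC_MODEL"
```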

Architecture Consideration: Model Routing

This incident highlights a practical problem for teams using LLMs in production: when a provider silently changes model behavior, you need the ability to reroute without code changes.

EvoLink’s unified API gateway provides a single endpoint for 30+ models. The Smart Router (evolink/auto) can route reasoning-heavy tasks to models with lower hallucination rates automatically. When model quality is a moving target, routing flexibility is an architectural requirement.
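As a concrete sketch, a gateway call might look like the following. The endpoint URL, the OpenAI-style chat-completions payload, and the `EVOLINK_API_KEY` variable are all assumptions for illustration; EvoLink's own API docs are authoritative.

```shell
# Hypothetical request to a unified gateway routing via evolink/auto.
# URL, payload shape, and env var name are illustrative assumptions.
REQUEST='{
  "model": "evolink/auto",
  "messages": [{"role": "user", "content": "Trace the cause of this flaky test"}]
}'

# Only fire the request if a key is configured.
if [ -n "${EVOLINK_API_KEY:-}" ]; then
  curl -s https://api.evolink.ai/v1/chat/completions \
    -H "Authorization: Bearer $EVOLINK_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$REQUEST"
fi
```

Because the model field is a routing alias rather than a hard-coded model ID, rerouting after an incident like this one is a gateway-side change, not a code change.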

Timeline

Date           Event
Feb 9, 2026    Adaptive thinking introduced
Mar 3, 2026    Effort default changed: high → medium
Apr 10, 2026   Om Patel’s canary test tweet (682K views)
Apr 14, 2026   BridgeBench confirms #10 ranking, 33% fabrication

What to Watch

  • Whether Anthropic reverts the effort default or refines adaptive thinking
  • BridgeBench trajectory over coming weeks
  • Community-developed canary tests for detecting silent model changes

Sources: BridgeBench (bridgebench.ai/hallucination), @om_patel5 on X, GitHub Issue #42796, Digit.in, pasqualepillitteri.it
