DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
DeepThinkVLA is a Vision-Language-Action (VLA) model that enhances robot reasoning through explicit deliberation. It refactors the policy into a 2.9B-parameter hybrid decoder that generates a chain-of-thought (CoT) reasoning trace before emitting action chunks.
- Paper: DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
- Repository: https://github.com/OpenBMB/DeepThinkVLA
Overview
DeepThinkVLA identifies two necessary conditions for effective reasoning in robotics:
- Decoding Alignment: Reasoning and actions are generated using modality-appropriate mechanisms (causal attention for language and bidirectional attention for parallel action decoding).
- Causal Alignment: The reasoning chain is causally linked to task success via a two-stage training pipeline (SFT then Reinforcement Learning).
The model achieves a 97.0% success rate on the LIBERO benchmark and demonstrates significant robustness under distribution shifts in LIBERO-Plus.
Architecture
DeepThinkVLA inserts a <think> segment between observations and actions. Reasoning tokens are generated autoregressively, after which the decoder switches to bidirectional attention to emit action vectors in parallel. This hybrid approach resolves the modality conflict found in naive autoregressive VLA models.
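The snippet below is a minimal sketch of such a hybrid attention mask, assuming a prefix of observation/instruction tokens, a <think> segment, and a trailing action chunk; the helper name and token counts are illustrative and not taken from the released implementation.

```python
import torch

def hybrid_attention_mask(num_prefix: int, num_reason: int, num_action: int) -> torch.Tensor:
    """Build a boolean attention mask (True = attend) in which prefix and
    reasoning tokens attend causally, while action tokens attend to every
    earlier token and to each other (bidirectional, parallel decoding)."""
    n = num_prefix + num_reason + num_action
    # Standard causal (lower-triangular) mask over the whole sequence.
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    # Action tokens additionally attend to all other action tokens.
    action_start = num_prefix + num_reason
    mask[action_start:, action_start:] = True
    return mask

# Example: 10 observation/instruction tokens, 6 <think> tokens, 8 action tokens.
print(hybrid_attention_mask(10, 6, 8).int())
```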
Training Pipeline
- SFT Cold Start: Uses a two-stage CoT data engine to distill reasoning traces from cloud LVLMs and scale them to full trajectories.
- Outcome-Driven RL: Employs Group Relative Policy Optimization (GRPO) to align the reasoning-action chain with sparse task-success rewards.
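As a rough illustration of the outcome-driven RL stage, the sketch below computes GRPO-style group-relative advantages from sparse 0/1 task-success rewards; the function and example values are hypothetical and only convey the idea, not the authors' training code.

```python
import numpy as np

def group_relative_advantages(success: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """For a group of rollouts of the same task, use the group mean of the
    sparse success reward as the baseline and normalize by the group spread."""
    baseline = success.mean()
    spread = success.std()
    return (success - baseline) / (spread + eps)

# Example: 8 rollouts of one instruction, of which 3 succeeded.
rewards = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=np.float32)
print(group_relative_advantages(rewards))  # positive for successes, negative otherwise
```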
Citation
```bibtex
@article{yin2025deepthinkvla,
  title={DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models},
  author={Yin, Cheng and Lin, Yankai and Xu, Wang and Tam, Sikyuen and Zeng, Xiangrui and Liu, Zhiyuan and Yin, Zhouping},
  journal={arXiv preprint arXiv:2511.15669},
  year={2025}
}
```