DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
DeepThinkVLA is a Vision-Language-Action (VLA) model that enhances robot reasoning through explicit deliberation. It refactors the policy into a 2.9B-parameter hybrid decoder that generates a chain-of-thought (CoT) reasoning trace before emitting action chunks.
- Paper: DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
- Repository: https://github.com/OpenBMB/DeepThinkVLA
Overview
DeepThinkVLA identifies two necessary conditions for effective reasoning in robotics:
- Decoding Alignment: Reasoning and actions are generated using modality-appropriate mechanisms (causal attention for language and bidirectional attention for parallel action decoding).
- Causal Alignment: The reasoning chain is causally linked to task success via a two-stage training pipeline (SFT then Reinforcement Learning).
The model achieves a 97.0% success rate on the LIBERO benchmark and demonstrates significant robustness under distribution shifts in LIBERO-Plus.
Architecture
DeepThinkVLA inserts a <think> segment between observations and actions. Reasoning tokens are generated autoregressively, after which the decoder switches to bidirectional attention to emit action vectors in parallel. This hybrid approach resolves the modality conflict found in naive autoregressive VLA models.
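The snippet below is a minimal sketch of such a hybrid attention mask, assuming a prefix of observation/instruction tokens, a <think> segment, and a trailing action chunk; the helper name and token counts are illustrative and not taken from the released implementation.

```python
import torch

def hybrid_attention_mask(num_prefix: int, num_reason: int, num_action: int) -> torch.Tensor:
    """Build a boolean attention mask (True = attend) in which prefix and
    reasoning tokens attend causally, while action tokens attend to every
    earlier token and to each other (bidirectional, parallel decoding)."""
    n = num_prefix + num_reason + num_action
    # Standard causal (lower-triangular) mask over the whole sequence.
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    # Action tokens additionally attend to all other action tokens.
    action_start = num_prefix + num_reason
    mask[action_start:, action_start:] = True
    return mask

# Example: 10 observation/instruction tokens, 6 <think> tokens, 8 action tokens.
print(hybrid_attention_mask(10, 6, 8).int())
```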
Training Pipeline
- SFT Cold Start: Uses a two-stage CoT data engine to distill reasoning traces from cloud LVLMs and scale them to full trajectories.
- Outcome-Driven RL: Employs Group Relative Policy Optimization (GRPO) to align the reasoning-action chain with sparse task-success rewards.
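As a rough illustration of the outcome-driven RL stage, the sketch below computes GRPO-style group-relative advantages from sparse 0/1 task-success rewards; the function and example values are hypothetical and only convey the idea, not the authors' training code.

```python
import numpy as np

def group_relative_advantages(success: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """For a group of rollouts of the same task, use the group mean of the
    sparse success reward as the baseline and normalize by the group spread."""
    baseline = success.mean()
    spread = success.std()
    return (success - baseline) / (spread + eps)

# Example: 8 rollouts of one instruction, of which 3 succeeded.
rewards = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=np.float32)
print(group_relative_advantages(rewards))  # positive for successes, negative otherwise
```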
Citation
```bibtex
@article{yin2025deepthinkvla,
  title={DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models},
  author={Yin, Cheng and Lin, Yankai and Xu, Wang and Tam, Sikyuen and Zeng, Xiangrui and Liu, Zhiyuan and Yin, Zhouping},
  journal={arXiv preprint arXiv:2511.15669},
  year={2025}
}
```