ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation

Summary

ThinkingVLA is a vision-language-action (VLA) model that decomposes manipulation planning into forward and inverse chain-of-thought (CoT) reasoning within a unified Mixture-of-Transformers (MoT) architecture. The forward CoT identifies the immediate subgoal and guides a visual forecast of the next state; the inverse CoT then reasons about spatial relationships and action intent, conditioned on that predicted image, before the model generates the final action.
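
The forward-then-inverse flow described above can be sketched as a single control step. This is only an illustrative skeleton, not the paper's implementation: the four callables, the `StepTrace` record, and the `Image`/`Action` aliases are hypothetical names, and passing the current observation to the inverse CoT and the action head alongside the predicted image is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins; a real system would use image tensors and robot commands.
Image = List[List[float]]
Action = List[float]


@dataclass
class StepTrace:
    """Intermediate outputs of one reasoning step, kept for inspection."""
    subgoal: str             # forward CoT output
    predicted_image: Image   # visual forecast of the next state
    rationale: str           # inverse CoT output
    action: Action           # final action


def thinking_step(
    observation: Image,
    instruction: str,
    forward_cot: Callable[[Image, str], str],
    forecast: Callable[[Image, str], Image],
    inverse_cot: Callable[[Image, Image, str], str],
    act: Callable[[Image, Image, str], Action],
) -> StepTrace:
    # 1. Forward CoT: identify the immediate subgoal from the observation and task.
    subgoal = forward_cot(observation, instruction)
    # 2. Visual forecasting: predict the next state, guided by the subgoal.
    predicted = forecast(observation, subgoal)
    # 3. Inverse CoT: reason about spatial relations and action intent,
    #    conditioned on the predicted image (observation input is an assumption).
    rationale = inverse_cot(observation, predicted, subgoal)
    # 4. Action generation, produced only after both reasoning passes.
    action = act(observation, predicted, rationale)
    return StepTrace(subgoal, predicted, rationale, action)
```

Keeping every intermediate output in the returned record reflects the interpretability argument: each action can be traced back to a stated subgoal, a predicted image, and a rationale.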

Key Contributions

Unified Mixture-of-Transformers (MoT) architecture jointly handling visual prediction and action generation
Forward CoT identifies subgoals and predicts the target visual state as an intermediate representation
Inverse CoT grounds action generation in the visually predicted future state, enabling richer spatial reasoning
Consistently outperforms state-of-the-art baselines, especially on long-horizon manipulation tasks

Significance

ThinkingVLA addresses a key limitation of standard VLAs: they map observations directly to actions without intermediate reasoning, which makes them brittle on long-horizon tasks. Its interleaved vision-language reasoning is a strong step toward interpretable, compositional robot policies.
