DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

Summary

DualCoT-VLA addresses two limitations of standard VLA models: they struggle to capture fine-grained spatial detail (low-level) and logical task structure (high-level) at the same time, and step-by-step autoregressive CoT decoding makes inference slow. The method pairs a Visual CoT for spatial understanding with a Linguistic CoT for task planning, and replaces sequential autoregressive reasoning with a parallel mechanism in which learnable query tokens complete both CoT paths in a single forward pass.

Key Contributions

Dual-modal CoT: Visual CoT branch for low-level spatial understanding, Linguistic CoT branch for high-level planning
Parallel CoT mechanism with two sets of learnable query tokens, collapsing multi-step autoregressive reasoning to a single forward pass
Eliminates compounding errors from sequential CoT decoding while preserving reasoning quality
State-of-the-art results on the LIBERO and RoboCasa GR1 benchmarks and on real-robot platforms
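
The parallel mechanism in the contributions above can be illustrated with a toy sketch. It is not the authors' implementation: a single dot-product attention layer stands in for the full VLM backbone, and every name here (attend, make_tokens, parallel_dual_cot), the token counts, and the dimension are assumptions chosen for illustration. The point it shows is structural: two sets of query tokens are concatenated, read the observation tokens in one pass, and are then split back into a visual and a linguistic reasoning output, with no token-by-token decoding loop.

```python
import math
import random


def attend(queries, context):
    """Single-head dot-product attention: every query reads all context vectors.

    Assumption: this one layer is a stand-in for a full transformer backbone.
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w / total * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def make_tokens(count, dim, rng):
    """Hypothetical helper: random vectors standing in for embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def parallel_dual_cot(obs_tokens, visual_queries, linguistic_queries):
    """Run both query sets through one pass, then split the result per branch."""
    queries = visual_queries + linguistic_queries
    outputs = attend(queries, obs_tokens)  # one pass, no autoregressive loop
    n_vis = len(visual_queries)
    return outputs[:n_vis], outputs[n_vis:]


rng = random.Random(0)
DIM = 8                                        # illustrative embedding size
obs_tokens = make_tokens(16, DIM, rng)         # image + instruction tokens
visual_queries = make_tokens(4, DIM, rng)      # learnable in a real model
linguistic_queries = make_tokens(6, DIM, rng)  # learnable in a real model

visual_cot, linguistic_cot = parallel_dual_cot(
    obs_tokens, visual_queries, linguistic_queries
)
print(len(visual_cot), len(linguistic_cot))  # 4 6
```

In a trained model the query tokens would be parameters optimized so that their outputs carry the spatial and planning information that a decoded CoT would otherwise spell out one token at a time. Here they are random, so only the data flow is meaningful.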

Significance

DualCoT-VLA resolves the latency-vs-reasoning trade-off that has limited prior CoT-based VLAs, making dual-modal chain-of-thought practical for real-time robot deployment.
