Summary

DualCoT-VLA addresses two failure modes in standard VLA models: inability to simultaneously capture fine-grained spatial details (low-level) and logical task structure (high-level), and high inference latency from step-by-step autoregressive CoT decoding. The method combines a Visual CoT for spatial understanding with a Linguistic CoT for task planning, and replaces sequential autoregressive reasoning with a parallel mechanism using learnable query tokens that complete both CoT paths in a single forward pass.

Key Contributions

  • Dual-modal CoT: Visual CoT branch for low-level spatial understanding, Linguistic CoT branch for high-level planning
  • Parallel CoT mechanism with two sets of learnable query tokens, collapsing multi-step autoregressive reasoning to a single forward pass
  • Eliminates compounding errors from sequential CoT decoding while preserving reasoning quality
  • State-of-the-art results on LIBERO and RoboCasa GR1 benchmarks plus real-robot platforms

Significance

DualCoT-VLA resolves the latency-vs-reasoning trade-off that has limited prior CoT-based VLAs, making dual-modal chain-of-thought practical for real-time robot deployment.