AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

Summary

AffordanceVLA addresses the structural mismatch between the semantic feature space of a vision-language model (VLM) and the robot action space by introducing structured affordance forecasting as a task-oriented intermediate representation. Three complementary modules, Which2Act (object-centric visual latent grounding), Where2Act, and How2Act, are integrated into a Mixture-of-Transformer (MoT) architecture with specialized experts that bridge high-level scene semantics to precise low-level control.

Key Contributions

Which2Act: object-centric grounding via visual latent prediction that suppresses task-irrelevant distractors
Three-component affordance framework (Which2Act, Where2Act, How2Act) providing spatially grounded, semantically conditioned, and action-coupled intermediate representations
Mixture-of-Transformer (MoT) architecture with task-specific expert routing trained via a three-stage progressive data curriculum
Automated affordance data-augmentation pipeline to overcome the scarcity of dense affordance labels in standard robot datasets
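
To make the expert-routing idea concrete, here is a minimal sketch of one Mixture-of-Transformer-style layer. It is not the paper's implementation. It assumes that tokens carry a type tag ("which", "where", "how", "action"), that attention is shared across all tokens, and that each token is then sent to the feed-forward expert matching its tag. The names Token, EXPERTS, make_expert, and mot_layer are hypothetical, and the experts are toy scale-and-shift functions rather than learned networks.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class Token:
    kind: str         # hypothetical type tag: "which", "where", "how", or "action"
    features: Vector  # toy feature vector standing in for a hidden state


def shared_attention(xs: List[Vector]) -> List[Vector]:
    """Single-head dot-product self-attention over all tokens.

    No learned projections: queries, keys, and values are the inputs themselves.
    This stands in for the attention that all experts share (an assumption).
    """
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[i] for w, v in zip(weights, xs)) for i in range(d)])
    return out


def make_expert(scale: float, bias: float) -> Callable[[Vector], Vector]:
    """Build a toy feed-forward expert: ReLU(scale * x + bias), element-wise."""
    def expert(x: Vector) -> Vector:
        return [max(0.0, scale * v + bias) for v in x]
    return expert


# One expert per token type; the parameter values are arbitrary placeholders.
EXPERTS: Dict[str, Callable[[Vector], Vector]] = {
    "which": make_expert(1.0, 0.0),
    "where": make_expert(2.0, 0.0),
    "how": make_expert(0.5, 0.0),
    "action": make_expert(1.0, -1.0),
}


def mot_layer(tokens: List[Token]) -> List[Token]:
    """Mix information across all tokens, then route each token to its own expert."""
    mixed = shared_attention([t.features for t in tokens])
    return [Token(t.kind, EXPERTS[t.kind](h)) for t, h in zip(tokens, mixed)]


if __name__ == "__main__":
    sequence = [
        Token("which", [1.0, 0.0]),
        Token("where", [0.0, 1.0]),
        Token("how", [1.0, 1.0]),
        Token("action", [0.5, 0.5]),
    ]
    for tok in mot_layer(sequence):
        print(tok.kind, tok.features)
```

Routing here is deterministic and keyed on the token type, so no gating network is needed. Whether AffordanceVLA routes this way or uses a learned router is not stated in this summary.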

Significance

Bridges the semantic-to-motor gap in VLAs by decomposing action generation into structured affordance sub-problems, yielding more interpretable and precise manipulation policies without requiring privileged 3D or depth information.
