Weekly Research Digest — 2026-06-29

13 new entries this week across 3 topic areas.


Vision-Language-Action (VLA) Models

Entry | Release | Venue | Significance
qwen-vla-unifying-vision-language-action-modeling | Qwen-VLA | arXiv 2605.30280 | Alibaba's unified VLA across manipulation, navigation, and embodiments; 97.9% on LIBERO, strong cross-embodiment OOD generalization
robots-need-more-than-vla-and-world-models | Robots Need More than VLA and World Models | arXiv 2606.06556 | Position paper from ETH/Tübingen/IIT: identifies four missing “interfaces” (data, embodiment, world-model, reward) blocking VLA scaling
geometric-action-model-robot-policy-learning | Geometric Action Model (GAM) | arXiv 2606.17046 | KAIST/ETH Zurich: repurposes a 3D geometric foundation model as a robot policy substrate; 55× faster than VLA/WAM baselines
la4vla-learning-to-act-without-seeing | LA4VLA | arXiv 2606.27295 | Language-action pretraining without vision decouples language grounding from visual shortcuts; +45 pp real-world success

World Models for Robotics

Entry | Release | Venue | Significance
nvidia-cosmos-3-omnimodal-world-models-physical-ai | NVIDIA Cosmos 3 | Technical Report 2606.02800 | World’s first open omnimodal world model (text/image/video/audio/action); #1 on WorldModelBench Robot, #1 RoboArena policy
veo-act-frontier-video-models-robot-manipulation | Veo-Act | arXiv 2604.04502 | Benchmarks the Veo-3 frontier video model as a zero-shot robot planner; a hierarchical Veo-3 + VLA framework closes the low-level gap
oa-wam-object-addressable-world-action-model | OA-WAM | arXiv 2605.06481 | Object-addressable slot states (persistent address + time-varying content) solve identity entanglement in holistic WAMs
flash-wam-modality-aware-distillation-world-action-models | Flash-WAM | arXiv 2606.05254 | Modality-aware consistency distillation solves video-action SNR asymmetry; enables few-step WAM inference
aha-wam-asynchronous-horizon-adaptive-world-action-modeling | AHA-WAM | arXiv 2606.09811 | Dual DiT (low-frequency world planner + high-frequency action DiT) with OVCR; 92.80% on RoboTwin, 24 Hz, 4.6× speedup
efficient-wam-1b-low-cost-future-imagination | Efficient-WAM | arXiv 2606.10040 | 1B-parameter WAM with coarse future guidance; ~100 ms/chunk, 30× speedup, showing that photorealistic video is unnecessary
kairos-native-world-model-stack-physical-ai | Kairos | arXiv 2606.16533 | ACE Robotics 4B open world model; hybrid linear attention, #1 on WorldModelBench Robot, beats 28B models at a fraction of the cost
imagewam-image-editing-vs-video-generation-world-action-models | ImageWAM | arXiv 2606.19531 | Replaces video generation with single-image editing; 6× FLOP and 4× latency reduction, arguing that next-frame editing is sufficient

Reinforcement Learning for Robotics

Entry | Release | Venue | Significance
march-model-assisted-rl-humanoid-perceptive-control-sparse-footholds | MARCH | arXiv 2606.10288 | Model-assisted RL (CLF reward from simplified dynamics) + teacher-student distillation for humanoid sparse-foothold locomotion

Generated automatically. All entries verified via web search.