Weekly Research Digest — 2026-06-15

11 new entries this week across 3 topic areas.

Vision-Language-Action (VLA) Models

Release	Venue	Significance
SeeTraceAct	arXiv 2606.02745	Demo-conditioned VLA with visibility-aware end-effector trace prediction; introduces RoboCasa-DC cross-embodiment benchmark; +12.5pp real-world success
3DThinkVLA	arXiv 2606.04436	Injects latent 3D geometry and reasoning priors into VLAs via co-training + anchor token; fixes prompt-induced reasoning gap without backbone changes
AffordanceVLA	arXiv 2606.06155	Which/Where/How2Act affordance modules + MoT architecture bridge VLM semantics to precise robot control; includes automated affordance data pipeline
MemoryVLA++	arXiv 2606.09827	Cognitive-science-inspired temporal VLA with working memory, episodic memory, and imagination; +9/26/28% on general/memory/imagination-dependent tasks
Hierarchical VLA Agents (Google DeepMind)	arXiv 2606.10267	First systematic options-framework study of Hi-VLA design; distils practical principles for planner/controller interfaces across short- and long-horizon tasks

World Models for Robotics

Release	Venue	Significance
τ₀-WM (AgiBot)	arXiv 2606.01027	5B-parameter open robotic foundation model trained on 27.3K hours; unifies policy, video prediction, and action evaluation in one diffusion backbone
MotionWAM	arXiv 2606.09215	Real-time (4.9 Hz, 7× faster than Cosmos Policy) unified WAM for humanoid loco-manipulation; removes upper/lower-body split with a single motion latent
Making Foresight Actionable (AGRA)	arXiv 2606.12217	AGRA objective aligns video diffusion features to a semantic encoder to fix the reconstruction-vs-control representation mismatch in WAMs (HKU/XPENG)
RepWAM	arXiv 2606.13674	Replaces reconstruction tokenizers in WAMs with semantically aligned visual-action tokenizers; strong gains across real-world manipulation and simulation
Targeting World Models (Adversarial)	arXiv 2606.09499	First formal study of data-poisoning attacks through world models in robot learning pipelines; highlights critical supply-chain security gap

Reinforcement Learning for Robotics

Release	Venue	Significance
SARM2 + SPIRAL	arXiv 2606.10305	Multi-task stage-aware reward model (MMoE + action-primitive vocabulary) + SPIRAL self-improvement loop; autonomous real-robot policy improvement without new demos

Generated automatically. All entries verified via web search.