Fast-WAM: Do World Action Models Need Test-time Future Imagination?

¹IIIS, Tsinghua University · ²Galaxea AI

TL;DR

Q

Do WAMs benefit mainly from explicit future imagination at inference, or from video modeling during training?

A

The main benefit comes from video co-training during training. Fast-WAM removes explicit future generation at test time, stays competitive, and runs much faster.

Three representative WAM paradigms and the Fast-WAM design.

Fast-WAM keeps video co-training during training but removes explicit future generation at inference time, enabling direct action generation from latent world representations in a single forward pass.

Abstract

World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance.

We ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. To answer this, we propose Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. Across controlled variants, Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop.

Empirically, Fast-WAM achieves competitive results on LIBERO, RoboTwin, and real-world towel folding without embodied pretraining. It runs in real time with 190 ms latency, more than 4x faster than existing imagine-then-execute WAM designs.

Core Question

Do WAMs benefit mainly from explicit future imagination at inference, or from video modeling during training?

Design

Fast-WAM preserves world-model supervision during training while using a direct-policy interface at test time.

Finding

Video co-training matters much more than explicit test-time future generation for final control performance.

A WAM That Drops Future Imagination at Test Time

Fast-WAM model architecture.

Fast-WAM architecture with a shared-attention Mixture-of-Transformer design.

Training and inference attention masks.

Structured attention masks disentangle video co-training from action generation.
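How such masks can disentangle the two branches is easy to see in a toy sketch. The token layout `[observation | future-video | action]` and the exact attention rules below are illustrative assumptions, not the paper's published scheme:

```python
import numpy as np

def build_train_mask(n_obs: int, n_vid: int, n_act: int) -> np.ndarray:
    """Toy block attention mask for joint training.

    Assumed token layout (illustrative, not Fast-WAM's exact scheme):
        [observation | future-video | action]
    mask[q, k] = True means query token q may attend to key token k.
    """
    n = n_obs + n_vid + n_act
    mask = np.zeros((n, n), dtype=bool)
    obs = slice(0, n_obs)
    vid = slice(n_obs, n_obs + n_vid)
    act = slice(n_obs + n_vid, n)
    mask[obs, obs] = True  # observation tokens attend among themselves
    mask[vid, obs] = True  # the video branch is grounded in the observation
    mask[vid, vid] = True
    mask[act, obs] = True  # the action branch reads the shared backbone
    mask[act, act] = True
    # action tokens never attend to future-video tokens, so the video
    # block can be dropped wholesale at inference without a train/test gap
    return mask

def build_infer_mask(n_obs: int, n_act: int) -> np.ndarray:
    """At test time the future-video block is simply absent."""
    return build_train_mask(n_obs, 0, n_act)
```

Because the action tokens never depend on the video tokens under this masking, removing the video block at inference changes nothing the action branch ever sees.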

Fast-WAM is built on a pretrained video Diffusion Transformer backbone and an action expert DiT. During training, it jointly learns action prediction and video modeling so the shared visual backbone acquires stronger world-grounded representations.
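The joint objective can be sketched as a weighted sum of an action-imitation term and a video-modeling term that share one backbone. All module names, sizes, and the loss form below are illustrative stand-ins, not the Fast-WAM implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyWAM(nn.Module):
    """Toy stand-in for a shared backbone with video and action heads."""
    def __init__(self, d: int = 32, act_dim: int = 7):
        super().__init__()
        self.backbone = nn.Linear(d, d)           # stands in for the video DiT
        self.video_head = nn.Linear(d, d)         # video-modeling branch
        self.action_head = nn.Linear(d, act_dim)  # stands in for the action expert

    def forward(self, obs_latents: torch.Tensor):
        h = self.backbone(obs_latents)
        return self.video_head(h), self.action_head(h)

def joint_loss(model, obs, target_video, target_action, w_video: float = 1.0):
    """Video co-training: both terms backpropagate through the shared backbone,
    so video supervision shapes the representation the action head consumes."""
    video_pred, action_pred = model(obs)
    return (F.mse_loss(action_pred, target_action)
            + w_video * F.mse_loss(video_pred, target_video))
```

The point of the sketch is the gradient path: even if the video head is never used at test time, its loss still trains the backbone that the action head reads.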

At inference time, Fast-WAM keeps only the clean latent tokens of the current observation, processes them with the video backbone once, and directly generates actions without explicit future video denoising. This removes the main runtime bottleneck of imagine-then-execute WAMs.
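The latency argument can be made concrete with a rough cost model, assuming each backbone pass costs about the same whether it denoises a video step or encodes the observation (the step count and per-pass cost are illustrative, not measurements):

```python
def imagine_then_execute_ms(backbone_pass_ms: float, denoise_steps: int) -> float:
    """Hypothetical latency model: K iterative video-denoising passes,
    then one action-decoding pass."""
    return denoise_steps * backbone_pass_ms + backbone_pass_ms

def fast_wam_ms(backbone_pass_ms: float) -> float:
    """Fast-WAM's single forward pass over clean observation latents."""
    return backbone_pass_ms
```

Under this toy model, even a modest 4-step denoiser costs five backbone passes per control step, which is where the reported >4x speedup plausibly comes from.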

Backbone: Wan2.2-5B video DiT with a 1B action expert
Runtime: 190 ms on a single RTX 5090D V2 32GB GPU

Results on Simulation and Real-World Tasks

We evaluate Fast-WAM on LIBERO, RoboTwin 2.0, and a real-world towel-folding task. Across all settings, the central pattern is consistent: Fast-WAM remains close to imagine-then-execute variants, while removing video co-training causes a much larger performance drop.

Real-world towel-folding benchmark on Galaxea R1 Lite.

Real-world towel-folding benchmark. Folding a deformable object requires long-horizon planning and precise closed-loop manipulation, making it a useful testbed for both success and execution efficiency.

RoboTwin 2.0

| Method | Embodied PT. | Clean | Rand. | Average |
|---|---|---|---|---|
| π0 | | 65.92 | 58.40 | 62.2 |
| π0.5 | | 82.74 | 76.76 | 79.8 |
| Motus | | 88.66 | 87.02 | 87.8 |
| Motus from WAN2.2 | | 77.56 | 77.00 | 77.3 |
| LingBot-VA | | 92.90 | 91.50 | 92.2 |
| LingBot-VA from WAN2.2 | | 80.60 | -- | 80.6 |
| Fast-WAM (Ours) | | 91.88 | 91.78 | 91.8 |

Fast-WAM Variants

| Method | Clean | Rand. | Average |
|---|---|---|---|
| Fast-WAM | 91.88 | 91.78 | 91.8 |
| Fast-WAM-Joint | 90.84 | 90.32 | 90.6 |
| Fast-WAM-IDM | 91.16 | 91.34 | 91.3 |
| Fast-WAM w/o video co-train | 82.76 | 84.80 | 83.8 |

LIBERO

| Method | Embodied PT. | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|---|
| OpenVLA | | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| π0 | | 96.8 | 98.8 | 95.8 | 85.2 | 94.1 |
| π0.5 | | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| LingBot-VA | | 98.5 | 99.6 | 97.2 | 98.5 | 98.5 |
| Motus | | 96.8 | 99.8 | 96.6 | 97.6 | 97.7 |
| Fast-WAM (Ours) | | 98.2 | 100.0 | 97.0 | 95.2 | 97.6 |

Fast-WAM Variants

| Method | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|
| Fast-WAM | 98.2 | 100.0 | 97.0 | 95.2 | 97.6 |
| Fast-WAM-Joint | 99.6 | 99.4 | 98.2 | 96.8 | 98.5 |
| Fast-WAM-IDM | 98.8 | 97.8 | 97.8 | 97.6 | 98.0 |
| Fast-WAM w/o video co-train | 89.2 | 99.2 | 95.4 | 90.0 | 93.5 |
Fast-WAM real-world performance and latency.

Fast-WAM achieves strong real-world performance with substantially lower latency than imagine-then-execute baselines. Fast-WAM-IDM is slower at 810 ms, while Fast-WAM keeps latency at 190 ms.

Removing video co-training causes a dramatic degradation in both success and completion time, again reinforcing that co-training matters more than explicit future imagination at inference.

Video Co-Training Matters More Than Test-Time Imagination in WAMs

Fast-WAM revisits a basic question in World Action Models: do their gains come from explicit future imagination at test time, or from video modeling during training? Across simulation and real-world results, the answer appears consistent: Fast-WAM remains close to imagine-then-execute variants, while removing video co-training causes a much larger degradation.

This suggests the main value of video prediction in WAMs may lie in learning better world representations during training rather than generating future observations at test time.

BibTeX

@misc{yuan2026fastwam,
  title={Fast-WAM: Do World Action Models Need Test-time Future Imagination?},
  author={Tianyuan Yuan and Zibin Dong and Yicheng Liu and Hang Zhao},
  year={2026},
  note={arXiv preprint arXiv:2603.16666}
}