World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA)
models for embodied control because they explicitly model how visual observations evolve under
actions. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time
latency from iterative video denoising, yet it remains unclear whether explicit future imagination is
actually necessary for strong action performance.
We ask whether WAMs need explicit future imagination at test time, or whether their benefit comes
primarily from video modeling during training. To answer this, we propose Fast-WAM, a WAM
architecture that retains video co-training but skips future prediction at test time. In
controlled comparisons, Fast-WAM remains competitive with imagine-then-execute baselines, while
removing video co-training causes a much larger performance drop.
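The training-versus-inference asymmetry can be sketched in a toy form: a shared trunk feeds both a video-prediction head (used only as a training signal) and an action head (the only path run at test time). All names and shapes here (`trunk`, `video_head`, `action_head`, the MSE losses) are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)


class ToyWAM:
    """Toy stand-in for a WAM with video co-training (hypothetical, not the
    paper's design): one shared trunk, two heads."""

    def __init__(self, obs_dim=8, act_dim=4, hidden=16):
        self.W_trunk = rng.normal(size=(obs_dim, hidden)) * 0.1
        self.W_video = rng.normal(size=(hidden, obs_dim)) * 0.1
        self.W_action = rng.normal(size=(hidden, act_dim)) * 0.1

    def trunk(self, obs):
        # Shared representation used by both heads.
        return np.tanh(obs @ self.W_trunk)

    def training_loss(self, obs, next_obs, expert_action):
        h = self.trunk(obs)
        # Video co-training term: predict the next observation (train-time only).
        video_loss = np.mean((h @ self.W_video - next_obs) ** 2)
        # Action imitation term.
        action_loss = np.mean((h @ self.W_action - expert_action) ** 2)
        return video_loss + action_loss

    def act(self, obs):
        # Test time: decode the action directly from the trunk;
        # no iterative video denoising, hence the latency savings.
        return self.trunk(obs) @ self.W_action


model = ToyWAM()
obs = rng.normal(size=8)
loss = model.training_loss(obs, rng.normal(size=8), rng.normal(size=4))
action = model.act(obs)
```

The point of the sketch is that the video head shapes the shared trunk during training but is simply never called at inference, so dropping it changes latency without changing the learned representation.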
Empirically, Fast-WAM achieves competitive results on LIBERO, RoboTwin, and real-world towel folding
without embodied pretraining. It runs in real time at 190 ms latency, more than
4× faster than existing imagine-then-execute WAM designs.