Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but the - Arthub.ai



Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation. However, their potential has not been fully explored in the literature. To bridge this gap, we introduce DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective with decoupled timesteps and noise scales for video prediction, hidden-state extraction, and action inference, enabling coherent joint training of both modules.


{ "seed": "1695902905", "width": 1024, "height": 1024, "version": "SH_JuggernautXL", "sampler_name": "k_dpmpp_sde", "cfg_scale": 5, "steps": 20 }

Created on: 3/27/2026, 3:32:43 AM