Search Results for an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames - Arthub.ai

Generate AI Images

Search Results for an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames

Explore AI generated designs, images, art and prompts by top community artists and designers.

Fantasy Characters Low Poly Sci Fi Furniture

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning , but their representations are still largely inherited from static image-text pretraining , leaving physical dynamics to be learned from comparatively limited action data. Generative video models , by contrast , encode rich spatiotemporal structure and implicit physics , making them a compelling foundation for robotic manipulation. But their potentials are not fully explored in the literature. To bridge the gap , we introduce DiT4DiT , an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames , DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective with decoupled timesteps and noise scales for video prediction , hidden-state extraction , and action inference , enabling coherent joint training of both modules. ,

-1