Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data during training, represented as optical flow and rendered with computer graphics pipelines. This approach offers two key advantages. First, synthetic motion provides diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and a motion-guided video generator then produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving the visual realism of real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves realism and controllability in both dynamic motion generation and camera motion control.
Previous video diffusion models (e.g., CogVideoX, Wan, GEN3C, FloVD) struggle to synthesize realistic videos
with highly dynamic motions or extreme camera trajectories, such as breakdancing or 180-degree camera rotation.
In contrast, DynaVid synthesizes videos with improved realism and controllability in these challenging scenarios.
Commonly used training datasets contain very few samples with highly dynamic motion. As a result, models trained on such restricted data often fail to generate or control highly dynamic motion. This motivates the need for new datasets.
Synthetic rendered video
Synthetic rendered motion (optical flow)
We leverage synthetic motion data, represented as optical flow and rendered using computer graphics pipelines, during training. This approach offers two key advantages. First, synthetic motion provides diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos.
Overview of our dataset generation pipeline based on the Cycles renderer in Blender. We use it to synthesize the DynaVid-Human and DynaVid-Camera datasets. The pipeline consists of three stages: 3D scene construction, camera trajectory definition, and optical flow rendering.
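The pipeline above renders ground-truth optical flow for constructed 3D scenes with Blender's Cycles renderer. As an illustration of what camera-induced flow encodes (this is not the paper's Blender pipeline; the function and variable names are our assumptions), the sketch below computes the flow field produced by a known relative camera pose and a per-pixel depth map:

```python
import numpy as np

def flow_from_camera_motion(depth, K, R_rel, t_rel):
    """Optical flow induced purely by camera motion, given per-pixel depth.

    depth: (H, W) depth of frame 1; K: 3x3 intrinsics (shared by both frames);
    R_rel, t_rel: relative pose mapping frame-1 camera coordinates X1
    to frame-2 camera coordinates via X2 = R_rel @ X1 + t_rel.
    Returns an (H, W, 2) flow field (du, dv).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project pixels into frame-1 camera space using the depth map
    X1 = (pix @ np.linalg.inv(K).T) * depth[..., None]
    # Transform points into frame-2 camera space and project with K
    X2 = X1 @ R_rel.T + t_rel
    p2 = X2 @ K.T
    p2 = p2[..., :2] / p2[..., 2:3]
    # Flow is the displacement of each pixel between the two frames
    return p2 - pix[..., :2]
```

For a pure sideways translation over constant depth, every pixel shifts by the same amount (du = fx * tx / Z), matching the intuition that rendered flow depends only on geometry and motion, never on appearance.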
Overview of DynaVid. (a) A motion generator first synthesizes motion, and a motion-guided video generator then produces video frames conditioned on the generated motion. For camera-controlled video synthesis, Plücker embeddings are provided as additional input. (b) Our framework adopts VACE to incorporate control signals such as Plücker embeddings or optical flow maps.
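The Plücker embeddings mentioned in the caption are a standard per-pixel encoding of camera rays: each pixel's ray is represented by its unit direction d and moment o × d, where o is the camera center. A minimal NumPy sketch under a pinhole camera model (the conventions and names here are our assumptions, not the paper's code):

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plücker ray embedding (d, o x d) in R^6.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation,
    so a world point X projects as K @ (R @ X + t).
    Returns an (H, W, 6) embedding map.
    """
    # Camera center in world coordinates: o = -R^T t
    o = -R.T @ t
    # Pixel grid sampled at pixel centers
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    # Back-project to world-space ray directions: R^T (K^{-1} p), then normalize
    dirs = pix @ np.linalg.inv(K).T @ R
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Moment m = o x d is independent of which point on the ray is used
    moments = np.cross(np.broadcast_to(o, dirs.shape), dirs)
    return np.concatenate([dirs, moments], axis=-1)
```

Because the embedding depends only on camera pose and intrinsics, it gives the video generator a dense, appearance-free description of the camera trajectory, analogous to how optical flow describes scene motion.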
Capability to synthesize non-human dynamic motions.
Failure cases in generating multiple people and in interactions with the surrounding environment.