DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

¹Fudan University ²Alibaba Group
³Nanyang Technological University ⁴Michigan State University
*Project Leader †Corresponding Author

Abstract

Recent advances in customized video generation have enabled users to create videos tailored to both specific subjects and motion trajectories. However, existing methods often require complicated test-time fine-tuning and struggle with balancing subject learning and motion control, limiting their real-world applications. In this paper, we present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory, guided by a single image and a bounding box sequence, respectively, and without the need for test-time fine-tuning. Specifically, we introduce reference attention, which leverages the model's inherent capabilities for subject learning, and devise a mask-guided motion module to achieve precise motion control by fully utilizing the robust motion signal of box masks derived from bounding boxes. While these two components achieve their intended functions, we empirically observe that motion control tends to dominate over subject learning. To address this, we propose two key designs: 1) the masked reference attention, which integrates a blended latent mask modeling scheme into reference attention to enhance subject representations at the desired positions, and 2) a reweighted diffusion loss, which differentiates the contributions of regions inside and outside the bounding boxes to ensure a balance between subject and motion control. Extensive experimental results on a newly curated dataset demonstrate that DreamVideo-2 outperforms state-of-the-art methods in both subject customization and motion control. The dataset, code, and models will be made publicly available.
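The idea of a loss that treats regions inside and outside the bounding boxes differently can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name, the tensor layout, and the weight values `inside_weight` and `outside_weight` are assumptions chosen only for illustration.

```python
import numpy as np

def reweighted_diffusion_loss(noise_pred, noise, box_mask,
                              inside_weight=2.0, outside_weight=1.0):
    """Weighted mean squared error between predicted and true noise.

    noise_pred, noise: float arrays of shape (frames, height, width, channels).
    box_mask: binary array of shape (frames, height, width), 1 inside the box.
    inside_weight, outside_weight: hypothetical values, not from the paper.
    """
    # Add a channel axis so the mask broadcasts over channels.
    mask = box_mask[..., None].astype(noise.dtype)
    # Per-pixel weight: one value inside the box, another outside.
    weights = inside_weight * mask + outside_weight * (1.0 - mask)
    squared_error = (noise_pred - noise) ** 2
    return float(np.mean(weights * squared_error))

# Toy example: one 2x2 frame, one channel, box covering a single pixel.
noise = np.zeros((1, 2, 2, 1))
noise_pred = np.ones((1, 2, 2, 1))
box_mask = np.array([[[1, 0], [0, 0]]])
loss = reweighted_diffusion_loss(noise_pred, noise, box_mask)
print(loss)  # 1.25, since the weights are (2 + 1 + 1 + 1) / 4
```

Setting both weights to the same value recovers the ordinary mean squared error, which makes the effect of the reweighting easy to isolate in an experiment.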

Overall Framework of DreamVideo-2

Overall framework of DreamVideo-2. During training, a random video frame is segmented to obtain the subject image with a blank background. The bounding boxes extracted from the training video are converted into binary box masks. Then, the subject image is treated as a single-frame video and processed in parallel with the video by masked reference attention that incorporates blended masks to learn the subject appearance. Meanwhile, box masks are fed into a motion module that includes a spatiotemporal encoder and a ControlNet for motion control. Both the masked reference attention and motion module are trained using a reweighted diffusion loss.
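The conversion from a bounding box sequence to binary box masks can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: the function name `boxes_to_masks` is hypothetical, and boxes are assumed to be given per frame as pixel coordinates `(x0, y0, x1, y1)` with an exclusive right and bottom edge.

```python
import numpy as np

def boxes_to_masks(boxes, height, width):
    """Convert one box per frame into a stack of binary masks.

    boxes: sequence of (x0, y0, x1, y1) pixel coordinates (assumed format).
    Returns a uint8 array of shape (num_frames, height, width).
    """
    masks = np.zeros((len(boxes), height, width), dtype=np.uint8)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        # Clamp the box to the frame so out-of-range boxes are handled safely.
        x0, x1 = max(0, int(x0)), min(width, int(x1))
        y0, y1 = max(0, int(y0)), min(height, int(y1))
        if x1 > x0 and y1 > y0:
            masks[i, y0:y1, x0:x1] = 1
    return masks

# A 2x2 box moving diagonally across three 4x4 frames.
boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (2, 2, 4, 4)]
masks = boxes_to_masks(boxes, height=4, width=4)
print(masks.shape)             # (3, 4, 4)
print(masks.sum(axis=(1, 2)))  # [4 4 4]
```

In a latent diffusion setting the masks would typically also be resized to the latent resolution before use; that step is omitted here because the page does not describe it.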

Joint Subject Customization and Motion Trajectory Control

[Video gallery: each example pairs a subject image and a motion trajectory with the generated video for the prompt shown.]

"A corgi is surfing on a surfboard, cherry blossoms sway in the breeze"
"A red toy is dancing in the room"
"A cat is eating pizza"
"A plush toy sloth is walking on the beach"
"A corgi is swimming"
"A cat is walking under Eiffel Tower"
"A cartoon is walking on the snow"
"A fish is swimming underwater"
"A dog is running on the grass"
"A cat is walking on Mars"
"A fish is swimming underwater"
"A dog is running on the snow"
"A car is moving on the road"
"A cat is skateboarding in the park"

Qualitative Comparison of Joint Subject Customization and Motion Control

[Video comparison: for each subject image and motion trajectory, results from DreamVideo, MotionBooth, and DreamVideo-2 are shown for the prompt.]

"A cat is skateboarding on the road"
"A plush toy wolf is walking in the forest"

Qualitative Comparison of Subject Customization

[Video comparison: for each subject image, results from VideoBooth, DreamVideo, and DreamVideo-2 are shown for the prompt.]

"A duck toy is moving on the road"
"A fish is swimming"

Qualitative Comparison of Motion Trajectory Control

[Video comparison: for each motion trajectory, results from MotionCtrl, Peekaboo, Direct-a-Video, and DreamVideo-2 are shown for the prompt.]

"A dog is walking among the flowers"
"A cat is surfing on a surfboard"

Reference

@article{wei2024DreamVideo2,
  title={DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control},
  author={Wei, Yujie and Zhang, Shiwei and Yuan, Hangjie and Wang, Xiang and Qiu, Haonan and Zhao, Rui and Feng, Yutong and Liu, Feng and Huang, Zhizhong and Ye, Jiaxin and Zhang, Yingya and Shan, Hongming},
  journal={arXiv preprint arXiv:2410.13830},
  year={2024}
}