Recent advancements in video diffusion models demonstrate remarkable capabilities in simulating real-world dynamics and maintaining 3D consistency. This progress motivates us to explore the potential of these models to maintain dynamic consistency across diverse viewpoints, a feature highly sought after in applications like virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating six degrees of freedom (6 DoF) camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module designed to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we also propose a progressive training scheme that leverages multi-camera images and monocular videos as a supplement to Unreal Engine-rendered multi-camera videos. Training on this mixture of data sources substantially benefits our model despite the scarcity of real multi-camera videos. Experimental results demonstrate the superiority of our proposed method over existing competitors and several baselines. Furthermore, our method enables intriguing extensions, such as re-rendering a video from multiple novel viewpoints.
To synthesize multi-camera synchronized videos on top of the pre-trained text-to-video model, two new components are introduced: a camera encoder, which projects the relative camera extrinsic parameters into an embedding space, and an inter-view synchronization module, plugged into each Transformer block, which modulates inter-view features under the guidance of the inter-camera relationship. Only the new components are trainable, while the pre-trained text-to-video model remains frozen.
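Below is a minimal PyTorch sketch of how these two components could be wired up; the MLP over the flattened 3×4 relative extrinsics, the per-token cross-view attention, and the zero-initialized output projection are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn


class CameraEncoder(nn.Module):
    """Projects relative camera extrinsics (one 3x4 [R|t] per view) into the
    feature embedding space. Layer sizes are illustrative assumptions."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(12, embed_dim),  # flattened 3x4 extrinsic matrix
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, extrinsics: torch.Tensor) -> torch.Tensor:
        # extrinsics: (batch, views, 3, 4), expressed relative to a reference view
        b, v = extrinsics.shape[:2]
        return self.mlp(extrinsics.reshape(b, v, 12))  # (batch, views, embed_dim)


class InterViewSyncModule(nn.Module):
    """Cross-view attention block plugged in after each frozen Transformer block.
    It attends across the view axis, guided by the camera embeddings, so that
    appearance and geometry stay consistent between viewpoints (a simplification)."""

    def __init__(self, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.proj_out = nn.Linear(embed_dim, embed_dim)
        # Zero-init so the module is an identity mapping at the start of training
        # and the frozen text-to-video model's behavior is preserved.
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, x: torch.Tensor, cam_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, dim) spatial tokens per view
        # cam_emb: (batch, views, dim) from CameraEncoder
        b, v, n, d = x.shape
        h = self.norm(x + cam_emb[:, :, None, :])        # inject camera pose per view
        h = h.permute(0, 2, 1, 3).reshape(b * n, v, d)   # attend across views for each token
        h, _ = self.attn(h, h, h)
        h = h.reshape(b, n, v, d).permute(0, 2, 1, 3)
        return x + self.proj_out(h)                      # residual connection
```

In this sketch, only `CameraEncoder` and `InterViewSyncModule` would receive gradients; the surrounding Transformer blocks stay frozen, matching the training setup described above.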
To the best of our knowledge, multi-view real-world video generation has not been explored in previous work. We therefore establish baselines by first extracting the first frame of each view generated by SynCamMaster and then feeding them into 1) an image-to-video (I2V) generation method, i.e., SVD-XT [1], and 2) state-of-the-art single-video camera-control approaches, MotionCtrl [2] (based on SVD) and CameraCtrl [3] (based on SVD-XT). In addition, we train an I2V generation model on the same T2V model used by SynCamMaster, denoted as 'I2V-Ours'.
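Concretely, constructing these baselines is a small loop: take the first frame of every view generated by SynCamMaster and let each image-to-video model extend it into a clip. The sketch below is an illustrative assumption about the data layout; the baseline models (SVD-XT, MotionCtrl, CameraCtrl, I2V-Ours) are abstracted as user-supplied callables rather than their actual inference APIs.

```python
from typing import Callable, Dict

import torch


@torch.no_grad()
def build_baseline_videos(
    syncam_videos: torch.Tensor,  # (views, frames, C, H, W) generated by SynCamMaster
    baselines: Dict[str, Callable[[torch.Tensor], torch.Tensor]],  # name -> I2V function
) -> Dict[str, torch.Tensor]:
    """For each baseline, use the first frame of every SynCamMaster view as the
    reference image and let the baseline extend it into a video of its own."""
    ref_frames = syncam_videos[:, 0]  # (views, C, H, W): first frame of each view
    return {
        name: torch.stack([i2v(frame) for frame in ref_frames])  # (views, frames, C, H, W)
        for name, i2v in baselines.items()
    }
```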
[Figure: generated results, each example shown from Viewpoint 1 and Viewpoint 2]
Prompt: "An elephant wearing a colorful birthday hat is walking along the sandy beach."
Prompt: "A blue bus drives across the iconic Tower Bridge in London."
Since the baseline methods cannot themselves generate videos from arbitrary viewpoints, we use SynCamMaster to synthesize their reference images.
[Figure: Difference in Distance] Prompt: "A sleek orange sports car on a long, straight track, its tires screeching as it accelerates, leaving behind a trail of smoke and dust."
[Figure: Difference in Distance] Prompt: "A young and beautiful girl dressed in a pink dress, playing a grand piano."
Reconstruction with the 4-view videos synthesized by SynCamMaster
4DGS [4] Input: Multi-Camera Videos Synthesized by SynCamMaster
"A young woman stands in the middle of a city street, joyfully raising her arms, suggesting she is dancing or celebrating."
4DGS Output: Rendered Video
Note: As reported in the 4DGS [4] paper, 4DGS struggles to optimize scenes with large-scale motion. We therefore selected cases with a limited dynamic range for 4D reconstruction; in this case, the woman slowly rises from a semi-squat position.
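For readers who want to reproduce this extension, the main preparation step is exporting the synthesized four-view videos together with the camera parameters used for generation into a multi-camera dataset that 4DGS can optimize. The sketch below writes per-camera frame folders and a `cameras.json`; the folder layout, the JSON schema, and the use of `imageio` are assumptions for illustration, not the exact input format expected by the 4DGS codebase.

```python
import json
from pathlib import Path

import imageio.v3 as iio  # assumption: imageio is available for writing PNG frames
import numpy as np


def export_multicam_scene(videos, intrinsics, extrinsics, out_dir="syncam_scene"):
    """videos: list of (frames, H, W, 3) uint8 arrays, one per synthesized view.
    intrinsics: (views, 3, 3) camera intrinsics; extrinsics: (views, 3, 4) world-to-camera.
    Writes cam00/, cam01/, ... frame folders plus a cameras.json (hypothetical layout)."""
    out = Path(out_dir)
    cameras = []
    for v, video in enumerate(videos):
        frame_dir = out / f"cam{v:02d}"
        frame_dir.mkdir(parents=True, exist_ok=True)
        for t, frame in enumerate(video):
            iio.imwrite(frame_dir / f"{t:04d}.png", frame)  # one PNG per time step
        cameras.append({
            "camera_id": v,
            "K": np.asarray(intrinsics[v]).tolist(),
            "w2c": np.asarray(extrinsics[v]).tolist(),
        })
    (out / "cameras.json").write_text(json.dumps(cameras, indent=2))
```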
Reconstruction with the 4-view videos (ground truth) from the Plenoptic Dataset
4DGS Input: Multi-Camera Videos in the Plenoptic Dataset
4DGS Output: Rendered Video
[Figure: two examples, each showing the Input Video | Synthesized with random seed 1 | Synthesized with random seed 2]
[Figure: two examples, each showing the Input Video Vi | Synthesized Video Vo | Synthesized Video Vi']
[1] Blattmann, Andreas, et al. "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." arXiv preprint arXiv:2311.15127 (2023).
[2] Wang, Zhouxia, et al. "MotionCtrl: A Unified and Flexible Motion Controller for Video Generation." ACM SIGGRAPH 2024 Conference Papers, 2024.
[3] He, Hao, et al. "CameraCtrl: Enabling Camera Control for Text-to-Video Generation." arXiv preprint arXiv:2404.02101 (2024).
[4] Wu, Guanjun, et al. "4D Gaussian Splatting for Real-Time Dynamic Scene Rendering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.