Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

TL;DR: We propose SynCamMaster, an efficient method that lifts pre-trained text-to-video models to open-domain multi-camera video generation from diverse viewpoints.


Demos

Text prompt:
Row 1: A hungry man enthusiastically devouring a steaming plate of spaghetti.
Row 2: A chef is expertly chopping onions in a well-equipped kitchen.
Row 3: A young and beautiful girl dressed in a pink dress, playing a grand piano.
Row 4: An elephant wearing a colorful birthday hat is walking along the sandy beach.

Cameras with 30° Difference in Azimuth


Cameras with Difference in Distance


Cameras with 15° Difference in Elevation


Cameras with 20° Difference in Azimuth and 10° Difference in Elevation
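
The viewpoint offsets above fully determine each camera's 6 DoF pose. As a reference point, here is a small NumPy sketch of the standard look-at construction that maps (azimuth, elevation, distance) to a 3x4 extrinsic; the y-up world convention and the origin-centered target are our assumptions, not necessarily the renderer's:

    import numpy as np

    def lookat_extrinsic(azimuth_deg, elevation_deg, distance):
        """Map spherical viewpoint parameters to a world-to-camera [R|t].
        Conventions (y-up world, camera aimed at the origin) are assumptions."""
        az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
        # Camera position on a sphere around the scene origin.
        pos = distance * np.array([np.cos(el) * np.sin(az),
                                   np.sin(el),
                                   np.cos(el) * np.cos(az)])
        forward = -pos / np.linalg.norm(pos)              # look at the origin
        right = np.cross(forward, np.array([0.0, 1.0, 0.0]))
        right /= np.linalg.norm(right)
        up = np.cross(right, forward)
        R = np.stack([right, up, -forward])               # rows: camera axes
        t = -R @ pos
        return np.hstack([R, t[:, None]])                 # 3x4 extrinsic

    # Two cameras 30° apart in azimuth, as in the first demo setting above:
    E0 = lookat_extrinsic(0.0, 10.0, 4.0)
    E1 = lookat_extrinsic(30.0, 10.0, 4.0)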



More Results

Abstract

Recent advancements in video diffusion models demonstrate remarkable capabilities in simulating real-world dynamics and 3D consistency. This progress motivates us to explore the potential of these models to maintain dynamic consistency across diverse viewpoints, a feature highly sought after in applications like virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating six degrees of freedom (6 DoF) camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module designed to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we also propose a progressive training scheme that leverages multi-camera images and monocular videos as a supplement to Unreal Engine-rendered multi-camera videos. This combination of data sources significantly benefits our model. Experimental results demonstrate the superiority of our proposed method over existing competitors and several baselines. Furthermore, our method enables intriguing extensions, such as re-rendering a video from multiple novel viewpoints.
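
As a concrete illustration of the progressive scheme described above, here is a minimal sketch of one way the three data sources could be mixed over training; the stage boundaries and sampling ratios are purely illustrative assumptions, not the values used in the paper:

    import random

    # The three data sources named in the abstract; the staging and ratios
    # below are illustrative assumptions, not the paper's actual schedule.
    SOURCES = ["ue_multicam_videos", "multicam_images", "monocular_videos"]

    def sample_source(step, total_steps):
        """Shift the batch mixture from plentiful single-view data toward
        scarce rendered multi-camera videos as training progresses."""
        progress = step / total_steps
        if progress < 0.3:
            weights = [0.2, 0.5, 0.3]   # early: mostly cheap supplemental data
        elif progress < 0.7:
            weights = [0.5, 0.3, 0.2]
        else:
            weights = [0.8, 0.1, 0.1]   # late: mostly multi-camera videos
        return random.choices(SOURCES, weights=weights, k=1)[0]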

Method

To synthesize multi-camera synchronized videos on top of a pre-trained text-to-video model, we introduce two new components: a camera encoder, which projects the relative camera extrinsic parameters into an embedding space, and an inter-view synchronization module, plugged into each Transformer block, which modulates inter-view features under the guidance of the inter-camera relationships. Only the new components are trainable; the pre-trained text-to-video model remains frozen.
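
A minimal PyTorch sketch of the two added components follows. The tensor shapes, layer sizes, and the use of plain self-attention across the view axis are our own illustrative assumptions rather than the exact published design:

    import torch
    import torch.nn as nn

    class CameraEncoder(nn.Module):
        """Projects relative camera extrinsics (3x4 = 12 values per view)
        into the Transformer's embedding space. Sizes are assumptions."""
        def __init__(self, hidden_dim=1024):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(12, hidden_dim),
                nn.SiLU(),
                nn.Linear(hidden_dim, hidden_dim),
            )

        def forward(self, extrinsics):
            # extrinsics: (batch, views, 3, 4) relative [R|t] matrices
            b, v = extrinsics.shape[:2]
            return self.mlp(extrinsics.reshape(b, v, 12))   # (b, v, dim)

    class InterViewSyncModule(nn.Module):
        """Modulates features across the view axis, guided by the camera
        embeddings; one such module follows each frozen Transformer block."""
        def __init__(self, hidden_dim=1024, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                              batch_first=True)
            self.norm = nn.LayerNorm(hidden_dim)

        def forward(self, x, cam_emb):
            # x: (batch, views, tokens, dim); cam_emb: (batch, views, dim)
            b, v, t, d = x.shape
            h = x + cam_emb[:, :, None, :]                  # inject camera pose
            h = h.permute(0, 2, 1, 3).reshape(b * t, v, d)  # attend over views
            out, _ = self.attn(h, h, h)
            out = out.reshape(b, t, v, d).permute(0, 2, 1, 3)
            return x + self.norm(out)                       # residual update

In this sketch, only CameraEncoder and InterViewSyncModule would receive gradients; the base text-to-video weights stay frozen throughout.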


Comparisons



Results with Out-of-Distribution Camera Settings

[ Difference in Distance ]

"A sleek orange sports car on a long, straight track, its tires screeching as it accelerates, leaving behind a trail of smoke and dust."


[ Difference in Distance ]

"A young and beautiful girl dressed in a pink dress, playing a grand piano."

Results on 4D Reconstruction

Reconstruction with the 4-view videos synthesized by SynCamMaster


4DGS [4] Input: Multi-Camera Videos Synthesized by SynCamMaster

"A young woman stands in the middle of a city street, joyfully raising her arms, suggesting she is dancing or celebrating."


4DGS Output: Rendered Video

Note: As noted in the 4DGS [4] paper, 4DGS struggles to optimize videos with large-scale motion. Therefore, we selected cases with limited dynamic range for 4D reconstruction. In this case, the woman is slowly rising from a semi-squat position.


Reconstruction with the 4-view videos (ground truth) from the Plenoptic Dataset


4DGS Input: Multi-Camera Videos in the Plenoptic Dataset


4DGS Output: Rendered Video





Additional Results for Rebuttal

Figure A:

                  Input Video                      Synthesized with random seed 1     Synthesized with random seed 2


                  Input Video                      Synthesized with random seed 1     Synthesized with random seed 2



Figure B:

    Input Video Vi                              Synthesized Video Vo                           Synthesized Video Vi'


    Input Video Vi                              Synthesized Video Vo                           Synthesized Video Vi'

[1] Blattmann, Andreas, et al. "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." arXiv preprint arXiv:2311.15127 (2023).

[2] Wang, Zhouxia, et al. "MotionCtrl: A Unified and Flexible Motion Controller for Video Generation." ACM SIGGRAPH 2024 Conference Papers, 2024.

[3] He, Hao, et al. "CameraCtrl: Enabling Camera Control for Text-to-Video Generation." arXiv preprint arXiv:2404.02101 (2024).

[4] Wu, Guanjun, et al. "4D Gaussian Splatting for Real-Time Dynamic Scene Rendering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.