Recent advancements in video diffusion models demonstrate remarkable capabilities in simulating real-world dynamics and maintaining 3D consistency. This progress motivates us to explore the potential of these models to maintain dynamic consistency across diverse viewpoints, a feature highly sought after in applications like virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating six degrees of freedom (6 DoF) camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module designed to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we also propose a progressive training scheme that leverages multi-camera images and monocular videos as a supplement to Unreal Engine-rendered multi-camera videos. Training on this mixture of data sources substantially benefits our model despite the scarcity of real multi-camera videos. Experimental results demonstrate the superiority of our proposed method over existing competitors and several baselines. Furthermore, our method enables intriguing extensions, such as re-rendering a video from multiple novel viewpoints.
To synthesize multi-camera synchronized videos on top of the pre-trained text-to-video model, two new components are introduced: a camera encoder, which projects the relative camera extrinsic parameters into an embedding space, and an inter-view synchronization module, plugged into each Transformer block, which modulates inter-view features under the guidance of the inter-camera relationship. Only the new components are trainable, while the pre-trained text-to-video model remains frozen.
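Below is a minimal PyTorch sketch of how these two components could be wired up; the MLP over the flattened 3×4 relative extrinsics, the per-token cross-view attention, and the zero-initialized output projection are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn


class CameraEncoder(nn.Module):
    """Projects relative camera extrinsics (one 3x4 [R|t] per view) into the
    feature embedding space. Layer sizes are illustrative assumptions."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(12, embed_dim),  # flattened 3x4 extrinsic matrix
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, extrinsics: torch.Tensor) -> torch.Tensor:
        # extrinsics: (batch, views, 3, 4), expressed relative to a reference view
        b, v = extrinsics.shape[:2]
        return self.mlp(extrinsics.reshape(b, v, 12))  # (batch, views, embed_dim)


class InterViewSyncModule(nn.Module):
    """Cross-view attention block plugged in after each frozen Transformer block.
    It attends across the view axis, guided by the camera embeddings, so that
    appearance and geometry stay consistent between viewpoints (a simplification)."""

    def __init__(self, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.proj_out = nn.Linear(embed_dim, embed_dim)
        # Zero-init so the module is an identity mapping at the start of training
        # and the frozen text-to-video model's behavior is preserved.
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, x: torch.Tensor, cam_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, dim) spatial tokens per view
        # cam_emb: (batch, views, dim) from CameraEncoder
        b, v, n, d = x.shape
        h = self.norm(x + cam_emb[:, :, None, :])        # inject camera pose per view
        h = h.permute(0, 2, 1, 3).reshape(b * n, v, d)   # attend across views for each token
        h, _ = self.attn(h, h, h)
        h = h.reshape(b, n, v, d).permute(0, 2, 1, 3)
        return x + self.proj_out(h)                      # residual connection
```

In this sketch, only `CameraEncoder` and `InterViewSyncModule` would receive gradients; the surrounding Transformer blocks stay frozen, matching the training setup described above.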
To the best of our knowledge, multi-view real-world video generation has not been explored in previous work. We therefore establish baselines by first extracting the first frame of each view generated by SynCamMaster and then feeding them into 1) an image-to-video (I2V) generation method, i.e., SVD-XT [1], and 2) state-of-the-art single-video camera-control approaches, MotionCtrl [2] (based on SVD) and CameraCtrl [3] (based on SVD-XT). In addition, we train an I2V generation model on the same T2V model used by SynCamMaster, denoted as 'I2V-Ours'.
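Concretely, constructing these baselines is a small loop: take the first frame of every view generated by SynCamMaster and let each image-to-video model extend it into a clip. The sketch below is an illustrative assumption about the data layout; the baseline models (SVD-XT, MotionCtrl, CameraCtrl, I2V-Ours) are abstracted as user-supplied callables rather than their actual inference APIs.

```python
from typing import Callable, Dict

import torch


@torch.no_grad()
def build_baseline_videos(
    syncam_videos: torch.Tensor,  # (views, frames, C, H, W) generated by SynCamMaster
    baselines: Dict[str, Callable[[torch.Tensor], torch.Tensor]],  # name -> I2V function
) -> Dict[str, torch.Tensor]:
    """For each baseline, use the first frame of every SynCamMaster view as the
    reference image and let the baseline extend it into a video of its own."""
    ref_frames = syncam_videos[:, 0]  # (views, C, H, W): first frame of each view
    return {
        name: torch.stack([i2v(frame) for frame in ref_frames])  # (views, frames, C, H, W)
        for name, i2v in baselines.items()
    }
```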
[Figure: generated results, each example shown from Viewpoint 1 and Viewpoint 2]
Prompt: "An elephant wearing a colorful birthday hat is walking along the sandy beach."
Prompt: "A blue bus drives across the iconic Tower Bridge in London."
Since the baseline methods cannot themselves generate videos from arbitrary viewpoints, we use SynCamMaster to synthesize their reference images.
[Figure: Difference in Distance] Prompt: "A sleek orange sports car on a long, straight track, its tires screeching as it accelerates, leaving behind a trail of smoke and dust."
[Figure: Difference in Distance] Prompt: "A young and beautiful girl dressed in a pink dress, playing a grand piano."
Reconstruction with the 4-view videos synthesized by SynCamMaster
4DGS [4] Input: Multi-Camera Videos Synthesized by SynCamMaster
"A young woman stands in the middle of a city street, joyfully raising her arms, suggesting she is dancing or celebrating."
4DGS Output: Rendered Video
Note: As reported in the 4DGS [4] paper, 4DGS struggles to optimize scenes with large-scale motion. We therefore selected cases with a limited dynamic range for 4D reconstruction; in this case, the woman slowly rises from a semi-squat position.
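For readers who want to reproduce this extension, the main preparation step is exporting the synthesized four-view videos together with the camera parameters used for generation into a multi-camera dataset that 4DGS can optimize. The sketch below writes per-camera frame folders and a `cameras.json`; the folder layout, the JSON schema, and the use of `imageio` are assumptions for illustration, not the exact input format expected by the 4DGS codebase.

```python
import json
from pathlib import Path

import imageio.v3 as iio  # assumption: imageio is available for writing PNG frames
import numpy as np


def export_multicam_scene(videos, intrinsics, extrinsics, out_dir="syncam_scene"):
    """videos: list of (frames, H, W, 3) uint8 arrays, one per synthesized view.
    intrinsics: (views, 3, 3) camera intrinsics; extrinsics: (views, 3, 4) world-to-camera.
    Writes cam00/, cam01/, ... frame folders plus a cameras.json (hypothetical layout)."""
    out = Path(out_dir)
    cameras = []
    for v, video in enumerate(videos):
        frame_dir = out / f"cam{v:02d}"
        frame_dir.mkdir(parents=True, exist_ok=True)
        for t, frame in enumerate(video):
            iio.imwrite(frame_dir / f"{t:04d}.png", frame)  # one PNG per time step
        cameras.append({
            "camera_id": v,
            "K": np.asarray(intrinsics[v]).tolist(),
            "w2c": np.asarray(extrinsics[v]).tolist(),
        })
    (out / "cameras.json").write_text(json.dumps(cameras, indent=2))
```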
Reconstruction with the 4-view videos (ground truth) from the Plenoptic Dataset
4DGS Input: Multi-Camera Videos in the Plenoptic Dataset
4DGS Output: Rendered Video
[Figure: two examples, each showing the Input Video | Synthesized with random seed 1 | Synthesized with random seed 2]
[Figure: two examples, each showing the Input Video Vi | Synthesized Video Vo | Synthesized Video Vi']
[1] Blattmann, Andreas, et al. "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." arXiv preprint arXiv:2311.15127 (2023).
[2] Wang, Zhouxia, et al. "MotionCtrl: A Unified and Flexible Motion Controller for Video Generation." ACM SIGGRAPH 2024 Conference Papers, 2024.
[3] He, Hao, et al. "CameraCtrl: Enabling Camera Control for Text-to-Video Generation." arXiv preprint arXiv:2404.02101 (2024).
[4] Wu, Guanjun, et al. "4D Gaussian Splatting for Real-Time Dynamic Scene Rendering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.