
As diffusion models come to dominate visual content generation, efforts have been made to adapt them to multi-view image generation for creating 3D content. Traditionally, these methods learn 3D consistency only implicitly by generating RGB frames alone, which can lead to artifacts and training inefficiencies. In contrast, we propose generating Normalized Coordinate Space (NCS) frames alongside RGB frames. NCS frames record each pixel's global coordinate, providing strong pixel correspondence and explicit supervision for 3D consistency. Moreover, by jointly estimating RGB and NCS frames during training, our approach lets us infer their conditional distributions at inference time through an inpainting strategy applied during denoising. For example, given ground-truth RGB frames, we can inpaint the NCS frames and estimate camera poses, enabling camera estimation from unposed images. We train our model on a diverse set of datasets. Through extensive experiments, we demonstrate its capacity to integrate multiple 3D-related tasks into a unified framework, setting a new benchmark for foundational 3D models.
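The conditional inference described above can be illustrated with a minimal sketch of inpainting during denoising: at each reverse-diffusion step, the latents for the known frames (e.g., ground-truth RGB) are overwritten with a forward-noised copy of the ground truth, so the model effectively denoises only the unknown frames (e.g., NCS). The function and parameter names below are hypothetical, and a stand-in `denoise_step` replaces the actual diffusion model; this is a sketch of the general masked-denoising idea, not the paper's implementation.

```python
import numpy as np

def inpaint_denoise(x_known, known_mask, denoise_step, noise_schedule, rng=None):
    """Sketch of inpainting-during-denoising (hypothetical API).

    x_known:        array holding ground-truth values for the known frames
    known_mask:     boolean array, True where frames are observed
    denoise_step:   callable (x, t) -> x performing one reverse step
    noise_schedule: per-step alpha values in (0, 1], indexed by t
    """
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(x_known.shape)  # start from pure noise
    for t in reversed(range(len(noise_schedule))):
        alpha = noise_schedule[t]
        # forward-noise the known frames to the current noise level
        noised_known = (np.sqrt(alpha) * x_known
                        + np.sqrt(1.0 - alpha) * rng.standard_normal(x_known.shape))
        # keep known regions pinned to (noised) ground truth, denoise the rest
        x = np.where(known_mask, noised_known, x)
        x = denoise_step(x, t)
    # finally, restore the exact ground truth in the known regions
    return np.where(known_mask, x_known, x)
```

With the known mask covering the RGB frames, the loop above fills in the NCS frames conditioned on them; swapping the mask conditions RGB generation on given NCS frames instead.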

Figure 1: Pipeline of the proposed World-consistent Video Diffusion Model.
