Edify 3D: Scalable High-Quality 3D Asset Generation

NVIDIA

The creation of high-quality 3D assets is critical for industries like video game design, extended reality, film production, and simulation, where 3D content must meet stringent production standards such as precise mesh structures, high-resolution textures, and material maps. Meeting these standards is time-consuming and requires specialized expertise, a demand that has fueled research into AI-driven 3D asset generation. However, the limited availability of 3D assets for model training poses challenges, highlighting the need for scalable, efficient solutions.

Edify 3D addresses these challenges by generating detailed, production-ready 3D assets within two minutes, producing organized UV maps, 4K textures, and PBR materials. Using multi-view diffusion models and Transformer-based reconstruction, Edify 3D can synthesize high-quality 3D assets from either text prompts or reference images, achieving superior efficiency and scalability.


Results

Edify 3D generates meshes with detailed geometry, sharp textures, and clean albedo colors that represent each surface's base color. Below, we visualize PBR renderings, albedo colors, and surface normals for assets generated from the following prompts:

A full backpack with hanging space tools.
A phonograph made of wood and gold.
An orange factory robot arm.
A knight’s armor on a stand.
A spaceship pilot chair.
Cute isometric house, adobe style, desert tan.

Quad mesh topologies

The generated assets are also quad meshes with adaptive, well-organized topology, which makes them easy to manipulate for editing and rendering and allows them to integrate seamlessly into 3D workflows with high visual fidelity and flexibility.



Application: 3D Scene Generation

We demonstrate an application of Edify 3D to generate complex 3D scenes from simple text prompts. Leveraging Edify 3D as an asset generation API, our system uses LLMs to define scene layouts, object positions, and sizes for coherent, realistic compositions. This enables easily editable 3D scenes suited to applications in artistic design, 3D modeling, and embodied AI simulations.
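As a hedged sketch of this design, the snippet below shows how an LLM-proposed layout could drive per-object asset generation. The layout schema and the generate_asset and place_in_scene helpers are hypothetical stand-ins for illustration, not the actual Edify 3D or scene-generation API.

# Hypothetical layout proposed by an LLM: object prompts, positions, and sizes.
layout = {
    "scene": "a cozy reading corner",
    "objects": [
        {"prompt": "a leather armchair",       "position": [0.0, 0.0, 0.0],  "size": 1.0},
        {"prompt": "a wooden floor lamp",      "position": [0.8, 0.0, 0.2],  "size": 1.4},
        {"prompt": "a small round side table", "position": [0.6, 0.0, -0.3], "size": 0.5},
    ],
}

def build_scene(layout, generate_asset, place_in_scene):
    # generate_asset: one asset-generation call per object prompt (hypothetical).
    # place_in_scene: applies the LLM-proposed position and scale to each asset (hypothetical).
    scene = []
    for obj in layout["objects"]:
        asset = generate_asset(obj["prompt"])
        scene.append(place_in_scene(asset, obj["position"], obj["size"]))
    return scene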


Pipeline

Starting with a text description, a multi-view diffusion model generates RGB images of the specified object from multiple viewpoints. These images serve as input to a multi-view ControlNet, which synthesizes the corresponding surface normals. A reconstruction model then combines these RGB and normal images to predict a neural 3D representation as latent tokens, followed by isosurface extraction and mesh post-processing to create the object's geometry. To enhance texture quality, an upscaling ControlNet conditioned on mesh rasterizations produces high-resolution multi-view RGB images, which are subsequently back-projected onto the texture map.
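As a rough sketch of this flow, the minimal Python function below chains the stages in order. Each stage argument (sample_rgb_views, sample_normal_views, reconstruct_latents, extract_mesh, upscale_views, bake_texture) is a hypothetical placeholder standing in for the corresponding Edify 3D component, not an actual API.

def generate_asset(prompt,
                   sample_rgb_views,      # multi-view diffusion: text -> RGB views
                   sample_normal_views,   # multi-view ControlNet: RGB views + text -> surface normals
                   reconstruct_latents,   # reconstruction model: RGB + normals -> latent tokens
                   extract_mesh,          # isosurface extraction + mesh post-processing
                   upscale_views,         # upscaling ControlNet conditioned on mesh rasterizations
                   bake_texture):         # back-project high-res views onto the UV texture map
    rgb_views = sample_rgb_views(prompt)
    normal_views = sample_normal_views(rgb_views, prompt)
    latents = reconstruct_latents(rgb_views, normal_views)
    mesh = extract_mesh(latents)
    hires_views = upscale_views(mesh, rgb_views)
    return bake_texture(mesh, hires_views)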


Multi-view Diffusion Model

The multi-view image generation process adapts text-to-image diffusion models into pose-aware, multi-view diffusion models by conditioning on camera poses. Given a text prompt and camera orientation, these models synthesize an object's appearance from multiple perspectives. Variants include a base model that generates RGB appearance, a ControlNet model that produces surface normals based on RGB synthesis and text, and an upscaling ControlNet for high-resolution output conditioned on texture and surface normals. Built on the Edify Image model, enhancements to the self-attention layer enable cross-view attention, while camera poses encoded through a lightweight MLP are integrated as time embeddings.
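The PyTorch sketch below illustrates these two adaptations under our own assumptions about shapes and dimensions: a lightweight MLP that maps a flattened camera pose into the time-embedding space, and a self-attention call whose input is reshaped so that tokens attend across all views of the same object. The module names and sizes are illustrative, not the released architecture.

import torch
import torch.nn as nn

class CameraPoseEmbed(nn.Module):
    # Hypothetical pose encoder: flattened 3x4 extrinsics -> time-embedding width.
    def __init__(self, pose_dim: int = 12, embed_dim: int = 1280):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, pose: torch.Tensor, time_emb: torch.Tensor) -> torch.Tensor:
        # Pose conditioning is injected alongside the time embedding.
        return time_emb + self.mlp(pose)

def cross_view_attention(attn: nn.MultiheadAttention, x: torch.Tensor) -> torch.Tensor:
    # x: (batch, views, tokens, channels). Folding the view axis into the token
    # axis lets an existing self-attention layer attend across viewpoints.
    b, v, n, c = x.shape
    x = x.reshape(b, v * n, c)
    out, _ = attn(x, x, x, need_weights=False)
    return out.reshape(b, v, n, c)

# Example: 2 objects, 4 views each, 64 spatial tokens of width 320 (illustrative sizes).
attn = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)
x = torch.randn(2, 4, 64, 320)
y = cross_view_attention(attn, x)           # (2, 4, 64, 320)

pose_embed = CameraPoseEmbed()
poses = torch.randn(8, 12)                  # one flattened 3x4 extrinsic per view
time_emb = torch.randn(8, 1280)
cond = pose_embed(poses, time_emb)          # (8, 1280)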


Our multi-view diffusion model scales effectively, with training on a greater number of viewpoints producing more natural and consistent images. During inference, the model can sample an arbitrary number of viewpoints while preserving multi-view consistency, facilitating comprehensive object coverage and enhancing the quality of downstream 3D reconstructions.


Reconstruction Model

Extracting 3D structure from images, commonly known as photogrammetry, is fundamental to many 3D reconstruction tasks. Our approach uses a Transformer-based model to generate 3D mesh geometry, texture, and material maps from multi-view images, generalizing well to unseen objects, including images synthesized by our 2D diffusion models. The model conditions on RGB and normal images to predict latent triplane representations, which support SDF-based volume rendering of PBR properties. The neural SDF is converted to a 3D mesh via isosurface extraction, and post-processing, including quad-mesh retopology, UV mapping, and baking the PBR properties into texture and material maps, yields an editable, design-ready asset suitable for artistic applications.
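As a hedged illustration of the triplane-based field, the PyTorch sketch below queries a triplane latent at 3D points and predicts an SDF value along with albedo and material channels. The module, its channel counts, and the plane layout are assumptions for illustration, not the released model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneFieldHead(nn.Module):
    # Hypothetical head: 1 SDF value, 3 albedo channels, 2 material channels.
    def __init__(self, feat_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1 + 3 + 2),
        )

    def forward(self, planes: torch.Tensor, points: torch.Tensor):
        # planes: (B, 3, C, H, W) triplane features; points: (B, N, 3) in [-1, 1].
        feats = []
        for i, (a, b) in enumerate([(0, 1), (0, 2), (1, 2)]):          # XY, XZ, YZ planes
            uv = points[..., [a, b]].unsqueeze(2)                      # (B, N, 1, 2)
            f = F.grid_sample(planes[:, i], uv, align_corners=False)   # (B, C, N, 1)
            feats.append(f.squeeze(-1).transpose(1, 2))                # (B, N, C)
        out = self.mlp(torch.cat(feats, dim=-1))
        sdf = out[..., :1]                          # signed distance
        albedo = out[..., 1:4].sigmoid()            # base color in [0, 1]
        material = out[..., 4:].sigmoid()           # material channels in [0, 1]
        return sdf, albedo, material

# Example query: a batch of 2 triplanes sampled at 1024 points each (illustrative sizes).
head = TriplaneFieldHead()
planes = torch.randn(2, 3, 32, 64, 64)
points = torch.rand(2, 1024, 3) * 2 - 1
sdf, albedo, material = head(planes, points)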

Our reconstruction model also scales effectively: performance improves as the number of input viewpoints increases, and training with more views further improves reconstruction accuracy. For the same model, reconstruction quality additionally scales with the number of triplane tokens, making the approach adaptable to the available computational budget. The tables below report albedo LPIPS, material L2, and depth L2 losses for different numbers of input and validation views.

Albedo LPIPS Loss
                  Validation views
Input views       4         4 (diag.)   8         16
4                 0.0732    0.0791      0.0762    0.0768
4 (diag.)         0.0802    0.0756      0.0779    0.0783
8                 0.0691    0.0698      0.0695    0.0699
16                0.0687    0.0689      0.0688    0.0687

Material L2 Loss
                  Validation views
Input views       4         4 (diag.)   8         16
4                 0.0015    0.0020      0.0017    0.0018
4 (diag.)         0.0024    0.0019      0.0022    0.0022
8                 0.0013    0.0012      0.0013    0.0013
16                0.0012    0.0013      0.0013    0.0013

Depth L2 Loss
                  Validation views
Input views       4         4 (diag.)   8         16
4                 0.0689    0.0751      0.0720    0.0722
4 (diag.)         0.0704    0.0683      0.0694    0.0696
8                 0.0626    0.0641      0.0633    0.0633
16                0.0613    0.0626      0.0619    0.0616

Citation

Please cite this work as "NVIDIA et al." and use the following BibTeX entry:

@article{nvidia2024edify3d,
  title     = {Edify 3D: Scalable High-Quality 3D Asset Generation},
  author    = {NVIDIA and Bala, Maciej and Cui, Yin and Ding, Yifan and Ge, Yunhao and Hao, Zekun and Hasselgren, Jon and Huffman, Jacob and Jin, Jingyi and Lewis, J.P. and Li, Zhaoshuo and Lin, Chen-Hsuan and Lin, Yen-Chen and Lin, Tsung-Yi and Liu, Ming-Yu and Luo, Alice and Ma, Qianli and Munkberg, Jacob and Shi, Stella and Wei, Fangyin and Xiang, Donglai and Xu, Jiashu and Zeng, Xiaohui and Zhang, Qinsheng},
  journal   = {arXiv preprint arXiv:2411.07135},
  year      = {2024}
}

Core Contributors

System design: Chen-Hsuan Lin, Xiaohui Zeng, Zhaoshuo Li, Zekun Hao, Ming-Yu Liu, Tsung-Yi Lin.
Multi-view diffusion model: Xiaohui Zeng, Qinsheng Zhang, Ming-Yu Liu, Tsung-Yi Lin.
Reconstruction model: Zhaoshuo Li, Chen-Hsuan Lin, Zekun Hao, Yen-Chen Lin, Ming-Yu Liu, Tsung-Yi Lin.
3D data processing: Zekun Hao, Fangyin Wei, Yin Cui, Yunhao Ge, Yifan Ding, Donglai Xiang, Qianli Ma, Jacob Munkberg, Jon Hasselgren, Chen-Hsuan Lin, Tsung-Yi Lin.
Mesh post-processing: Donglai Xiang, Qianli Ma, J.P. Lewis, Zekun Hao, Zhaoshuo Li, Fangyin Wei, Xiaohui Zeng, Jingyi Jin, Chen-Hsuan Lin, Tsung-Yi Lin.
Evaluation: Jingyi Jin, Xiaohui Zeng, Zhaoshuo Li, Qianli Ma, Yen-Chen Lin, Chen-Hsuan Lin, Tsung-Yi Lin.

Contributors

Maciej Bala, Jacob Huffman, Alice Luo, Stella Shi, Jiashu Xu.

Acknowledgements

We thank Lars Bishop, Sanja Fidler, Jun Gao, Jinwei Gu, Aaron Lefohn, Arun Mallya, Hanzi Mao, Seungjun Nah, Fitsum Reda, David Romero Guzman, Rohan Sawhney, Nicholas Sharp, Tianchang Shen, Peter Shipkov, Towaki Takikawa, Heng Wang, and Martin Watt for useful research discussions and prototyping. We also thank Margaret Albrecht, Arslan Ali, Amelia Barton, Lucas Brown, Matt Catrett, Douglas Chang, Steve Chappell, Gerardo Delgado Cabrera, Amol Fasale, Daniela Flamm Jackson, Sandra Froehlich, Devika Ghaisas, Brett Hamilton, Mohammad Harrim, Nathan Horrocks, Akan Huang, Sophia Huang, Pooya Jannaty, Pranjali Joshi, Tobias Lasser, Gabriele Leone, Aaron Licata, Ashlee Martino-Tarr, Alexandre Milesi, Pawel Morkisz, Jashojit Mukherjee, Brad Nemire, Dade Orgeron, Mitesh Patel, Jason Paul, Joel Pennington, Lyne Tchapmi, Jibin Varghese, Thomas Volk, Raju Wagwani, and Josh Young for feedback and support.