3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
CVPR 2025
- Zhaoxi Chen1
- Jiaxiang Tang2
- Yuhao Dong1,3
- Ziang Cao1
- Fangzhou Hong1
- Yushi Lan1
- Tengfei Wang3
- Haozhe Xie1
- Tong Wu3,4
- Shunsuke Saito
- Liang Pan3
- Dahua Lin3,4
- Ziwei Liu1
1 Nanyang Technological University  2 Peking University
3 Shanghai AI Laboratory  4 The Chinese University of Hong Kong
Abstract
3DTopia-XL scales high-quality 3D asset generation with a Diffusion Transformer (DiT) built upon an expressive and efficient 3D representation, PrimX. The denoising process takes 5 seconds to generate a 3D PBR asset from text or image input, ready for use in standard graphics pipelines.
Key Idea: Primitive Diffusion
To fully utilize the scalability of the Diffusion Transformer (DiT), our key idea is a novel 3D representation, namely PrimX. It explicitly encodes the 3D shape, textures, and materials of a textured mesh into a compact N x D tensor. Each token in this representation is a volumetric primitive anchored to the shape surface, with a voxelized payload that encodes SDF, RGB, and material values. Here we visualize the denoising process of PrimX:
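To make the layout concrete, below is a minimal PyTorch sketch of how such a primitive set could be packed into the N x D tensor. All dimensions (number of primitives, payload resolution, channel split) are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch

# Illustrative PrimX-style layout (all sizes below are assumptions for the sketch).
N = 2048          # number of volumetric primitives (tokens)
res = 8           # voxel resolution of each primitive's payload (assumed)
C = 6             # payload channels: SDF (1) + RGB (3) + material (2), e.g. roughness/metallic

# Each primitive: a 3D anchor on the shape surface, a scalar scale, and a voxelized payload.
positions = torch.rand(N, 3)                 # anchor positions on the surface
scales = torch.rand(N, 1)                    # per-primitive spatial extent
payload = torch.rand(N, C, res, res, res)    # SDF / RGB / material values per voxel

# Flatten everything into the compact N x D tensor consumed by the Diffusion Transformer.
primx = torch.cat([positions, scales, payload.flatten(1)], dim=1)
print(primx.shape)  # (N, 3 + 1 + C * res**3)
```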
Gallery: Image-to-3D
We use our single-view conditioned model for image-to-3D generation.
The exported 3D asset can be seamlessly imported into Blender for rendering.
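For reference, a minimal Blender Python sketch for importing the exported asset, assuming it has been saved as a GLB file at a hypothetical path:

```python
import bpy

# Import the exported GLB asset into the current Blender scene.
bpy.ops.import_scene.gltf(filepath="/path/to/asset.glb")
```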