3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
Technical Report

Abstract

3DTopia-XL scales high-quality 3D asset generation using a Diffusion Transformer (DiT) built upon an expressive and efficient 3D representation, PrimX. The denoising process takes 5 seconds to generate a 3D PBR asset from text or image input, ready for use in a graphics pipeline.



Key Idea: Primitive Diffusion

To fully exploit the scalability of the Diffusion Transformer (DiT), our key idea is a novel 3D representation, PrimX. It explicitly encodes the 3D shape, textures, and materials of a textured mesh into a compact N x D tensor. Each token in this representation is a volumetric primitive anchored to the shape surface, with a voxelized payload that encodes SDF, RGB, and materials. Here we visualize the process of denoising PrimX.
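As a rough illustration of how one row of the N x D tensor could be laid out, the sketch below packs a single primitive into a flat token. The names (`PrimLayout`, `pack_token`, `unpack_token`) are hypothetical. The per-primitive position and scale fields, the voxel resolution of 8, and the channel split (1 SDF, 3 RGB, 2 material) are assumptions made for the example, not values stated on this page.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class PrimLayout:
    """Hypothetical layout of one primitive token (all sizes are assumptions)."""
    voxel_res: int = 8   # assumed voxels per axis in the payload grid
    sdf_ch: int = 1      # signed distance channel
    rgb_ch: int = 3      # color channels
    mat_ch: int = 2      # assumed material channels, e.g. metallic and roughness

    @property
    def payload_size(self) -> int:
        # Number of scalars in the voxelized payload.
        channels = self.sdf_ch + self.rgb_ch + self.mat_ch
        return self.voxel_res ** 3 * channels

    @property
    def token_dim(self) -> int:
        # D = 3 (position) + 1 (scale) + flattened payload.
        return 3 + 1 + self.payload_size


def pack_token(layout: PrimLayout, position: Sequence[float],
               scale: float, payload: Sequence[float]) -> List[float]:
    """Flatten one primitive into a length-D list."""
    if len(position) != 3:
        raise ValueError("position must have 3 components")
    if len(payload) != layout.payload_size:
        raise ValueError("payload has the wrong number of scalars")
    return [*position, scale, *payload]


def unpack_token(layout: PrimLayout,
                 token: Sequence[float]) -> Tuple[List[float], float, List[float]]:
    """Split a length-D token back into position, scale, and payload."""
    if len(token) != layout.token_dim:
        raise ValueError("token has the wrong length")
    return list(token[:3]), token[3], list(token[4:])
```

Stacking N such tokens row by row gives the N x D tensor that the diffusion model denoises. Under these assumed sizes, `PrimLayout().token_dim` is 3 + 1 + 8^3 * 6 = 3076.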



Gallery: Image-to-3D

We use our single-view conditioned model for image-to-3D generation.
The exported 3D asset can be seamlessly imported into Blender for rendering.
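Before importing an exported asset, it can be useful to confirm that the file is a well-formed binary glTF (GLB) container. The sketch below reads the 12-byte GLB header defined by the glTF 2.0 specification (the ASCII magic `glTF`, a version, and the total length, all little-endian). The function name `read_glb_header` is hypothetical and is not part of the 3DTopia-XL code.

```python
import struct
from typing import Tuple

GLB_MAGIC = b"glTF"   # first four bytes of every GLB file
GLB_HEADER_SIZE = 12  # magic (4 bytes) + version (uint32) + length (uint32)


def read_glb_header(data: bytes) -> Tuple[int, int]:
    """Return (version, total_length) from the start of a GLB byte string."""
    if len(data) < GLB_HEADER_SIZE:
        raise ValueError("too short to contain a GLB header")
    magic, version, length = struct.unpack("<4sII", data[:GLB_HEADER_SIZE])
    if magic != GLB_MAGIC:
        raise ValueError("not a GLB file: bad magic bytes")
    return version, length
```

For a file on disk, the header can be read with `open(path, "rb").read(12)` and passed to the function; a glTF 2.0 asset should report version 2.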



Exported GLB Mesh



Full Presentation

Citation

Acknowledgements

The website template is borrowed from Mip-NeRF