iHuman: Instant Animatable Digital Humans From Monocular Videos
This project introduces an efficient framework for creating high-quality, animatable 3D digital avatars from a single monocular video in just 15 seconds, while consuming very little GPU memory.
Creating personalized 3D avatars traditionally requires expensive multi-camera setups and lengthy processing on high-end GPUs. Existing methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting often take minutes to hours to optimize a single human subject. Moreover, the resulting avatars fail to faithfully reconstruct fine geometric details such as clothing folds or facial features. These limitations hinder the adoption of digital-human technology in real-time applications such as telepresence, gaming, and the metaverse, where rapid generation and realism are essential.
Our objective is to develop a fast, simple, yet effective method for generating animatable digital humans. Taking as input a standard monocular video of a person in motion, along with the corresponding pose parameters, we reconstruct the individual within seconds using a hybrid representation of 3D Gaussians and a mesh, capturing high-fidelity personalized details.
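One way to picture such a hybrid representation is to anchor each 3D Gaussian to a mesh triangle via barycentric coordinates, so that deforming the mesh into a new pose instantly re-poses the Gaussians. The sketch below is purely illustrative (the function and variable names are hypothetical, not the paper's actual implementation):

```python
import numpy as np

def gaussian_centers(vertices, faces, barycentric):
    """Place one Gaussian per face at the given barycentric coordinates.

    vertices: (V, 3) mesh vertex positions
    faces: (F, 3) vertex indices per triangle
    barycentric: (F, 3) weights of each triangle corner
    """
    tri = vertices[faces]                              # (F, 3, 3) corners
    return np.einsum('fij,fi->fj', tri, barycentric)   # (F, 3) centers

# A single-triangle "mesh" in rest pose (hypothetical toy data).
rest_vertices = np.array([[0.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])
bary = np.array([[1 / 3, 1 / 3, 1 / 3]])   # Gaussian at the centroid

centers = gaussian_centers(rest_vertices, faces, bary)

# "Animating" here is just moving the mesh; the anchored Gaussian
# follows for free, which is why reposing is instant.
posed_vertices = rest_vertices + np.array([0.0, 0.0, 1.0])
posed_centers = gaussian_centers(posed_vertices, faces, bary)
```

In a full system the mesh would be deformed by the input pose parameters (e.g. via linear blend skinning) rather than a simple translation, but the anchoring principle is the same.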
The iHuman framework achieves state-of-the-art speed and efficiency for human reconstruction, creating digital avatars orders of magnitude faster than previous NeRF-based approaches while operating with a very low memory footprint (approx. 2 GB). Despite this speed, the system delivers comparable or better visual quality and faithfully reconstructs the subject's fine geometric features.
To validate accuracy, the results were compared against previous state-of-the-art methods on standard benchmarks. These include image-fidelity metrics (PSNR, LPIPS, and SSIM) to evaluate visual sharpness and color accuracy, and a 3D structural-accuracy metric (vertex-to-vertex error) to verify that the reconstructed shape closely matches the original subject. The final output is provided in a hybrid format, combining the photorealism of 3D Gaussians with the structural integrity of a 3D mesh, allowing the digital human to be instantly animated into any desired pose.
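Of the image-fidelity metrics above, PSNR is the simplest to state: it is the log-scaled ratio of the peak signal to the mean squared error between the rendered and reference images. A minimal sketch using the standard textbook definition (not the project's evaluation code):

```python
import numpy as np

def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between two images in [0, max_val]."""
    mse = np.mean((rendered.astype(np.float64)
                   - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')  # identical images: infinite PSNR
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: a flat reference image and a rendering with a uniform
# error of 0.01 per pixel, so MSE = 1e-4 and PSNR = 40 dB.
ref = np.full((4, 4), 0.5)
noisy = ref + 0.01
score = psnr(noisy, ref)  # 40.0
```

SSIM and LPIPS are more involved (windowed statistics and a learned deep-feature distance, respectively) and are typically computed with library implementations rather than by hand.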
ECCV 2024 Poster Presentation