Method
Overview of our 3D human reconstruction pipeline. In this pipeline, multi-view normal and RGB images are generated from the input image by an image-to-multi-view (I2MV) diffusion model. These images are then converted into a 3D representation using explicit human carving. In this work, we propose post-training the I2MV diffusion model to achieve better alignment with accurate poses in dynamic and acrobatic scenarios.
Overview of DrPose. Given a 3D human pose $\theta$ and an input image $I$, the I2MV diffusion model $\epsilon_{\omega}$ is trained to minimize $\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{reward}}+w_{\mathrm{KL}}\cdot \mathcal{L}_{\mathrm{KL}}$. Here, $\mathcal{L}_{\mathrm{reward}}$ measures the distance between $\theta$ and the generated latent image $x_0$, while $\mathcal{L}_{\mathrm{KL}}$ computes the KL divergence between $\epsilon_{\omega}$ and the frozen initial I2MV diffusion model $\epsilon_{\omega_0}$.
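The combined objective $\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{reward}}+w_{\mathrm{KL}}\cdot \mathcal{L}_{\mathrm{KL}}$ can be sketched as follows. This is a minimal, illustrative NumPy sketch, not the authors' implementation: the concrete loss forms are assumptions (MSE as a stand-in for the pose-distance reward, and MSE between the fine-tuned and frozen models' noise predictions as a common Gaussian-KL surrogate in diffusion post-training), and all function and variable names are hypothetical.

```python
import numpy as np

def reward_loss(pose_target, pose_pred):
    # Distance between the target pose theta and the pose associated with
    # the generated latent x_0 (MSE here is an illustrative stand-in).
    return float(np.mean((pose_target - pose_pred) ** 2))

def kl_regularizer(eps_new, eps_ref):
    # Keeps the post-trained model eps_omega close to the frozen initial
    # model eps_omega0; MSE between noise predictions is a common
    # Gaussian-KL surrogate (assumed form, not the paper's exact one).
    return float(np.mean((eps_new - eps_ref) ** 2))

def total_loss(pose_target, pose_pred, eps_new, eps_ref, w_kl=0.1):
    # L_total = L_reward + w_KL * L_KL
    return reward_loss(pose_target, pose_pred) + w_kl * kl_regularizer(eps_new, eps_ref)
```

In practice the reward and KL terms act in tension: the reward pulls generations toward the target pose, while the KL term (weighted by $w_{\mathrm{KL}}$) prevents the fine-tuned model from drifting too far from the initial I2MV model's distribution.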