🐎 Ponymation: Learning 3D Animal Motions from Unlabeled Online Videos

1CUHK MMLab, 2Stanford University, 3UT Austin
(* Equal Contribution, † Equal Advising)

Our method learns a generative model of articulated 3D motions from raw, unlabeled online videos. Given a single image at inference time, it generates diverse, plausible 4D motion sequences.


We introduce Ponymation, a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos. Unlike existing approaches for motion synthesis, our model does not require any pose annotations or parametric shape models for training, and is learned purely from a collection of raw video clips obtained from the Internet. We build upon a recent work, MagicPony, which learns articulated 3D animal shapes purely from single image collections, and extend it on two fronts. First, instead of training on static images, we augment the framework with a video training pipeline that incorporates temporal regularizations, achieving more accurate and temporally consistent reconstructions. Second, we learn a generative model of the underlying articulated 3D motion sequences via a spatio-temporal transformer VAE, simply using 2D reconstruction losses without relying on any explicit pose annotations. At inference time, given a single 2D image of a new animal instance, our model reconstructs an articulated, textured 3D mesh, and generates plausible 3D animations by sampling from the learned motion latent space.
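To make the motion-VAE component concrete, below is a minimal, hypothetical sketch of a spatio-temporal transformer VAE over articulated pose sequences. All names, dimensions, and layer choices here are illustrative assumptions, and the loss shown is a plain pose-reconstruction loss plus a KL term rather than the 2D reconstruction losses used in the actual pipeline.

```python
import torch
import torch.nn as nn

class MotionVAE(nn.Module):
    """Illustrative spatio-temporal transformer VAE over pose sequences.

    Input: (B, T, J * 3) flattened per-frame articulation parameters.
    All hyperparameters are hypothetical, not Ponymation's actual config.
    """

    def __init__(self, num_joints=20, pose_dim=3, d_model=128,
                 latent_dim=64, num_layers=2, max_len=16):
        super().__init__()
        self.inp = nn.Linear(num_joints * pose_dim, d_model)
        # Learned temporal positional embedding.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        self.from_z = nn.Linear(latent_dim, d_model)
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, num_joints * pose_dim)

    def encode(self, poses):
        # poses: (B, T, J*3) -> per-sequence latent distribution.
        h = self.encoder(self.inp(poses) + self.pos[:, :poses.size(1)])
        h = h.mean(dim=1)  # pool over time into one motion code
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z, num_frames):
        # Broadcast the latent across time, then decode a pose per frame.
        h = self.from_z(z).unsqueeze(1) + self.pos[:, :num_frames]
        return self.out(self.decoder(h))

    def forward(self, poses):
        mu, logvar = self.encode(poses)
        # Reparameterization trick: sample z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        recon = self.decode(z, poses.size(1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return recon, kl
```

At inference time, generation in this sketch amounts to sampling `z` from the standard normal prior and calling `decode`, mirroring how the paper samples from the learned motion latent space to animate a reconstructed mesh.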


Motion Generation Results

Given just a single test image, our model generates diverse 4D animations in a feedforward fashion within seconds, even for out-of-distribution inputs such as abstract drawings and artifacts.


@article{sun2023ponymation,
  title     = {Ponymation: Learning 3D Animal Motions from Unlabeled Online Videos},
  author    = {Keqiang Sun and Dor Litvak and Yunzhi Zhang and Hongsheng Li and Jiajun Wu and Shangzhe Wu},
  journal   = {arXiv preprint arXiv:2312.13604},
  year      = {2023}
}

We are grateful to Zizhang Li, Feng Qiu, and Ruining Li for insightful discussions.