🐎 Ponymation: Learning 3D Animal Motions from Unlabeled Online Videos

1CUHK MMLab, 2Stanford University, 3UT Austin
(* Equal Contribution, † Equal Advising)

Our method learns a generative model of articulated 3D motions from raw, unlabeled online videos. Given a single image at inference time, it generates diverse, plausible 4D motion sequences.


We introduce Ponymation, a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos. Unlike existing approaches for motion synthesis, our model does not require any pose annotations or parametric shape models for training, and is learned purely from a collection of raw video clips obtained from the Internet. We build upon a recent work, MagicPony, which learns articulated 3D animal shapes purely from single image collections, and extend it on two fronts. First, instead of training on static images, we augment the framework with a video training pipeline that incorporates temporal regularizations, achieving more accurate and temporally consistent reconstructions. Second, we learn a generative model of the underlying articulated 3D motion sequences via a spatio-temporal transformer VAE, simply using 2D reconstruction losses without relying on any explicit pose annotations. At inference time, given a single 2D image of a new animal instance, our model reconstructs an articulated, textured 3D mesh, and generates plausible 3D animations by sampling from the learned motion latent space.
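To make the motion-VAE component concrete, below is a minimal, hypothetical sketch of a spatio-temporal transformer VAE over articulated pose sequences. All names, dimensions, and layer choices here are illustrative assumptions, and the loss shown is a plain pose-reconstruction loss plus a KL term rather than the 2D reconstruction losses used in the actual pipeline.

```python
import torch
import torch.nn as nn

class MotionVAE(nn.Module):
    """Illustrative spatio-temporal transformer VAE over pose sequences.

    Input: (B, T, J * 3) flattened per-frame articulation parameters.
    All hyperparameters are hypothetical, not Ponymation's actual config.
    """

    def __init__(self, num_joints=20, pose_dim=3, d_model=128,
                 latent_dim=64, num_layers=2, max_len=16):
        super().__init__()
        self.inp = nn.Linear(num_joints * pose_dim, d_model)
        # Learned temporal positional embedding.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        self.from_z = nn.Linear(latent_dim, d_model)
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, num_joints * pose_dim)

    def encode(self, poses):
        # poses: (B, T, J*3) -> per-sequence latent distribution.
        h = self.encoder(self.inp(poses) + self.pos[:, :poses.size(1)])
        h = h.mean(dim=1)  # pool over time into one motion code
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z, num_frames):
        # Broadcast the latent across time, then decode a pose per frame.
        h = self.from_z(z).unsqueeze(1) + self.pos[:, :num_frames]
        return self.out(self.decoder(h))

    def forward(self, poses):
        mu, logvar = self.encode(poses)
        # Reparameterization trick: sample z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        recon = self.decode(z, poses.size(1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return recon, kl
```

At inference time, generation in this sketch amounts to sampling `z` from the standard normal prior and calling `decode`, mirroring how the paper samples from the learned motion latent space to animate a reconstructed mesh.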


Motion Generation Results

Given just a single test image, our model generates diverse 4D animations in a feedforward fashion within seconds, even for out-of-distribution inputs such as abstract drawings and artifacts.


@article{sun2023ponymation,
  title     = {Ponymation: Learning 3D Animal Motions from Unlabeled Online Videos},
  author    = {Keqiang Sun and Dor Litvak and Yunzhi Zhang and Hongsheng Li and Jiajun Wu and Shangzhe Wu},
  journal   = {arXiv preprint arXiv:2312.13604},
  year      = {2023}
}

We are grateful to Zizhang Li, Feng Qiu, and Ruining Li for insightful discussions.