CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion

University of Illinois Urbana-Champaign
ECCV 2024

Abstract

Stochastic Human Motion Prediction (HMP) aims to predict multiple future human pose sequences from observed ones. Most prior works learn motion distributions through encoding-decoding in the latent space, which does not preserve the spatial-temporal structure of motion. While effective, these methods often require complex, multi-stage training and yield predictions that are inconsistent with the provided history and can be physically unrealistic. To address these issues, we propose CoMusion, a single-stage, end-to-end diffusion-based stochastic HMP framework. CoMusion is inspired by the insight that a smooth future pose initialization improves prediction performance, a strategy not previously utilized in stochastic models but evidenced in deterministic works. To generate such an initialization, CoMusion's motion predictor starts with a Transformer-based network for initial reconstruction, followed by a graph convolutional network (GCN) that refines the prediction, taking past observations into account, in the discrete cosine transform (DCT) space. Our method, facilitated by the Transformer-GCN module design and a proposed variance scheduler, excels in predicting accurate, realistic, and consistent motions while maintaining appropriate diversity. Benchmark results show that CoMusion surpasses prior methods across metrics, achieving at least a 35% relative improvement in fidelity metrics, while demonstrating superior generation quality.
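
To make the role of the DCT space more concrete, the sketch below shows how a pose sequence can be projected onto a DCT basis along the time axis and how keeping only the low-frequency coefficients yields a smooth trajectory. This is an illustration only and not the CoMusion implementation: the function names `dct_matrix` and `smooth_with_dct`, the `(frames, features)` array layout, and the hard truncation of coefficients are all assumptions made for this example.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis of shape (n, n).

    Row k holds the k-th cosine basis function sampled at n time steps.
    (Hypothetical helper for illustration.)
    """
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis *= np.sqrt(2.0 / n)
    basis[0] *= np.sqrt(0.5)   # rescale the constant row so the basis is orthonormal
    return basis


def smooth_with_dct(motion: np.ndarray, keep: int) -> np.ndarray:
    """Low-pass a motion sequence by truncating its temporal DCT.

    motion: array of shape (frames, features), e.g. features = joints * 3.
    keep:   number of low-frequency coefficients to retain (assumed knob).
    """
    n = motion.shape[0]
    basis = dct_matrix(n)
    coeffs = basis @ motion    # time domain -> DCT domain
    coeffs[keep:] = 0.0        # discard high-frequency components
    return basis.T @ coeffs    # DCT domain -> time domain
```

For example, calling `smooth_with_dct(motion, keep=1)` collapses every feature to its temporal mean, while `keep` equal to the number of frames reproduces the input exactly. Intermediate values trade detail for smoothness.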

CoMusion Architecture

Poster
