MMM: Generative Masked Motion Model

Cited by: 1
Authors
Pinyoanuntapong, Ekkasit [1 ]
Wang, Pu [1 ]
Lee, Minwoo [1 ]
Chen, Chen [2 ]
Affiliations
[1] Univ North Carolina Charlotte, Charlotte, NC 28223 USA
[2] Univ Cent Florida, Orlando, FL 32816 USA
DOI
10.1109/CVPR52733.2024.00153
CLC number
TP18 [Theory of artificial intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off between real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the precomputed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens in the place that needs editing, MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. In addition, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at https://exitudio.github.io/MMM-page/.
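The parallel, iterative decoding the abstract describes can be sketched as follows. This is a minimal illustration in the style of MaskGIT-type masked generative models, not the authors' implementation: `toy_model`, `VOCAB_SIZE`, `MASK_ID`, and the cosine masking schedule are stand-in assumptions.

```python
import numpy as np

MASK_ID = -1          # sentinel for a masked motion token (assumed)
VOCAB_SIZE = 512      # size of the motion-token codebook (assumed)

def toy_model(tokens, rng):
    """Stand-in for the conditional masked motion transformer: returns
    per-position logits over the codebook. The real model attends to
    motion and text tokens in all directions."""
    return rng.standard_normal((len(tokens), VOCAB_SIZE))

def cosine_schedule(step, total):
    # Fraction of positions left masked after this step (MaskGIT-style).
    return np.cos(0.5 * np.pi * (step + 1) / total)

def iterative_decode(seq_len, steps, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK_ID)
    for step in range(steps):
        logits = toy_model(tokens, rng)
        # Softmax over the codebook at every position.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        conf[tokens != MASK_ID] = np.inf   # already-decoded tokens stay fixed
        n_mask = int(seq_len * cosine_schedule(step, steps))
        # Fill every masked position in parallel with its prediction.
        tokens = np.where(tokens == MASK_ID, pred, tokens)
        if n_mask > 0 and step < steps - 1:
            # Re-mask the least-confident positions for the next iteration.
            remask = np.argsort(conf)[:n_mask]
            tokens[remask] = MASK_ID
    return tokens
```

The editing features the abstract mentions follow from the same loop: initialize `tokens` with known token ids and place `MASK_ID` only over the spans to be edited (a body part, an in-between gap), and the decoder fills those gaps while conditioning on the surrounding fixed tokens.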
Pages: 1546-1555
Page count: 10
Related papers
50 records total
  • [1] Masked Generative Distillation
    Yang, Zhendong
    Li, Zhe
    Shao, Mingqi
    Shi, Dachuan
    Yuan, Zehuan
    Yuan, Chun
    COMPUTER VISION, ECCV 2022, PT XI, 2022, 13671 : 53 - 69
  • [2] Fast generative adversarial networks model for masked image restoration
    Cao, Zhiyi
    Niu, Shaozhang
    Zhang, Jiwei
    Wang, Xinyi
    IET IMAGE PROCESSING, 2019, 13 (07) : 1124 - 1129
  • [3] Efficient generative model for motion deblurring
    Xiang, Han
    Sang, Haiwei
    Sun, Lilei
    Zhao, Yong
    JOURNAL OF ENGINEERING-JOE, 2020, 2020 (13): : 491 - 494
  • [4] Generative model for human motion recognition
    Excell, David
    Cemgil, A. Taylan
    Fitzgerald, William J.
    PROCEEDINGS OF THE 5TH INTERNATIONAL SYMPOSIUM ON IMAGE AND SIGNAL PROCESSING AND ANALYSIS, 2007, : 423 - 428
  • [5] A Two-Stage Deep Generative Model for Masked Face Synthesis
    Lee, Seungho
    SENSORS, 2022, 22 (20)
  • [6] Synthesis and Editing of Human Motion with Generative Human Motion Model
    Guo, Chengyu
    Ruan, Songsong
    Liang, Xiaohui
    2015 5TH INTERNATIONAL CONFERENCE ON VIRTUAL REALITY AND VISUALIZATION (ICVRV 2015), 2015, : 193 - 196
  • [7] A generative model based approach to motion segmentation
    Cremers, D
    Yuille, A
    PATTERN RECOGNITION, PROCEEDINGS, 2003, 2781 : 313 - 320
  • [8] MAGVIT: Masked Generative Video Transformer
    Yu, Lijun
    Cheng, Yong
    Sohn, Kihyuk
    Lezama, Jose
    Zhang, Han
    Chang, Huiwen
    Hauptmann, Alexander G.
    Yang, Ming-Hsuan
    Hao, Yuan
    Essa, Irfan
    Jiang, Lu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10459 - 10469
  • [9] MaCow: Masked Convolutional Generative Flow
    Ma, Xuezhe
    Kong, Xiang
    Zhang, Shanghang
    Hovy, Eduard
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [10] MaskGIT: Masked Generative Image Transformer
    Chang, Huiwen
    Zhang, Han
    Jiang, Lu
    Liu, Ce
    Freeman, William T.
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 11305 - 11315