MMM: Generative Masked Motion Model

Cited by: 1
Authors
Pinyoanuntapong, Ekkasit [1 ]
Wang, Pu [1 ]
Lee, Minwoo [1 ]
Chen, Chen [2 ]
Affiliations
[1] Univ North Carolina Charlotte, Charlotte, NC 28223 USA
[2] Univ Cent Florida, Orlando, FL 32816 USA
Keywords
DOI
10.1109/CVPR52733.2024.00153
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off between real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the precomputed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens in the place that needs editing, MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. In addition, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at https://exitudio.github.io/MMM-page/.
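The parallel, iterative decoding the abstract describes can be sketched in a few lines. The snippet below is an illustrative MaskGIT-style loop, not the authors' implementation: `predict` is a hypothetical stub standing in for the conditional masked motion transformer, and the codebook size, sequence length, and cosine unmasking schedule are all assumptions. Every position starts as a mask token; each iteration predicts all masked tokens at once, commits the most confident predictions, and re-masks the rest.

```python
import numpy as np

MASK = -1    # sentinel id for a masked motion token (assumption)
VOCAB = 512  # assumed motion-codebook size
rng = np.random.default_rng(0)

def predict(tokens):
    """Stub for the masked transformer: random scores per position.
    The real model would condition on text tokens and attend in all
    directions over the motion-token sequence."""
    return rng.random((len(tokens), VOCAB))

def iterative_decode(length=16, steps=4):
    """Fill a fully masked sequence in `steps` parallel passes."""
    tokens = np.full(length, MASK)
    for step in range(steps):
        masked = tokens == MASK
        probs = predict(tokens)
        probs = probs / probs.sum(axis=1, keepdims=True)
        ids = probs.argmax(axis=1)     # parallel prediction per position
        conf = probs.max(axis=1)       # confidence of each prediction
        conf[~masked] = np.inf         # already-committed tokens stay fixed
        # cosine schedule: how many positions remain masked after this step
        n_mask = int(np.floor(length * np.cos(np.pi / 2 * (step + 1) / steps)))
        candidate = np.where(masked, ids, tokens)  # fill masked slots
        candidate[np.argsort(conf)[:n_mask]] = MASK  # re-mask least confident
        tokens = candidate
    return tokens

motion_tokens = iterative_decode()
```

The same loop gives the editing behavior the abstract mentions: initializing `tokens` with mask sentinels only in the span to be edited (instead of everywhere) makes the model fill just those gaps while the surrounding tokens stay fixed.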
Pages: 1546 - 1555
Page count: 10
Related Papers
50 in total
  • [21] BartSmiles: Generative Masked Language Models for Molecular Representations
    Chilingaryan, Gayane
    Tamoyan, Hovhannes
    Tevosyan, Ani
    Babayan, Nelly
    Hambardzumyan, Karen
    Navoyan, Zaven
    Aghajanyan, Armen
    Khachatrian, Hrant
    Khondkaryan, Lusine
    JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2024, 64 (15) : 5832 - 5843
  • [22] HumanMAC: Masked Motion Completion for Human Motion Prediction
    Chen, Ling-Hao
    Zhang, Jiawei
    Li, Yewen
    Pang, Yiren
    Xia, Xiaobo
    Liu, Tongliang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 9510 - 9521
  • [23] A generative model for motion synthesis and blending using probability density estimation
    Okwechime, Dumebi
    Bowden, Richard
    ARTICULATED MOTION AND DEFORMABLE OBJECTS, PROCEEDINGS, 2008, 5098 : 218 - 227
  • [24] GENERATIVE MODEL AND ASSOCIATED METRIC FOR COORDINATED-MOTION TARGET GROUPS
    Legrand, Leo
    Giremus, Audrey
    Grivel, Eric
    Ratton, Laurent
    Joseph, Bernard
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 3984 - 3988
  • [25] Embodied learning of a generative neural model for biological motion perception and inference
    Schrodt, Fabian
    Layher, Georg
    Neumann, Heiko
    Butz, Martin V.
    FRONTIERS IN COMPUTATIONAL NEUROSCIENCE, 2015, 9
  • [26] DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models
    He, Zhengfu
    Sun, Tianxiang
    Tang, Qiong
    Wang, Kuanning
    Huang, Xuanjing
    Qiu, Xipeng
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 4521 - 4534
  • [27] MaskPLAN: Masked Generative Layout Planning from Partial Input
    Zhang, Hang
    Savov, Anton
    Dillenburger, Benjamin
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 8964 - 8973
  • [28] Masked Image Inpainting Algorithm Based on Generative Adversarial Nets
    Cao Z.-Y.
    Niu S.-Z.
    Zhang J.-W.
2018, Beijing University of Posts and Telecommunications, 41: 81 - 86
  • [29] Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
    Pascual, Santiago
    Yeh, Chunghsin
    Tsiamas, Ioannis
    Serra, Joan
    COMPUTER VISION - ECCV 2024, PT LXXXVII, 2025, 15145 : 247 - 264
  • [30] CovarianceNet: Conditional Generative Model for Correct Covariance Prediction in Human Motion Prediction
    Postnikov, Aleksey
    Gamayunov, Aleksander
    Ferrer, Gonzalo
    2021 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2021, : 1031 - 1037