Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation

Cited by: 11
Authors
Shen, Xiaolong [1, 2]
Yang, Zongxin [1]
Wang, Xiaohan [1]
Ma, Jianxin [2]
Zhou, Chang [2]
Yang, Yi [1]
Affiliations
[1] Zhejiang Univ, CCAI, ReLER, Hangzhou, Zhejiang, Peoples R China
[2] Alibaba Grp, DAMO Acad, Hangzhou, Peoples R China
Keywords
REPRESENTATION;
DOI
10.1109/CVPR52729.2023.00858
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness. Although these two metrics depend on different ranges of temporal consistency, existing state-of-the-art methods treat them as a unified problem and build their networks from a single, uniform modeling structure (e.g., RNN or attention-based blocks). However, a single kind of modeling structure struggles to balance the learning of short-term and long-term temporal correlations and may bias the network toward one of them, leading to undesirable predictions such as global location shift, temporal inconsistency, and insufficient local details. To solve these problems, we propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, Global-to-Local Transformer (GLoT). First, a global transformer is introduced with a Masked Pose and Shape Estimation strategy for long-term modeling. The strategy encourages the global transformer to learn more inter-frame correlations by randomly masking the features of several frames. Second, a local transformer exploits local details on the human mesh and interacts with the global transformer through cross-attention. Moreover, a Hierarchical Spatial Correlation Regressor is introduced to refine intra-frame estimations using the decoupled global-local representation and implicit kinematic constraints. Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at https://github.com/sxl142/GLoT.
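The abstract says the masking strategy works "by randomly masking the features of several frames". The following is a minimal sketch of that idea only, not the authors' implementation (see the linked repository for that). The function name `mask_frame_features`, the `mask_ratio` parameter, and the all-zero mask token are illustrative assumptions that do not come from the record.

```python
import random
from typing import List, Sequence, Tuple


def mask_frame_features(
    features: Sequence[Sequence[float]],
    mask_ratio: float,
    mask_token: Sequence[float],
    rng: random.Random,
) -> Tuple[List[List[float]], List[int]]:
    """Replace a random subset of per-frame feature vectors with a mask token.

    Hypothetical helper: returns the masked sequence and the sorted indices
    of the frames that were masked.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    num_frames = len(features)
    # Number of frames to hide (assumed rule: floor of ratio * length).
    num_masked = int(num_frames * mask_ratio)
    masked_idx = sorted(rng.sample(range(num_frames), num_masked))
    masked_set = set(masked_idx)
    # Masked frames get a copy of the token; other frames are copied unchanged.
    out = [
        list(mask_token) if t in masked_set else list(frame)
        for t, frame in enumerate(features)
    ]
    return out, masked_idx


# Toy sequence: 8 frames, each with a 4-dimensional feature vector.
frames = [[float(t + 1)] * 4 for t in range(8)]
masked, idx = mask_frame_features(frames, 0.5, [0.0] * 4, random.Random(0))
```

In a training setup along these lines, a temporal model would receive `masked` and be asked to predict pose and shape for every frame, so the hidden frames can only be recovered from their neighbours. In practice the mask token would more likely be a learned vector than a constant.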
Pages: 8887 - 8896
Page count: 10
Related papers
50 in total
  • [31] Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation
    Chai, Wenhao
    Jiang, Zhongyu
    Hwang, Jenq-Neng
    Wang, Gaoang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 14609 - 14619
  • [32] A self-supervised spatio-temporal attention network for video-based 3D infant pose estimation
    Yin, Wang
    Chen, Linxi
    Huang, Xinrui
    Huang, Chunling
    Wang, Zhaohong
    Bian, Yang
    Wan, You
    Zhou, Yuan
    Han, Tongyan
    Yi, Ming
    MEDICAL IMAGE ANALYSIS, 2024, 96
  • [33] Context Modeling in 3D Human Pose Estimation: A Unified Perspective
    Ma, Xiaoxuan
    Su, Jiajun
    Wang, Chunyu
    Ci, Hai
    Wang, Yizhou
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 6234 - 6243
  • [34] Occlusion-Aware Networks for 3D Human Pose Estimation in Video
    Cheng, Yu
    Yang, Bo
    Wang, Bo
    Yan, Wending
    Tan, Robby T.
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 723 - 732
  • [35] Self-supervised 3D human pose estimation from video
    Gholami, Mohsen
    Rezaei, Ahmad
    Rhodin, Helge
    Ward, Rabab
    Wang, Z. Jane
    NEUROCOMPUTING, 2022, 488 : 97 - 106
  • [36] POCO: 3D Pose and Shape Estimation with Confidence
    Dwivedi, Sai Kumar
    Schmid, Cordelia
    Yi, Hongwei
    Black, Michael J.
    Tzionas, Dimitrios
    2024 INTERNATIONAL CONFERENCE IN 3D VISION, 3DV 2024, 2024, : 85 - 95
  • [37] EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild
    Kaufmann, Manuel
    Song, Jie
    Guo, Chen
    Shen, Kaiyue
    Jiang, Tianjian
    Tang, Chengcheng
    Zarate, Juan Jose
    Hilliges, Otmar
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 14586 - 14597
  • [38] SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation
    Xu, Xiangyu
    Liu, Lijuan
    Yan, Shuicheng
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (05) : 3275 - 3289
  • [39] Generative estimation of 3D human pose using shape contexts matching
    Zhao, Xu
    Liu, Yuncai
    COMPUTER VISION - ACCV 2007, PT I, PROCEEDINGS, 2007, 4843 : 419 - 429
  • [40] Reducing Depth Ambiguity in 3D Human Pose and Body Shape Estimation
    Maruyama, Gakuto
    Kaneko, Naoshi
    Ito, Seiya
    Sumi, Kazuhiko
    FIFTEENTH INTERNATIONAL CONFERENCE ON QUALITY CONTROL BY ARTIFICIAL VISION, 2021, 11794