Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation

Citations: 11
Authors
Shen, Xiaolong [1, 2]
Yang, Zongxin [1 ]
Wang, Xiaohan [1 ]
Ma, Jianxin [2 ]
Zhou, Chang [2 ]
Yang, Yi [1 ]
Affiliations
[1] Zhejiang Univ, CCAI, ReLER, Hangzhou, Zhejiang, Peoples R China
[2] Alibaba Grp, DAMO Acad, Hangzhou, Peoples R China
Keywords
REPRESENTATION
DOI
10.1109/CVPR52729.2023.00858
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness. Although these two metrics depend on different ranges of temporal consistency, existing state-of-the-art methods treat them as a unified problem and build their networks from a single type of modeling structure (e.g., an RNN or attention-based block). However, a single kind of modeling structure struggles to balance the learning of short-term and long-term temporal correlations and may bias the network toward one of them, leading to undesirable predictions such as global location shift, temporal inconsistency, and insufficient local details. To solve these problems, we propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT). First, a global transformer is introduced with a Masked Pose and Shape Estimation strategy for long-term modeling. The strategy encourages the global transformer to learn more inter-frame correlations by randomly masking the features of several frames. Second, a local transformer is responsible for exploiting local details on the human mesh and interacts with the global transformer through cross-attention. Moreover, a Hierarchical Spatial Correlation Regressor is introduced to refine intra-frame estimations using the decoupled global-local representation and implicit kinematic constraints. Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at https://github.com/sxl142/GLoT.
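The masking step described in the abstract (randomly masking the features of several frames before long-term modeling) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_frame_features`, the `mask_ratio` parameter, and the all-zero `mask_token` are hypothetical choices made for illustration, and the per-frame features are plain Python lists rather than real network activations.

```python
import random


def mask_frame_features(features, mask_ratio, mask_token, rng=None):
    """Replace the features of randomly chosen frames with a mask token.

    features:   list of per-frame feature vectors (lists of floats)
    mask_ratio: fraction of frames to mask, in [0, 1] (assumed parameter)
    mask_token: vector substituted for each masked frame (assumed to be fixed)
    Returns (masked_features, masked_indices).
    """
    rng = rng or random.Random()
    num_frames = len(features)
    num_masked = int(num_frames * mask_ratio)
    # Choose distinct frame indices uniformly at random.
    masked_indices = sorted(rng.sample(range(num_frames), num_masked))
    masked_set = set(masked_indices)
    # Copy every frame; masked frames receive the token instead of their features.
    masked = [
        list(mask_token) if t in masked_set else list(frame)
        for t, frame in enumerate(features)
    ]
    return masked, masked_indices


# Toy input: 16 frames, each with a 4-dimensional feature vector.
features = [[float(t + 1)] * 4 for t in range(16)]
mask_token = [0.0] * 4
masked, masked_indices = mask_frame_features(
    features, mask_ratio=0.5, mask_token=mask_token, rng=random.Random(0)
)
print(len(masked_indices), "of", len(features), "frames masked")
```

In a training setup of this kind, a temporal model would receive `masked` and be asked to predict pose and shape for all frames, so the masked positions can only be recovered from the surrounding frames.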
Pages: 8887-8896
Page count: 10