Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation

Cited by: 11
Authors
Shen, Xiaolong [1, 2]
Yang, Zongxin [1]
Wang, Xiaohan [1]
Ma, Jianxin [2]
Zhou, Chang [2]
Yang, Yi [1]
Affiliations
[1] Zhejiang Univ, CCAI, ReLER, Hangzhou, Zhejiang, Peoples R China
[2] Alibaba Grp, DAMO Acad, Hangzhou, Peoples R China
Keywords
REPRESENTATION;
DOI
10.1109/CVPR52729.2023.00858
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness. Although these two metrics concern different ranges of temporal consistency, existing state-of-the-art methods treat them as a unified problem and build their networks from a single type of modeling structure (e.g., an RNN or an attention-based block). However, a single kind of modeling structure struggles to balance the learning of short-term and long-term temporal correlations and may bias the network toward one of them, leading to undesirable predictions such as global location shift, temporal inconsistency, and insufficient local detail. To solve these problems, we propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT). First, a global transformer is introduced with a Masked Pose and Shape Estimation strategy for long-term modeling. This strategy encourages the global transformer to learn more inter-frame correlations by randomly masking the features of several frames. Second, a local transformer is responsible for exploiting local details of the human mesh and interacts with the global transformer through cross-attention. Moreover, a Hierarchical Spatial Correlation Regressor is introduced to refine intra-frame estimations using the decoupled global-local representation and implicit kinematic constraints. GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at https://github.com/sxl142/GLoT.
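The Masked Pose and Shape Estimation strategy hinges on one simple operation: hiding the features of a few randomly chosen frames before the sequence reaches the global transformer. The Python sketch below illustrates only that masking step, under stated assumptions. Per-frame features are plain lists of floats, the mask ratio of 0.25 is a hypothetical default, and masked frames are overwritten with a zero vector, whereas a real network would more likely use a learnable mask token. The function name `mask_frame_features` is hypothetical and is not taken from the GLoT code base.

```python
import random
from typing import List, Optional, Sequence, Tuple

Feature = List[float]  # one per-frame feature vector (assumed representation)


def mask_frame_features(
    features: Sequence[Feature],
    mask_ratio: float = 0.25,  # hypothetical default, not from the paper
    mask_token: Optional[Feature] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[List[Feature], List[int]]:
    """Replace the features of randomly chosen frames with a mask token.

    Returns a new masked sequence (the input is left untouched) and the
    sorted indices of the frames that were masked.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_frames = len(features)
    if num_frames == 0:
        return [], []
    dim = len(features[0])
    # Assumption: a zero vector stands in for a learnable mask token.
    token = list(mask_token) if mask_token is not None else [0.0] * dim
    if len(token) != dim:
        raise ValueError("mask_token must match the feature dimension")
    rng = rng or random.Random()
    # Number of frames to hide, rounded down.
    num_masked = int(num_frames * mask_ratio)
    masked_idx = sorted(rng.sample(range(num_frames), num_masked))
    masked_set = set(masked_idx)
    # Copy every frame so callers never see their input mutated.
    masked = [
        list(token) if t in masked_set else list(f)
        for t, f in enumerate(features)
    ]
    return masked, masked_idx


if __name__ == "__main__":
    # Toy sequence: 8 frames, 2-dimensional features.
    seq = [[float(t), float(t) + 0.5] for t in range(8)]
    masked, idx = mask_frame_features(seq, mask_ratio=0.25, rng=random.Random(0))
    print("masked frames:", idx)
    print(masked)
```

Assuming the network is then trained to recover pose and shape for every frame, including the masked ones, it can only fill those gaps by drawing on neighboring frames, which is the kind of inter-frame learning the strategy is meant to encourage. Returning the masked indices keeps it easy to inspect or weight those frames separately.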
Pages: 8887-8896
Number of pages: 10