PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers

Cited by: 14
Authors
Qiu, Zhongwei [1,3,4]
Yang, Qiansheng [2]
Wang, Jian [2]
Feng, Haocheng [2]
Han, Junyu [2]
Ding, Errui [2]
Xu, Chang [3]
Fu, Dongmei [1,4]
Wang, Jingdong [2]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Automat & Elect Engn, Beijing, Peoples R China
[2] Baidu, Beijing, Peoples R China
[3] Univ Sydney, Sydney, NSW, Australia
[4] Beijing Engn Res Ctr Ind Spectrum Imaging, Beijing, Peoples R China
Keywords
DOI
10.1109/CVPR52729.2023.02036
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Existing methods for multi-person video 3D human Pose and Shape Estimation (PSE) typically adopt a two-stage strategy, which first detects human instances in each frame and then performs single-person PSE with a temporal model. However, this strategy cannot capture the global spatio-temporal context among spatial instances. In this paper, we propose a new end-to-end multi-person 3D Pose and Shape estimation framework with a progressive Video Transformer, termed PSVT. In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects. Then, a spatio-temporal pose decoder (STPD) and a spatio-temporal shape decoder (STSD) capture the global dependencies between pose queries and feature tokens, and between shape queries and feature tokens, respectively. To handle the variation of objects over time, a novel progressive decoding scheme updates the pose and shape queries at each frame. In addition, we propose a novel pose-guided attention (PGA) mechanism for the shape decoder to better predict shape parameters. These two components strengthen the decoder of PSVT and improve performance. Extensive experiments on four datasets show that PSVT achieves state-of-the-art results.
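The progressive decoding idea in the abstract can be illustrated with a minimal sketch: per-frame feature tokens are attended to by a set of person queries, and the queries decoded at frame t serve as the initial queries for frame t+1. This is not the authors' implementation; it is a simplified, single-head NumPy illustration with made-up shapes, and it omits PGA, the shape decoder, and all learned projections.

```python
import numpy as np

def cross_attention(queries, tokens):
    # Scaled dot-product cross-attention: each query attends to all
    # feature tokens of one frame (single head, no learned projections).
    d = queries.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def progressive_decode(frame_tokens, init_queries):
    # Progressive decoding: the queries updated at frame t become the
    # initial queries at frame t+1, propagating identity over time.
    q = init_queries
    per_frame_outputs = []
    for tokens in frame_tokens:
        q = cross_attention(q, tokens)
        per_frame_outputs.append(q)
    return per_frame_outputs

rng = np.random.default_rng(0)
T, N, M, d = 4, 3, 16, 8                 # frames, queries, tokens/frame, dim
frames = [rng.standard_normal((M, d)) for _ in range(T)]
queries = rng.standard_normal((N, d))    # one query per tracked person
outputs = progressive_decode(frames, queries)
print(len(outputs), outputs[0].shape)    # one (N, d) output per frame
```

In the full model, each per-frame query output would be projected to SMPL pose parameters, while PGA would reuse the pose decoder's attended features to guide the shape decoder's attention.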
Pages: 21254-21263
Page count: 10