PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers

Cited by: 14
Authors:
Qiu, Zhongwei [1 ,3 ,4 ]
Yang, Qiansheng [2 ]
Wang, Jian [2 ]
Feng, Haocheng [2 ]
Han, Junyu [2 ]
Ding, Errui [2 ]
Xu, Chang [3 ]
Fu, Dongmei [1 ,4 ]
Wang, Jingdong [2 ]
Affiliations:
[1] Univ Sci & Technol Beijing, Sch Automat & Elect Engn, Beijing, Peoples R China
[2] Baidu, Beijing, Peoples R China
[3] Univ Sydney, Sydney, NSW, Australia
[4] Beijing Engn Res Ctr Ind Spectrum Imaging, Beijing, Peoples R China
DOI: 10.1109/CVPR52729.2023.02036
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Existing methods for multi-person video 3D human Pose and Shape Estimation (PSE) typically adopt a two-stage strategy: first detect human instances in each frame, then perform single-person PSE with a temporal model. However, this strategy cannot capture the global spatio-temporal context among spatial instances. In this paper, we propose a new end-to-end multi-person 3D Pose and Shape estimation framework with a progressive Video Transformer, termed PSVT. In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects. Then, a spatio-temporal pose decoder (STPD) and a spatio-temporal shape decoder (STSD) capture the global dependencies between pose queries and feature tokens, and between shape queries and feature tokens, respectively. To handle the variation of objects over time, a novel progressive decoding scheme updates pose and shape queries at each frame. In addition, we propose a novel pose-guided attention (PGA) mechanism for the shape decoder to better predict shape parameters. These two components strengthen the decoder of PSVT and improve performance. Extensive experiments on four datasets show that PSVT achieves state-of-the-art results.
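The progressive decoding scheme described in the abstract, where the pose and shape queries decoded at one frame initialize the queries for the next frame, can be sketched roughly as follows. This is a minimal, hypothetical illustration (the function names, shapes, and single-head dot-product attention are assumptions for clarity; the paper's actual decoders use learned multi-head attention with projections):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, tokens):
    # Scaled dot-product cross-attention between queries and feature tokens.
    d = queries.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d))  # (Q, T) weights
    return attn @ tokens                             # attended queries (Q, D)

def progressive_decode(frame_tokens, init_queries):
    # Decode frame by frame: queries updated at frame t seed frame t+1,
    # so the decoder tracks how each instance changes over time.
    queries = init_queries
    outputs = []
    for tokens in frame_tokens:                  # one token set per frame
        queries = cross_attention(queries, tokens)
        outputs.append(queries)                  # per-frame decoded queries
    return outputs

# Toy example: 3 frames, 5 feature tokens per frame, 2 person queries, dim 8.
rng = np.random.default_rng(0)
frames = [rng.standard_normal((5, 8)) for _ in range(3)]
outs = progressive_decode(frames, rng.standard_normal((2, 8)))
print(len(outs), outs[0].shape)  # → 3 (2, 8)
```

In PSVT this recurrence is what lets the decoder handle per-instance variation across time without re-detecting instances in every frame.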
Pages: 21254-21263 (10 pages)
Related Papers (50 records)
  • [31] Center point to pose: Multiple views 3D human pose estimation for multi-person
    Liu, Huan
    Wu, Jian
    He, Rui
    PLOS ONE, 2022, 17 (09):
  • [32] VoxelTrack: Multi-Person 3D Human Pose Estimation and Tracking in the Wild
    Zhang, Yifu
    Wang, Chunyu
    Wang, Xinggang
    Liu, Wenyu
    Zeng, Wenjun
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (02) : 2613 - 2626
  • [33] Explicit Occlusion Reasoning for Multi-person 3D Human Pose Estimation
    Liu, Qihao
    Zhang, Yi
    Bai, Song
    Yuille, Alan
    COMPUTER VISION - ECCV 2022, PT V, 2022, 13665 : 497 - 517
  • [34] End-to-End Multi-Person Audio/Visual Automatic Speech Recognition
    Braga, Otavio
    Makino, Takaki
    Siohan, Olivier
    Liao, Hank
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6994 - 6998
  • [35] Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet
    Zou, Shihao
    Xu, Yuanlu
    Li, Chao
    Ma, Lingni
    Cheng, Li
    Vo, Minh
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 4921 - 4933
  • [36] RF-based Multi-view Pose Machine for Multi-Person 3D Pose Estimation
    Xie, Chunyang
    Zhang, Dongheng
    Wu, Zhi
    Yu, Cong
    Hu, Yang
    Sun, Qibin
    Chen, Yan
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 2669 - 2674
  • [37] Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo
    Lin, Jiahao
    Lee, Gim Hee
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 11881 - 11890
  • [39] VTP: volumetric transformer for multi-view multi-person 3D pose estimation
    Chen, Yuxing
    Gu, Renshu
    Huang, Ouhan
    Jia, Gangyong
    APPLIED INTELLIGENCE, 2023, 53 (22) : 26568 - 26579
  • [40] RPM 2.0: RF-Based Pose Machines for Multi-Person 3D Pose Estimation
    Xie, Chunyang
    Zhang, Dongheng
    Wu, Zhi
    Yu, Cong
    Hu, Yang
    Chen, Yan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (01) : 490 - 503