PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers

Cited by: 6
Authors
Qiu, Zhongwei [1, 3, 4]
Yang, Qiansheng [2 ]
Wang, Jian [2 ]
Feng, Haocheng [2 ]
Han, Junyu [2 ]
Ding, Errui [2 ]
Xu, Chang [3 ]
Fu, Dongmei [1, 4]
Wang, Jingdong [2 ]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Automat & Elect Engn, Beijing, Peoples R China
[2] Baidu, Beijing, Peoples R China
[3] Univ Sydney, Sydney, NSW, Australia
[4] Beijing Engn Res Ctr Ind Spectrum Imaging, Beijing, Peoples R China
DOI
10.1109/CVPR52729.2023.02036
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Existing methods for multi-person video 3D human Pose and Shape Estimation (PSE) typically adopt a two-stage strategy, which first detects human instances in each frame and then performs single-person PSE with a temporal model. However, the global spatio-temporal context among spatial instances cannot be captured. In this paper, we propose a new end-to-end multi-person 3D Pose and Shape estimation framework with a progressive Video Transformer, termed PSVT. In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects. Then, a spatio-temporal pose decoder (STPD) and a spatio-temporal shape decoder (STSD) capture the global dependencies between pose queries and feature tokens, and between shape queries and feature tokens, respectively. To handle the variance of objects over time, a novel progressive decoding scheme updates the pose and shape queries at each frame. In addition, we propose a novel pose-guided attention (PGA) for the shape decoder to better predict shape parameters. These two components strengthen the decoder of PSVT and improve performance. Extensive experiments on four datasets show that PSVT achieves state-of-the-art results.
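The progressive decoding idea in the abstract (queries updated at each frame and carried forward) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `cross_attend` and `progressive_decode`, the single-head attention without learned projections, the residual update, and the 2-dimensional tokens are all assumptions made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Toy single-head dot-product cross-attention (no learned projections).

    Each query is replaced by an attention-weighted average of the tokens.
    """
    dim = len(tokens[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)])
    return out


def progressive_decode(init_queries, frames):
    """Decode frame by frame, carrying the updated queries to the next frame.

    `frames` is a list of per-frame token lists. The residual update
    (query + attended feature) is an assumption of this sketch.
    """
    queries = init_queries
    per_frame = []
    for tokens in frames:
        attended = cross_attend(queries, tokens)
        queries = [[qi + ai for qi, ai in zip(q, a)] for q, a in zip(queries, attended)]
        per_frame.append(queries)
    return per_frame


# Hypothetical input: two frames, two feature tokens each, one query.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
]
decoded = progressive_decode([[0.0, 0.0]], frames)
```

Here the query used for the second frame is the output of the first frame rather than a fresh initial query, which is the only property of progressive decoding this sketch is meant to show.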
Pages: 21254-21263
Page count: 10