PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers

Cited by: 14
Authors:
Qiu, Zhongwei [1 ,3 ,4 ]
Yang, Qiansheng [2 ]
Wang, Jian [2 ]
Feng, Haocheng [2 ]
Han, Junyu [2 ]
Ding, Errui [2 ]
Xu, Chang [3 ]
Fu, Dongmei [1 ,4 ]
Wang, Jingdong [2 ]
Affiliations:
[1] Univ Sci & Technol Beijing, Sch Automat & Elect Engn, Beijing, Peoples R China
[2] Baidu, Beijing, Peoples R China
[3] Univ Sydney, Sydney, NSW, Australia
[4] Beijing Engn Res Ctr Ind Spectrum Imaging, Beijing, Peoples R China
DOI: 10.1109/CVPR52729.2023.02036
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Existing methods for multi-person video 3D human Pose and Shape Estimation (PSE) typically adopt a two-stage strategy: first detect human instances in each frame, then perform single-person PSE with a temporal model. However, this strategy cannot capture the global spatio-temporal context among spatial instances. In this paper, we propose a new end-to-end multi-person 3D Pose and Shape estimation framework with a progressive Video Transformer, termed PSVT. In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects. Then, a spatio-temporal pose decoder (STPD) and a spatio-temporal shape decoder (STSD) capture the global dependencies between pose queries and feature tokens, and between shape queries and feature tokens, respectively. To handle the variation of objects over time, a novel progressive decoding scheme updates pose and shape queries at each frame. In addition, we propose a novel pose-guided attention (PGA) mechanism for the shape decoder to better predict shape parameters. These two components strengthen the decoder of PSVT and improve performance. Extensive experiments on four datasets show that PSVT achieves state-of-the-art results.
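The progressive decoding scheme described in the abstract, where the pose and shape queries decoded at one frame initialize the queries for the next frame, can be sketched roughly as follows. This is a minimal, hypothetical illustration (the function names, shapes, and single-head dot-product attention are assumptions for clarity; the paper's actual decoders use learned multi-head attention with projections):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, tokens):
    # Scaled dot-product cross-attention between queries and feature tokens.
    d = queries.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d))  # (Q, T) weights
    return attn @ tokens                             # attended queries (Q, D)

def progressive_decode(frame_tokens, init_queries):
    # Decode frame by frame: queries updated at frame t seed frame t+1,
    # so the decoder tracks how each instance changes over time.
    queries = init_queries
    outputs = []
    for tokens in frame_tokens:                  # one token set per frame
        queries = cross_attention(queries, tokens)
        outputs.append(queries)                  # per-frame decoded queries
    return outputs

# Toy example: 3 frames, 5 feature tokens per frame, 2 person queries, dim 8.
rng = np.random.default_rng(0)
frames = [rng.standard_normal((5, 8)) for _ in range(3)]
outs = progressive_decode(frames, rng.standard_normal((2, 8)))
print(len(outs), outs[0].shape)  # → 3 (2, 8)
```

In PSVT this recurrence is what lets the decoder handle per-instance variation across time without re-detecting instances in every frame.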
Pages: 21254-21263 (10 pages)
Related Papers (50 records)
  • [31] Center point to pose: Multiple views 3D human pose estimation for multi-person
    Liu, Huan
    Wu, Jian
    He, Rui
    PLOS ONE, 2022, 17 (09):
  • [32] VoxelTrack: Multi-Person 3D Human Pose Estimation and Tracking in the Wild
    Zhang, Yifu
    Wang, Chunyu
    Wang, Xinggang
    Liu, Wenyu
    Zeng, Wenjun
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (02) : 2613 - 2626
  • [33] Explicit Occlusion Reasoning for Multi-person 3D Human Pose Estimation
    Liu, Qihao
    Zhang, Yi
    Bai, Song
    Yuille, Alan
    COMPUTER VISION - ECCV 2022, PT V, 2022, 13665 : 497 - 517
  • [34] End-to-End Multi-Person Audio/Visual Automatic Speech Recognition
    Braga, Otavio
    Makino, Takaki
    Siohan, Olivier
    Liao, Hank
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6994 - 6998
  • [35] Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet
    Zou, Shihao
    Xu, Yuanlu
    Li, Chao
    Ma, Lingni
    Cheng, Li
    Vo, Minh
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 4921 - 4933
  • [36] RF-based Multi-view Pose Machine for Multi-Person 3D Pose Estimation
    Xie, Chunyang
    Zhang, Dongheng
    Wu, Zhi
    Yu, Cong
    Hu, Yang
    Sun, Qibin
    Chen, Yan
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 2669 - 2674
  • [37] Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo
    Lin, Jiahao
    Lee, Gim Hee
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 11881 - 11890
  • [39] VTP: volumetric transformer for multi-view multi-person 3D pose estimation
    Chen, Yuxing
    Gu, Renshu
    Huang, Ouhan
    Jia, Gangyong
    APPLIED INTELLIGENCE, 2023, 53 (22) : 26568 - 26579
  • [40] RPM 2.0: RF-Based Pose Machines for Multi-Person 3D Pose Estimation
    Xie, Chunyang
    Zhang, Dongheng
    Wu, Zhi
    Yu, Cong
    Hu, Yang
    Chen, Yan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (01) : 490 - 503