PPT: Token-Pruned Pose Transformer for Monocular and Multi-view Human Pose Estimation

Cited by: 16
Authors
Ma, Haoyu [1]
Wang, Zhe [1]
Chen, Yifei [2]
Kong, Deying [1]
Chen, Liangjian [3]
Liu, Xingwei [1]
Yan, Xiangyi [1]
Tang, Hao [3]
Xie, Xiaohui [1]
Affiliations
[1] University of California, Irvine, Irvine, CA 92717, USA
[2] Tencent Inc., Shenzhen, People's Republic of China
[3] Meta AI, Meta Reality Labs, Menlo Park, CA, USA
Source
Keywords
Vision transformer; Token pruning; Human pose estimation; Multi-view pose estimation
DOI
10.1007/978-3-031-20065-6_25
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recently, the vision transformer and its variants have played an increasingly important role in both monocular and multi-view human pose estimation. By treating image patches as tokens, transformers can model global dependencies within an entire image or across images from other views. However, global attention is computationally expensive, which makes it difficult to scale these transformer-based methods to high-resolution features and many views. In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which locates a rough human mask and performs self-attention only within the selected tokens. Furthermore, we extend PPT to multi-view human pose estimation. Built upon PPT, we propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates. Experimental results on COCO and MPII demonstrate that PPT can match the accuracy of previous pose transformer methods while reducing computation. Moreover, experiments on Human3.6M and Ski-Pose demonstrate that our Multi-view PPT can efficiently fuse cues from multiple views and achieve new state-of-the-art results. Source code and trained models can be found at https://github.com/HowieMa/PPT.
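The idea of restricting self-attention to selected tokens can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the per-token `foreground_score`, the `keep_ratio` parameter, the helper names, and the identity-projection attention are all assumptions made for illustration, since the abstract does not specify how tokens are scored or selected.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head self-attention with identity Q/K/V projections
    # (a simplification for illustration only).
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def prune_tokens(tokens, scores, keep_ratio):
    # Keep the highest-scoring tokens, preserving their original order.
    # Returns (kept_tokens, kept_indices).
    n_keep = max(1, int(round(keep_ratio * len(tokens))))
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep_idx], keep_idx

rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(64, 16))  # e.g. an 8x8 grid of patch tokens, dim 16
foreground_score = rng.random(64)         # hypothetical per-token "human area" score

kept, kept_idx = prune_tokens(patch_tokens, foreground_score, keep_ratio=0.25)
attended = self_attention(kept)           # attention runs over 16 tokens, not 64
```

With these assumed sizes, the pairwise attention matrix shrinks from 64 x 64 = 4096 entries to 16 x 16 = 256, which is the kind of saving that makes higher-resolution features and more views tractable.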
Pages: 424-442
Number of pages: 19