Integrating Vision Transformer-Based Bilinear Pooling and Attention Network Fusion of RGB and Skeleton Features for Human Action Recognition

Authors
Yaohui Sun
Weiyao Xu
Xiaoyi Yu
Ju Gao
Ting Xia
Affiliations
[1] University of Electronic Science and Technology of China, School of Automation Engineering
[2] Zaozhuang University, School of Opto
Keywords
Human action recognition; Multi-modal; Self-attention; Feature fusion
Abstract
In this paper, we propose VT-BPAN, a novel approach that combines the capabilities of Vision Transformer (VT), bilinear pooling, and attention network fusion for effective human action recognition (HAR). The proposed methodology significantly enhances the accuracy of activity recognition through the following advancements: (1) The introduction of an effective two-stream feature pooling and fusion mechanism that combines RGB frames and skeleton data to augment the spatial–temporal feature representation. (2) The development of a spatial lightweight vision transformer that mitigates computational costs. The evaluation of this framework encompasses three widely employed video action datasets, demonstrating that the proposed approach achieves performance on par with state-of-the-art methods.
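To make the idea of fusing two feature streams by bilinear pooling concrete, the following is a minimal sketch of generic bilinear pooling on two small vectors. It is not the VT-BPAN implementation: the function name `bilinear_pool`, the toy inputs `rgb_feat` and `skel_feat`, and the signed-square-root and L2 normalization steps are assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import math


def bilinear_pool(rgb_feat, skel_feat):
    """Fuse two feature vectors via their flattened outer product.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # The outer product holds every pairwise interaction between the two modalities.
    outer = [r * s for r in rgb_feat for s in skel_feat]
    # Signed square root dampens large values while keeping the sign (assumed step).
    signed = [math.copysign(math.sqrt(abs(v)), v) for v in outer]
    # L2-normalize so the fused vector has unit length (assumed step).
    norm = math.sqrt(sum(v * v for v in signed))
    return [v / norm for v in signed] if norm > 0 else signed


# Toy inputs standing in for per-clip RGB and skeleton descriptors.
rgb_feat = [0.5, -1.0, 2.0]
skel_feat = [1.0, 4.0]
fused = bilinear_pool(rgb_feat, skel_feat)
print(len(fused))  # one entry per (RGB, skeleton) feature pair
```

The fused vector grows as the product of the two input lengths, which is one reason practical systems often use compact or low-rank variants of this operation.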
Source
International Journal of Computational Intelligence Systems, 2023, 16 (01)