Audio-Visual Transformer Based Crowd Counting

Cited by: 16
Authors
Sajid, Usman [1 ]
Chen, Xiangyu [1 ]
Sajid, Hasan [2 ]
Kim, Taejoon [1 ]
Wang, Guanghui [3 ]
Affiliations
[1] Univ Kansas, Elect Engn & Comp Sci, Lawrence, KS 66045 USA
[2] NUST, Sch Mech & Mfg Engn, Islamabad, Pakistan
[3] Ryerson Univ, Dept Comp Sci, Toronto, ON M5B 2K3, Canada
Keywords
NETWORK;
DOI
10.1109/ICCVW54120.2021.00254
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Crowd estimation is a challenging problem. A recent study attempts to exploit auditory information to aid visual models; however, its performance is limited by the lack of an effective approach to feature extraction and integration. This paper proposes a new audio-visual multi-task network that addresses the critical challenges in crowd counting by effectively utilizing both visual and audio inputs for better modality association and productive feature extraction. The proposed network introduces auxiliary, explicit image patch-importance ranking (PIR) and patch-wise crowd estimate (PCE) information to produce a third (run-time) modality. These modalities (audio, visual, run-time) undergo a transformer-inspired cross-modality co-attention mechanism to produce the final crowd estimate. To acquire rich visual features, we propose a multi-branch structure with transformer-style fusion in between. Extensive experimental evaluations show that the proposed scheme outperforms state-of-the-art networks under all evaluation settings, with up to 33.8% improvement. We also analyze the vision-only variant of our network and empirically demonstrate its superiority over previous approaches.
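The abstract describes a transformer-inspired cross-modality co-attention over the audio, visual, and run-time modalities. As a rough illustration only (this is not the paper's actual architecture: the projection matrices, token counts, and dimensions below are made-up assumptions), generic scaled dot-product cross-attention between two modalities can be sketched as:

```python
import numpy as np

# Illustrative sketch: tokens of one modality (query) attend over another
# modality (key/value). All shapes and weights here are arbitrary choices.
rng = np.random.default_rng(0)

def cross_attention(q_feats, kv_feats, Wq, Wk, Wv):
    """Scaled dot-product attention of q_feats over kv_feats."""
    Q = q_feats @ Wq
    K = kv_feats @ Wk
    V = kv_feats @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # scaled dot product
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                 # attended features

# Toy features: 4 visual tokens and 6 audio tokens, model width 32
d = 32
visual = rng.standard_normal((4, d))
audio = rng.standard_normal((6, d))
Wq, Wk, Wv = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]

# Visual tokens attend over audio tokens; a symmetric call would let audio
# attend over visual, and likewise for the run-time (PIR/PCE) modality.
fused = cross_attention(visual, audio, Wq, Wk, Wv)
print(fused.shape)  # (4, 32)
```

In a full co-attention design, such cross-attention blocks would typically run in both directions between each pair of modalities before the fused features feed the final count regressor.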
Pages: 2249 - 2259
Page count: 11
Related papers
50 records in total
  • [1] AVSegFormer: Audio-Visual Segmentation with Transformer
    Gao, Shengyi
    Chen, Zhe
    Chen, Guo
    Wang, Wenhai
    Lu, Tong
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 11, 2024, : 12155 - 12163
  • [2] The Right to Talk: An Audio-Visual Transformer Approach
    Thanh-Dat Truong
    Chi Nhan Duong
    The De Vu
    Hoang Anh Pham
    Raj, Bhiksha
    Le, Ngan
    Khoa Luu
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1085 - 1094
  • [3] Effect of Audio-Visual Factors in the Evaluation of Crowd Noise
    Yang, Xiaoyin
    Kang, Jian
    [J]. APPLIED SCIENCES-BASEL, 2023, 13 (06):
  • [4] AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions
    Hu, Ruihan
    Mo, Qinglong
    Xie, Yuanfei
    Xu, Yongqian
    Chen, Jiaqi
    Yang, Yalun
    Zhou, Hongjian
    Tang, Zhi-Ri
    Wu, Edmond Q.
    [J]. IEEE ACCESS, 2021, 9 : 80500 - 80510
  • [6] Audio-visual event detection based on mining of semantic audio-visual labels
    Goh, KS
    Miyahara, K
    Radhakrishan, R
    Xiong, ZY
    Divakaran, A
    [J]. STORAGE AND RETRIEVAL METHODS AND APPLICATIONS FOR MULTIMEDIA 2004, 2004, 5307 : 292 - 299
  • [7] The Problems and Challenges of Managing Crowd Sourced Audio-Visual Evidence
    Lallie, Harjinder Singh
    [J]. FUTURE INTERNET, 2014, 6 (02): : 190 - 202
  • [8] Audio-Visual Action Recognition Using Transformer Fusion Network
    Kim, Jun-Hwa
    Won, Chee Sun
    [J]. APPLIED SCIENCES-BASEL, 2024, 14 (03):
  • [9] A PRE-TRAINED AUDIO-VISUAL TRANSFORMER FOR EMOTION RECOGNITION
    Minh Tran
    Soleymani, Mohammad
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4698 - 4702
  • [10] Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition
    Song, Qiya
    Sun, Bin
    Li, Shutao
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (12) : 10028 - 10038