Audio-Visual Cross-Attention Network for Robotic Speaker Tracking

Cited by: 5
Authors
Qian, Xinyuan [1 ,2 ,3 ]
Wang, Zhengdong [4 ]
Wang, Jiadong [4 ]
Guan, Guohui [5 ]
Li, Haizhou [3 ,4 ,6 ,7 ]
Affiliations
[1] Univ Sci & Technol Beijing, Dept Comp Sci & Technol, Beijing 100083, Peoples R China
[2] Chinese Univ Hong Kong, Shenzhen 518172, Peoples R China
[3] Shenzhen Res Inst Big Data, Shenzhen 518172, Peoples R China
[4] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore
[5] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94702 USA
[6] Chinese Univ Hong Kong, Guangdong Prov Key Lab Big Data Comp, Shenzhen 518172, Peoples R China
[7] Univ Bremen, D-28359 Bremen, Germany
Keywords
Speaker tracking; direction-of-arrival; audio-visual fusion; cross-modal attention; neural networks; localization; noise
DOI
10.1109/TASLP.2022.3226330
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Audio-visual signals can be used jointly for robotic perception, as the two modalities complement each other. Such multi-modal sensory fusion has a clear advantage, especially under noisy acoustic conditions. Speaker localization, an essential robotic function, was traditionally treated as a signal processing problem and now increasingly finds deep learning solutions. The open question is how to fuse audio-visual signals effectively. Speaker tracking is not only more desirable but also potentially more accurate than speaker localization, because it exploits the speaker's temporal motion dynamics for smoothed trajectory estimation. However, due to the lack of large annotated datasets, speaker tracking is not as well studied as speaker localization. In this paper, we study robotic speaker Direction of Arrival (DoA) estimation with a focus on audio-visual fusion and tracking methodology. We propose a Cross-Modal Attentive Fusion (CMAF) mechanism that uses self-attention to learn intra-modal temporal dependencies and cross-attention for inter-modal alignment. We also collect a realistic dataset on a robotic platform to support the study. The experimental results demonstrate that the proposed network outperforms state-of-the-art audio-visual localization and tracking methods under noisy conditions, improving accuracy by 5.82% and 3.62%, respectively, at SNR = -20 dB.
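To make the fusion idea concrete, below is a minimal PyTorch sketch of a cross-modal attentive fusion block in the spirit of CMAF: per-modality self-attention captures intra-modal temporal dependencies, and cross-attention lets each modality attend to the other for inter-modal alignment. The class name CrossModalAttentiveFusion, the feature dimensions, the residual/normalization layout, and the concatenation-based fusion are illustrative assumptions, not the paper's released implementation.

# Illustrative sketch only; hypothetical shapes and layer choices,
# not the authors' code.
import torch
import torch.nn as nn

class CrossModalAttentiveFusion(nn.Module):
    """Self-attention within each modality, then cross-attention
    across modalities, per the high-level description in the abstract."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Intra-modal self-attention: temporal dependencies per modality.
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modal cross-attention: each modality attends to the other.
        self.a2v_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, video):
        # audio, video: (batch, time, dim) feature sequences.
        a, _ = self.audio_self(audio, audio, audio)   # intra-modal
        v, _ = self.video_self(video, video, video)
        a = self.norm_a(audio + a)                    # residual + norm
        v = self.norm_v(video + v)
        # Audio queries attend to visual keys/values, and vice versa,
        # aligning the two streams in time.
        a2v, _ = self.a2v_cross(a, v, v)
        v2a, _ = self.v2a_cross(v, a, a)
        # Fuse the attended streams, here by concatenation.
        return torch.cat([a + a2v, v + v2a], dim=-1)  # (batch, time, 2*dim)

# Example usage with random features standing in for real embeddings.
fusion = CrossModalAttentiveFusion(dim=256, heads=4)
audio_feat = torch.randn(2, 50, 256)    # e.g. acoustic frame features
video_feat = torch.randn(2, 50, 256)    # e.g. visual frame features
fused = fusion(audio_feat, video_feat)  # (2, 50, 512)

In a tracking pipeline, the fused sequence would typically feed a DoA regression or classification head over azimuth angles; that downstream head is outside the scope of this sketch.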
Pages: 550 - 562
Page count: 13
Related papers
50 items in total
  • [1] Audio-Visual Speaker Verification via Joint Cross-Attention
    Rajasekhar, Gnana Praveen
    Alam, Jahangir
    [J]. SPEECH AND COMPUTER, SPECOM 2023, PT II, 2023, 14339 : 18 - 31
  • [2] Multi-scale network with shared cross-attention for audio-visual correlation learning
    Zhang, Jiwei
    Yu, Yi
    Tang, Suhua
    Li, Wei
    Wu, Jianming
    [J]. NEURAL COMPUTING & APPLICATIONS, 2023, 35 (27): 20173 - 20187
  • [3] CASA-Net: Cross-attention and Self-attention for End-to-End Audio-visual Speaker Diarization
    Zhou, Haodong
    Li, Tao
    Wang, Jie
    Li, Lin
    Hong, Qingyang
    [J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 102 - 106
  • [4] A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
    Praveen, R. Gnana
    de Melo, Wheidima Carneiro
    Ullah, Nasib
    Aslam, Haseeb
    Zeeshan, Osama
    Denorme, Theo
    Pedersoli, Marco
    Koerich, Alessandro L.
    Bacon, Simon
    Cardinal, Patrick
    Granger, Eric
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 2485 - 2494
  • [5] Audio-visual Speaker Recognition with a Cross-modal Discriminative Network
    Tao, Ruijie
    Das, Rohan Kumar
    Li, Haizhou
    [J]. INTERSPEECH 2020, 2020, : 2242 - 2246
  • [6] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
    Li, Yidi
    Liu, Hong
    Tang, Hao
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1456 - 1463
  • [7] Audio-visual speaker tracking with importance particle filters
    Gatica-Perez, D
    Lathoud, G
    McCowan, I
    Odobez, JM
    Moore, D
    [J]. 2003 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL 3, PROCEEDINGS, 2003, : 25 - 28
  • [8] Audio-Visual Saliency Network with Audio Attention Module
    Cheng, Shuaiyang
    Gao, Xing
    Song, Liang
    Xiahou, Jianbing
    [J]. PROCEEDINGS OF 2021 2ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND INFORMATION SYSTEMS (ICAIIS '21), 2021,
  • [9] Neural Speaker Extraction with Speaker-Speech Cross-Attention Network
    Wang, Wupeng
    Xu, Chenglin
    Ge, Meng
    Li, Haizhou
    [J]. INTERSPEECH 2021, 2021, : 3535 - 3539
  • [10] Speaker Tracking Based on Audio-Visual Fusion with Unknown Noise
    Cao, Jie
    Li, Jun
    Li, Wei
    [J]. PROCEEDINGS OF 2013 CHINESE INTELLIGENT AUTOMATION CONFERENCE: INTELLIGENT INFORMATION PROCESSING, 2013, 256 : 215 - 226