Audio-Visual Cross-Attention Network for Robotic Speaker Tracking

Cited by: 5

Authors:

Qian, Xinyuan [1,2,3]

Wang, Zhengdong [4]

Wang, Jiadong [4]

Guan, Guohui [5]

Li, Haizhou [3,4,6,7]

Affiliations:

[1] Univ Sci & Technol Beijing, Dept Comp Sci & Technol, Beijing 100083, Peoples R China

[2] Chinese Univ Hong Kong, Shenzhen 518172, Peoples R China

[3] Shenzhen Res Inst Big Data, Shenzhen 518172, Peoples R China

[4] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore

[5] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94702 USA

[6] Chinese Univ Hong Kong, Guangdong Prov Key Lab Big Data Comp, Shenzhen 518172, Peoples R China

[7] Univ Bremen, D-28359 Bremen, Germany

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2023 / Vol. 31

Keywords:

Speaker tracking; direction-of-arrival; audio-visual fusion; cross-modal attention; NEURAL-NETWORKS; LOCALIZATION; NOISE;

DOI:

10.1109/TASLP.2022.3226330

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Audio-visual signals can be used jointly for robotic perception as they complement each other. Such multi-modal sensory fusion has a clear advantage, especially under noisy acoustic conditions. Speaker localization, as an essential robotic function, was traditionally solved as a signal processing problem that now increasingly finds deep learning solutions. The question is how to fuse audio-visual signals in an effective way. Speaker tracking is not only more desirable, but also potentially more accurate than speaker localization because it explores the speaker's temporal motion dynamics for smoothed trajectory estimation. However, due to the lack of large annotated datasets, speaker tracking is not as well studied as speaker localization. In this paper, we study robotic speaker Direction of Arrival (DoA) estimation with a focus on audio-visual fusion and tracking methodology. We propose a Cross-Modal Attentive Fusion (CMAF) mechanism, which explores self-attention to learn intra-modal temporal dependencies, and a cross-attention mechanism for inter-modal alignment. We also collect a realistic dataset on a robotic platform to support the study. The experimental results demonstrate that our proposed network outperforms the state-of-the-art audio-visual localization and tracking methods under noisy conditions, with an improved accuracy of 5.82% and 3.62% at SNR = -20 dB, respectively.
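The abstract describes self-attention within each modality followed by cross-attention between modalities. The snippet below is only a minimal sketch of that general idea using plain scaled dot-product attention in NumPy. It is not the authors' CMAF network: there are no learned projections, the function names (`attention`, `cross_modal_fusion`), the feature sizes, and the concatenation step are all assumptions made for illustration, and the inputs are random stand-ins for per-frame audio and visual features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    # Scaled dot-product attention; returns (output, attention weights).
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ value, weights

def cross_modal_fusion(audio, video):
    # Hypothetical fusion step (an assumption, not the paper's architecture).
    # 1) Self-attention: each modality attends over its own time frames.
    a_self, _ = attention(audio, audio, audio)
    v_self, _ = attention(video, video, video)
    # 2) Cross-attention: audio frames query video frames, and vice versa.
    a_cross, w_av = attention(a_self, v_self, v_self)
    v_cross, w_va = attention(v_self, a_self, a_self)
    # 3) Concatenate the two aligned streams per frame.
    fused = np.concatenate([a_cross, v_cross], axis=-1)
    return fused, w_av, w_va

rng = np.random.default_rng(0)
T, D = 8, 16                      # assumed: 8 time frames, 16-dim features
audio = rng.standard_normal((T, D))
video = rng.standard_normal((T, D))
fused, w_av, w_va = cross_modal_fusion(audio, video)
print(fused.shape)
```

Each row of `w_av` is a distribution over video frames for one audio frame, which is one simple way to read "inter-modal alignment"; a trained model would add learned query, key, and value projections and a DoA prediction head on top of the fused features.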


Pages: 550-562

Page count: 13

50 items in total

[1] Audio-Visual Speaker Verification via Joint Cross-Attention
Rajasekhar, Gnana Praveen
Alam, Jahangir
[J]. SPEECH AND COMPUTER, SPECOM 2023, PT II, 2023, 14339 : 18 - 31
[2] Multi-scale network with shared cross-attention for audio-visual correlation learning
Zhang, Jiwei
Yu, Yi
Tang, Suhua
Li, Wei
Wu, Jianming
[J]. NEURAL COMPUTING & APPLICATIONS, 2023, 35 (27): 20173 - 20187
[3] CASA-Net: Cross-attention and Self-attention for End-to-End Audio-visual Speaker Diarization
Zhou, Haodong
Li, Tao
Wang, Jie
Li, Lin
Hong, Qingyang
[J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023: 102 - 106
[4] A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
Praveen, R. Gnana
de Melo, Wheidima Carneiro
Ullah, Nasib
Aslam, Haseeb
Zeeshan, Osama
Denorme, Theo
Pedersoli, Marco
Koerich, Alessandro L.
Bacon, Simon
Cardinal, Patrick
Granger, Eric
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022: 2485 - 2494
[5] Audio-visual Speaker Recognition with a Cross-modal Discriminative Network
Tao, Ruijie
Das, Rohan Kumar
Li, Haizhou
[J]. INTERSPEECH 2020, 2020: 2242 - 2246
[6] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
Li, Yidi
Liu, Hong
Tang, Hao
[J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022: 1456 - 1463
[7] Audio-visual speaker tracking with importance particle filters
Gatica-Perez, D
Lathoud, G
McCowan, I
Odobez, JM
Moore, D
[J]. 2003 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL 3, PROCEEDINGS, 2003: 25 - 28
[8] Audio-Visual Salieny Network with Audio Attention Module
Cheng, Shuaiyang
Gao, Xing
Song, Liang
Xiahou, Jianbing
[J]. PROCEEDINGS OF 2021 2ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND INFORMATION SYSTEMS (ICAIIS '21), 2021,
[9] Neural Speaker Extraction with Speaker-Speech Cross-Attention Network
Wang, Wupeng
Xu, Chenglin
Ge, Meng
Li, Haizhou
[J]. INTERSPEECH 2021, 2021: 3535 - 3539
[10] Speaker Tracking Based on Audio-Visual Fusion with Unknown Noise
Cao, Jie
Li, Jun
Li, Wei
[J]. PROCEEDINGS OF 2013 CHINESE INTELLIGENT AUTOMATION CONFERENCE: INTELLIGENT INFORMATION PROCESSING, 2013, 256 : 215 - 226
