NOISE-TOLERANT AUDIO-VISUAL ONLINE PERSON VERIFICATION USING AN ATTENTION-BASED NEURAL NETWORK FUSION

Cited by: 0

Authors:

Shon, Suwon [1]

Oh, Tae-Hyun [1]

Glass, James [1]

Affiliations:

[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA

Source:

2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019

Keywords:

person verification; recognition; multi-modal; cross-modal; attention; missing data

DOI:

Not available

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

In this paper, we present a multi-modal online person verification system using both speech and visual signals. Inspired by neuroscientific findings on the association of voice and face, we propose an attention-based end-to-end neural network that learns multi-sensory association for the task of person verification. The attention mechanism in our proposed network learns to conditionally select a salient modality between speech and facial representations that provides a balance between complementary inputs. By virtue of this capability, the network is robust to missing or corrupted data from either modality. In the VoxCeleb2 dataset, we show that our method performs favorably against competing multi-modal methods. Even for extreme cases of large corruption or missing data on either modality, our method demonstrates robustness over other unimodal methods.
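The abstract describes an attention mechanism that weighs the speech and face representations against each other. As a rough illustration only, the sketch below fuses two equal-length embeddings with softmax attention weights computed from their concatenation. It is not the authors' architecture: the function names, the parameters `w` and `b`, and the choice of a single linear scoring layer are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(speech_emb, face_emb, w, b):
    """Fuse two equal-length embeddings with attention weights.

    w (two weight rows over the concatenated embeddings) and b (two biases)
    are hypothetical learned parameters, not values from the paper.
    """
    concat = list(speech_emb) + list(face_emb)
    # One linear score per modality, computed from both embeddings.
    scores = [
        sum(wi * x for wi, x in zip(row, concat)) + bi
        for row, bi in zip(w, b)
    ]
    alpha_speech, alpha_face = softmax(scores)
    # Weighted sum: a modality with a low weight contributes little.
    fused = [
        alpha_speech * s + alpha_face * f
        for s, f in zip(speech_emb, face_emb)
    ]
    return fused, (alpha_speech, alpha_face)


# Illustrative call with made-up numbers; the biases favour speech here.
fused, weights = attention_fuse(
    [1.0, 0.0], [0.0, 1.0],
    w=[[0.0] * 4, [0.0] * 4],
    b=[2.0, -2.0],
)
```

Under this assumed design, a corrupted or missing modality can be down-weighted by driving its score low, so the fused vector is dominated by the cleaner input.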


Pages: 3995-3999

Page count: 5

50 records in total

[1] Audio-Visual Fusion Based on Interactive Attention for Person Verification
Jing, Xuebin
He, Liang
Song, Zhida
Wang, Shaolei
[J]. SENSORS, 2023, 23 (24)
[2] Attention-Based Audio-Visual Fusion for Video Summarization
Fang, Yinghong
Zhang, Junpeng
Lu, Cewu
[J]. NEURAL INFORMATION PROCESSING (ICONIP 2019), PT II, 2019, 11954 : 328 - 340
[3] Noise-Tolerant Learning for Audio-Visual Action Recognition
Han, Haochen
Zheng, Qinghua
Luo, Minnan
Miao, Kaiyao
Tian, Feng
Chen, Yan
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 7761 - 7774
[4] Audio-Visual Deep Neural Network for Robust Person Verification
Qian, Yanmin
Chen, Zhengyang
Wang, Shuai
[J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 1079 - 1092
[5] Attention Fusion for Audio-Visual Person Verification Using Multi-Scale Features
Hoermann, Stefan
Moiz, Abdul
Knoche, Martin
Rigoll, Gerhard
[J]. 2020 15TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG 2020), 2020, : 281 - 285
[6] Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition
Sterpu, George
Saam, Christian
Harte, Naomi
[J]. ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2018, : 111 - 115
[7] A Deep Neural Network for Audio-Visual Person Recognition
Alam, Mohammad Rafiqul
Bennamoun, Mohammed
Togneri, Roberto
Sohel, Ferdous
[J]. 2015 IEEE 7TH INTERNATIONAL CONFERENCE ON BIOMETRICS THEORY, APPLICATIONS AND SYSTEMS (BTAS 2015), 2015,
[8] Multi-Attention Audio-Visual Fusion Network for Audio Spatialization
Zhang, Wen
Shao, Jie
[J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 394 - 401
[9] Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection
Kim, Ui-Hyun
[J]. INTERSPEECH 2021, 2021, : 326 - 330
[10] Fuzzy-Neural-Network Based Audio-Visual Fusion for Speech Recognition
Wu, Gin-Der
Tsai, Hao-Shu
[J]. 2019 1ST INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE IN INFORMATION AND COMMUNICATION (ICAIIC 2019), 2019, : 210 - 214
