NOISE-TOLERANT AUDIO-VISUAL ONLINE PERSON VERIFICATION USING AN ATTENTION-BASED NEURAL NETWORK FUSION

Cited by: 0
Authors
Shon, Suwon [1 ]
Oh, Tae-Hyun [1 ]
Glass, James [1 ]
Affiliation
[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Keywords
person verification; recognition; multi-modal; cross-modal; attention; missing data
DOI
not available
CLC number
O42 [Acoustics]
Subject classification
070206; 082403
Abstract
In this paper, we present a multi-modal online person verification system that uses both speech and visual signals. Inspired by neuroscientific findings on the association of voice and face, we propose an attention-based end-to-end neural network that learns multi-sensory associations for person verification. The attention mechanism learns to conditionally select the more salient of the speech and facial representations, balancing the two complementary inputs. By virtue of this capability, the network is robust to missing or corrupted data in either modality. On the VoxCeleb2 dataset, our method performs favorably against competing multi-modal methods, and even under severe corruption or missing data in either modality it remains more robust than unimodal methods.
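The core idea in the abstract — an attention mechanism that weights the speech and face embeddings by their salience before fusing them — can be sketched as follows. This is a minimal illustrative implementation, not the authors' architecture: the function and variable names (`attention_fuse`, `W`, `b`) and the single-layer scoring are assumptions; the paper's network is trained end-to-end with learned parameters.

```python
import numpy as np

def attention_fuse(speech_emb, face_emb, W, b):
    """Soft attention over two modality embeddings (illustrative sketch).

    A linear layer (W, b) scores each modality from the concatenated
    context; a softmax turns the scores into weights, which mix the two
    embeddings. A corrupted modality can thus be down-weighted.
    """
    x = np.concatenate([speech_emb, face_emb])  # joint context for scoring
    scores = W @ x + b                          # one score per modality
    e = np.exp(scores - scores.max())           # numerically stable softmax
    alpha = e / e.sum()                         # attention weights, sum to 1
    fused = alpha[0] * speech_emb + alpha[1] * face_emb
    return fused, alpha

# Toy usage: a clean speech embedding and a zeroed-out (missing) face frame.
rng = np.random.default_rng(0)
d = 4
speech = rng.standard_normal(d)
face = np.zeros(d)                  # e.g. the face modality is missing
W = rng.standard_normal((2, 2 * d)) # stand-in for learned parameters
b = np.zeros(2)
fused, alpha = attention_fuse(speech, face, W, b)
print(fused.shape, alpha)
```

In the trained system the weights would be learned jointly with the embedding networks, so the softmax learns to shift mass away from a corrupted or missing modality rather than relying on a fixed rule.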
Pages: 3995-3999
Page count: 5