Audio-Visual Deep Neural Network for Robust Person Verification

被引:22
|
作者
Qian, Yanmin [1 ,2 ]
Chen, Zhengyang [1 ,2 ]
Wang, Shuai [1 ,2 ]
机构
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, SpeechLab, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, AI Inst, Shanghai 200240, Peoples R China
关键词
Faces; Face recognition; Visualization; Neural networks; Noise measurement; Robustness; Feature extraction; Audio-visual deep neural network; data augmentation; multi-modal system; person verification;
D O I
10.1109/TASLP.2021.3057230
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
Voice and face are two most popular biometrics for person verification, usually used in speaker verification and face verification tasks. It has already been observed that simply combining the information from these two modalities can lead to a more powerful and robust person verification system. In this article, to fully explore the multi-modal learning strategies for person verification, we proposed three types of audio-visual deep neural network (AVN), including feature level AVN (AVN-F), embedding level AVN (AVN-E), and embedding level combination with joint learning AVN (AVN-J). To further enhance the system robustness in real noisy conditions where not both modalities can be accessed with high-quality, we proposed several data augmentation strategies for each proposed AVN: A feature-level multi-modal data augmentation is proposed for AVN-F and an embedding-level data augmentation with novel noise distribution matching is designed for AVN-E. For AVN-J, both the feature and embedding level multi-modal data augmentation methods can be applied. All the proposed models are trained on the VoxCeleb2 dev dataset and evaluated on the standard VoxCeleb1 dataset, and the best system achieves 0.558, 0.441% and 0.793% EER on the three official trial lists of VoxCeleb1, which is to our knowledge the best published single system results on this corpus for person verification. To validate the robustness of the proposed approaches, a noisy evaluation set based on the VoxCeleb1 is constructed, and experimental results show that the proposed system can significantly boost the system robustness and still show promising performance under this noisy scenario.
引用
收藏
页码:1079 / 1092
页数:14
相关论文
共 50 条
  • [1] A Deep Neural Network for Audio-Visual Person Recognition
    Alam, Mohammad Rafiqul
    Bennamoun, Mohammed
    Togneri, Roberto
    Sohel, Ferdous
    [J]. 2015 IEEE 7TH INTERNATIONAL CONFERENCE ON BIOMETRICS THEORY, APPLICATIONS AND SYSTEMS (BTAS 2015), 2015,
  • [2] Scalability analysis of audio-visual person identity verification
    Czyz, J
    Bengio, S
    Marcel, C
    Vandendorpe, L
    [J]. AUDIO-AND VIDEO-BASED BIOMETRIC PERSON AUTHENTICATION, PROCEEDINGS, 2003, 2688 : 752 - 760
  • [3] DEEP AUDIO-VISUAL FUSION NEURAL NETWORK FOR SALIENCY ESTIMATION
    Yao, Shunyu
    Min, Xiongkuo
    Zhai, Guangtao
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 1604 - 1608
  • [4] Audio-Visual Fusion Based on Interactive Attention for Person Verification
    Jing, Xuebin
    He, Liang
    Song, Zhida
    Wang, Shaolei
    [J]. SENSORS, 2023, 23 (24)
  • [5] Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition
    Zhang, Shiqing
    Zhang, Shiliang
    Huang, Tiejun
    Gao, Wen
    [J]. ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, : 281 - 284
  • [6] NOISE-TOLERANT AUDIO-VISUAL ONLINE PERSON VERIFICATION USING AN ATTENTION-BASED NEURAL NETWORK FUSION
    Shon, Suwon
    Oh, Tae-Hyun
    Glass, James
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 3995 - 3999
  • [7] Audio-Visual (Multimodal) Speech Recognition System Using Deep Neural Network
    Paulin, Hebsibah
    Milton, R. S.
    JanakiRaman, S.
    Chandraprabha, K.
    [J]. JOURNAL OF TESTING AND EVALUATION, 2019, 47 (06) : 3963 - 3974
  • [8] AUDIO-VISUAL DEEP LEARNING FOR NOISE ROBUST SPEECH RECOGNITION
    Huang, Jing
    Kingsbury, Brian
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 7596 - 7599
  • [9] Audio-Visual Kinship Verification in the Wild
    Wu, Xiaoting
    Granger, Eric
    Kinnunen, Tomi
    Feng, Xiaoyi
    Hadid, Abdenour
    [J]. 2019 INTERNATIONAL CONFERENCE ON BIOMETRICS (ICB), 2019,
  • [10] Audio-Visual Sensor Fusion Framework Using Person Attributes Robust to Missing Visual Modality for Person Recognition
    John, Vijay
    Kawanishi, Yasutomo
    [J]. MULTIMEDIA MODELING, MMM 2023, PT II, 2023, 13834 : 523 - 535