Integration of audio-visual information for multi-speaker multimedia speaker recognition

Citations: 0
Authors
Yang, Jichen [1 ]
Chen, Fangfan [1 ]
Cheng, Yu [2 ]
Lin, Pei [3 ]
Affiliations
[1] Guangdong Polytech Normal Univ, Sch Cyber Secur, Guangzhou, Peoples R China
[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
[3] Guangdong Polytech Normal Univ, Sch Elect & Informat, Guangzhou, Peoples R China
Keywords
Multi-speaker multimedia speaker recognition; Audio information; Visual information; FACE RECOGNITION; MODEL; DIARIZATION; TRACKING; FEATURES
DOI
10.1016/j.dsp.2023.104315
CLC number
TM [Electrical Engineering]; TN [Electronic and Communication Technology]
Discipline code
0808; 0809
Abstract
Recently, multi-speaker multimedia speaker recognition (MMSR) has garnered significant attention. While prior research focused primarily on back-end, score-level fusion of audio and visual information, this study investigates techniques for integrating audio and visual cues at the front end, using representations of both the speaker's voice and face. The first method uses visual information to estimate the number of speakers, addressing the difficulty of estimating speaker counts in multi-speaker conversations, especially in noisy environments; agglomerative hierarchical clustering is then applied for speaker diarization, which proves beneficial for MMSR. This approach is termed video aiding audio fusion (VAAF). The second method introduces a ratio factor to create a multimedia vector (M-vector) that concatenates face embeddings with the x-vector, encapsulating both audio and visual cues. The resulting M-vector is then used for MMSR. We name this method video interacting audio fusion (VIAF). Experimental results on the NIST SRE 2019 audio-visual corpus show that the VAAF-based MMSR achieves 6.94% and 8.31% relative reductions in minDCF and actDCF, respectively, compared with zero-effort systems. The VIAF-based MMSR achieves 12.08% and 12.99% relative reductions in minDCF and actDCF, respectively, compared with systems that use face embeddings alone. Notably, combining both methods further improves minDCF and actDCF to 0.098 and 0.102, respectively.
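The abstract describes two front-end fusion ideas: VAAF, where a visually estimated speaker count drives agglomerative hierarchical clustering for diarization, and VIAF, where a ratio factor weights the concatenation of an x-vector and a face embedding into an M-vector. The sketch below illustrates both under stated assumptions: the function names, the length normalization, the weighting scheme, and the use of scikit-learn's AgglomerativeClustering are illustrative choices, not the paper's implementation.

```python
# Hedged sketch of the two fusion ideas summarized in the abstract.
# All names and parameters (alpha, diarize_with_visual_count, etc.) are
# illustrative assumptions; the paper's actual pipeline may differ.
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def build_m_vector(x_vector, face_embedding, alpha=0.5):
    """VIAF-style M-vector: ratio-weighted concatenation of an x-vector
    (audio) and a face embedding (visual). alpha is the assumed ratio factor."""
    # Length-normalize each modality so neither dominates by scale.
    x = x_vector / (np.linalg.norm(x_vector) + 1e-12)
    f = face_embedding / (np.linalg.norm(face_embedding) + 1e-12)
    return np.concatenate([alpha * x, (1.0 - alpha) * f])


def diarize_with_visual_count(segment_xvectors, num_speakers_from_video):
    """VAAF-style diarization: the speaker count estimated from the video
    (e.g. the number of distinct detected faces) fixes the number of AHC clusters."""
    ahc = AgglomerativeClustering(n_clusters=num_speakers_from_video)
    return ahc.fit_predict(segment_xvectors)


# Toy usage: 20 speech segments with 512-dim x-vectors; the video suggests 3 speakers.
segments = np.random.randn(20, 512)
labels = diarize_with_visual_count(segments, num_speakers_from_video=3)
m_vec = build_m_vector(np.random.randn(512), np.random.randn(512), alpha=0.6)
print(labels.shape, m_vec.shape)  # (20,) (1024,)
```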
Pages: 10
Related papers
50 records in total
  • [1] Multi-Speaker Audio-Visual Corpus RUSAVIC: Russian Audio-Visual Speech in Cars
    Ivanko, Denis
    Ryumin, Dmitry
    Axyonov, Alexandr
    Kashevnik, Alexey
    Karpov, Alexey
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1555 - 1559
  • [2] Audio-Visual Multi-Speaker Tracking Based On the GLMB Framework
    Lin, Shoufeng
    Qian, Xinyuan
    INTERSPEECH 2020, 2020, : 3082 - 3086
  • [3] ACCOUNTING FOR ROOM ACOUSTICS IN AUDIO-VISUAL MULTI-SPEAKER TRACKING
    Ban, Yutong
    Li, Xiaofei
    Alameda-Pineda, Xavier
    Girin, Laurent
    Horaud, Radu
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6553 - 6557
  • [4] Multi-Speaker Tracking From an Audio-Visual Sensing Device
    Qian, Xinyuan
    Brutti, Alessio
    Lanz, Oswald
    Omologo, Maurizio
    Cavallaro, Andrea
    IEEE TRANSACTIONS ON MULTIMEDIA, 2019, 21 (10) : 2576 - 2588
  • [5] Speaker independent audio-visual speech recognition
    Zhang, Y
    Levinson, S
    Huang, T
    2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 1073 - 1076
  • [6] Multifactor fusion for audio-visual speaker recognition
    Chetty, Girija
    Tran, Dat
    LECTURE NOTES IN SIGNAL SCIENCE, INTERNET AND EDUCATION (SSIP'07/MIV'07/DIWEB'07), 2007, : 70 - +
  • [7] Audio-visual system for robust speaker recognition
    Chen, Q
    Yang, JG
    Gou, J
    MLMTA '05: Proceedings of the International Conference on Machine Learning Models Technologies and Applications, 2005, : 97 - 103
  • [8] Particle Flow SMC-PHD Filter for Audio-Visual Multi-speaker Tracking
    Liu, Yang
    Wang, Wenwu
    Chambers, Jonathon
    Kilic, Volkan
    Hilton, Adrian
    LATENT VARIABLE ANALYSIS AND SIGNAL SEPARATION (LVA/ICA 2017), 2017, 10169 : 344 - 353
  • [9] Audio-Visual Particle Flow SMC-PHD Filtering for Multi-Speaker Tracking
    Liu, Yang
    Kilic, Volkan
    Guan, Jian
    Wang, Wenwu
    IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (04) : 934 - 948
  • [10] MULTI-SPEAKER TRACKING BY FUSING AUDIO AND VIDEO INFORMATION
    Xiong, Zichao
    Liu, Hongqing
    Zhou, Yi
    Luo, Zhen
    2021 IEEE STATISTICAL SIGNAL PROCESSING WORKSHOP (SSP), 2021, : 321 - 325