Integration of audio-visual information for multi-speaker multimedia speaker recognition

Cited by: 0
Authors
Yang, Jichen [1 ]
Chen, Fangfan [1 ]
Cheng, Yu [2 ]
Lin, Pei [3 ]
Affiliations
[1] Guangdong Polytech Normal Univ, Sch Cyber Secur, Guangzhou, Peoples R China
[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
[3] Guangdong Polytech Normal Univ, Sch Elect & Informat, Guangzhou, Peoples R China
Keywords
Multi-speaker multimedia speaker recognition; Audio information; Visual information; FACE RECOGNITION; MODEL; DIARIZATION; TRACKING; FEATURES;
DOI
10.1016/j.dsp.2023.104315
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic technology & communication technology];
Discipline Classification Code
0808 ; 0809 ;
Abstract
Recently, multi-speaker multimedia speaker recognition (MMSR) has garnered significant attention. Whereas prior research primarily focused on back-end score-level fusion of audio and visual information, this study investigates techniques for integrating audio and visual cues at the front end, from representations of the speaker's voice and face. The first method uses visual information to estimate the number of speakers, addressing the difficulty of speaker-count estimation in multi-speaker conversations, especially in noisy environments. Agglomerative hierarchical clustering is then employed for speaker diarization, which proves beneficial for MMSR. This approach is termed video aiding audio fusion (VAAF). The second method introduces a ratio factor to create a multimedia vector (M-vector) that concatenates face embeddings with the x-vector, encapsulating both audio and visual cues. The resulting M-vector is then leveraged for MMSR. This method is termed video interacting audio fusion (VIAF). Experimental results on the NIST SRE 2019 audio-visual corpus show that the VAAF-based MMSR achieves a 6.94% and 8.31% relative reduction in minDCF and actDCF, respectively, when benchmarked against zero-effort systems. Additionally, the VIAF-based MMSR achieves a 12.08% and 12.99% relative reduction in minDCF and actDCF, respectively, compared to systems that use face embeddings alone. Notably, when both methods are combined, the minDCF and actDCF metrics improve further, reaching 0.098 and 0.102, respectively.
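The M-vector construction described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the embedding dimensions (512 for the x-vector, 256 for the face embedding), the L2 normalization, and the default ratio value are all assumptions introduced here for clarity.

```python
import numpy as np

def m_vector(x_vector: np.ndarray, face_embedding: np.ndarray,
             ratio: float = 0.5) -> np.ndarray:
    """Concatenate a speaker x-vector and a face embedding into one M-vector.

    The ratio factor weights the audio cue against the visual cue before
    concatenation, as described (at a high level) in the abstract.
    """
    # L2-normalize each modality so neither dominates purely by scale
    # (a common convention; the paper may normalize differently).
    x = x_vector / np.linalg.norm(x_vector)
    f = face_embedding / np.linalg.norm(face_embedding)
    # Weighted concatenation: the M-vector carries both modalities.
    return np.concatenate([ratio * x, (1.0 - ratio) * f])

# Illustrative dimensions: a 512-d x-vector and a 256-d face embedding.
mv = m_vector(np.random.randn(512), np.random.randn(256), ratio=0.6)
# mv has dimension 512 + 256 = 768
```

The resulting fused vector can then be scored with any standard speaker-recognition back end (e.g., cosine similarity or PLDA), which is how the abstract suggests the M-vector is leveraged for MMSR.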
Pages: 10
Related Papers
50 items total
  • [31] Speaker Localization among multi-faces in noisy environment by audio-visual Integration
    Kim, Hyun-Don
    Choi, Jong-Suk
    Kim, Munsang
    2006 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), VOLS 1-10, 2006, : 1305 - 1310
  • [32] Speaker position detection system using audio-visual information
    Matsuo, N
    Kitagawa, H
    Nagata, S
FUJITSU SCIENTIFIC & TECHNICAL JOURNAL, 1999, 35 (02): 212 - 220
  • [33] Audio-visual Speaker Recognition via Multi-modal Correlated Neural Networks
    Geng, Jiajia
    Liu, Xin
    Cheung, Yiu-ming
    2016 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE WORKSHOPS (WIW 2016), 2016, : 123 - 128
  • [34] A MULTI-VIEW APPROACH TO AUDIO-VISUAL SPEAKER VERIFICATION
    Sari, Leda
    Singh, Kritika
    Zhou, Jiatong
    Torresani, Lorenzo
    Singhal, Nayan
    Saraf, Yatharth
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6194 - 6198
  • [35] Integration strategies for audio-visual speech processing: Applied to text-dependent speaker recognition
    Lucey, S
    Chen, TH
    Sridharan, S
    Chandran, V
    IEEE TRANSACTIONS ON MULTIMEDIA, 2005, 7 (03) : 495 - 506
  • [36] AVA ACTIVE SPEAKER: AN AUDIO-VISUAL DATASET FOR ACTIVE SPEAKER DETECTION
    Roth, Joseph
    Chaudhuri, Sourish
    Klejch, Ondrej
    Marvin, Radhika
    Gallagher, Andrew
    Kaver, Liat
    Ramaswamy, Sharadh
    Stopczynski, Arkadiusz
    Schmid, Cordelia
    Xi, Zhonghua
    Pantofaru, Caroline
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4492 - 4496
  • [37] Speaker detection using multi-speaker audio files for both enrollment and test
    Bonastre, JF
    Meignier, S
    Merlin, T
    2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL II, PROCEEDINGS: SPEECH II; INDUSTRY TECHNOLOGY TRACKS; DESIGN & IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS; NEURAL NETWORKS FOR SIGNAL PROCESSING, 2003, : 77 - 80
  • [38] Audio-visual Speaker Recognition with a Cross-modal Discriminative Network
    Tao, Ruijie
    Das, Rohan Kumar
    Li, Haizhou
    INTERSPEECH 2020, 2020, : 2242 - 2246
  • [39] Dynamic visual features for audio-visual speaker verification
    Dean, David
    Sridharan, Sridha
COMPUTER SPEECH AND LANGUAGE, 2010, 24 (02): 136 - 149
  • [40] MULTI-SPEAKER CONVERSATIONS, CROSS-TALK, AND DIARIZATION FOR SPEAKER RECOGNITION
    Sell, Gregory
    McCree, Alan
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 5425 - 5429