Integration of audio-visual information for multi-speaker multimedia speaker recognition

Citations: 0
Authors
Yang, Jichen [1 ]
Chen, Fangfan [1 ]
Cheng, Yu [2 ]
Lin, Pei [3 ]
Affiliations
[1] Guangdong Polytech Normal Univ, Sch Cyber Secur, Guangzhou, Peoples R China
[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
[3] Guangdong Polytech Normal Univ, Sch Elect & Informat, Guangzhou, Peoples R China
Keywords
Multi-speaker multimedia speaker recognition; Audio information; Visual information; FACE RECOGNITION; MODEL; DIARIZATION; TRACKING; FEATURES;
DOI
10.1016/j.dsp.2023.104315
CLC classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline codes
0808; 0809;
Abstract
Recently, multi-speaker multimedia speaker recognition (MMSR) has garnered significant attention. Whereas prior research primarily focused on back-end score-level fusion of audio and visual information, this study investigates techniques for integrating audio and visual cues at the front-end representations of the speaker's voice and face. The first method uses visual information to estimate the number of speakers, addressing the difficulty of speaker counting in multi-speaker conversations, especially in noisy environments. Agglomerative hierarchical clustering is then employed for speaker diarization, which proves beneficial for MMSR. This approach is termed video aiding audio fusion (VAAF). The second method introduces a ratio factor to create a multimedia vector (M-vector) that concatenates face embeddings with the x-vector, encapsulating both audio and visual cues; the resulting M-vector is then used for MMSR. We term this method video interacting audio fusion (VIAF). Experimental results on the NIST SRE 2019 audio-visual corpus show that the VAAF-based MMSR achieves a 6.94% and 8.31% relative reduction in minDCF and actDCF, respectively, when benchmarked against zero-effort systems. The VIAF-based MMSR achieves a 12.08% and 12.99% relative reduction in minDCF and actDCF, respectively, compared to systems that use face embeddings alone. Notably, combining both methods further improves the minDCF and actDCF metrics, reaching 0.098 and 0.102, respectively.
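As a rough illustration of the VIAF idea described in the abstract, the sketch below builds an M-vector by weighting and concatenating an audio x-vector with a face embedding. The abstract does not specify how the ratio factor is applied; this minimal sketch assumes it linearly weights the two length-normalized embeddings before concatenation, and the 512-dimensional sizes are assumptions, not the paper's configuration.

```python
import numpy as np

def m_vector(x_vec, face_emb, ratio=0.5):
    """Hypothetical M-vector sketch: length-normalize the audio x-vector
    and the face embedding, scale them by an assumed ratio factor, and
    concatenate them into a single multimedia representation."""
    x = x_vec / np.linalg.norm(x_vec)          # normalized audio cue
    f = face_emb / np.linalg.norm(face_emb)    # normalized visual cue
    return np.concatenate([ratio * x, (1.0 - ratio) * f])

# Example with assumed 512-dim embeddings for both modalities
xv = np.random.randn(512)   # stand-in x-vector
fe = np.random.randn(512)   # stand-in face embedding
m = m_vector(xv, fe, ratio=0.6)
print(m.shape)  # (1024,)
```

The concatenated M-vector could then be scored by any standard speaker-recognition back end (e.g. cosine similarity between enrollment and test M-vectors).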
Pages: 10