Multimodal Speaker Diarization

Cited by: 38
Authors
Noulas, Athanasios [1]
Englebienne, Gwenn [1]
Krose, Ben J. A. [1]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
Keywords
Speaker diarization; dynamic Bayesian networks; audiovisual fusion
DOI
10.1109/TPAMI.2011.47
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a novel probabilistic framework that fuses information from the audio and video modalities to perform speaker diarization. The proposed framework is a Dynamic Bayesian Network (DBN) that extends the factorial Hidden Markov Model (fHMM) and models the people appearing in an audiovisual recording as multimodal entities that generate observations in the audio stream, the video stream, and the joint audiovisual space. The framework is robust to different contexts, makes no assumptions about the location of the recording equipment, and does not require labeled training data, since it learns the model parameters with the Expectation Maximization (EM) algorithm. We apply the proposed model to two meeting videos and a news broadcast video, all of which come from publicly available data sets. In speaker diarization, the proposed multimodal framework outperforms single-modality analysis and improves over state-of-the-art audio-based speaker diarization.
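As a rough illustration of the factorial HMM structure the abstract builds on, the sketch below treats each person as an independent binary chain (0 = silent, 1 = speaking) and combines the per-person transition matrices into one joint transition matrix with a Kronecker product. This is only an assumption-laden toy: the transition probabilities are made up, the function names (`kron`, `joint_transition`, `joint_states`) are hypothetical, and the paper's actual DBN additionally models audio, video, and joint audiovisual observations and learns its parameters with EM, none of which is shown here.

```python
from itertools import product


def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]


def joint_transition(chains):
    """Joint transition matrix of independent chains (fHMM factorization)."""
    joint = [[1.0]]
    for transition in chains:
        joint = kron(joint, transition)
    return joint


def joint_states(n_chains, n_values=2):
    """Joint states in the same order as the rows of joint_transition."""
    return list(product(range(n_values), repeat=n_chains))


# Hypothetical per-person dynamics: rows are the current state, columns the next.
# State 0 = silent, state 1 = speaking. The numbers are illustrative only.
person = [
    [0.9, 0.1],  # silent   -> silent / speaking
    [0.2, 0.8],  # speaking -> silent / speaking
]

joint = joint_transition([person, person])  # two people -> 4 joint states
states = joint_states(2)

for state, row in zip(states, joint):
    print(state, [round(p, 3) for p in row])
```

With N people the joint state space has 2^N entries, which is why factorial models keep the chains separate rather than learning one large transition matrix directly.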
Pages: 79-93
Page count: 15