Multimodal Speaker Diarization

Cited by: 38
Authors
Noulas, Athanasios [1]
Englebienne, Gwenn [1]
Krose, Ben J. A. [1]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
Keywords
Speaker diarization; dynamic Bayesian networks; audiovisual fusion
DOI
10.1109/TPAMI.2011.47
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We present a novel probabilistic framework that fuses information coming from the audio and video modality to perform speaker diarization. The proposed framework is a Dynamic Bayesian Network (DBN) that is an extension of a factorial Hidden Markov Model (fHMM) and models the people appearing in an audiovisual recording as multimodal entities that generate observations in the audio stream, the video stream, and the joint audiovisual space. The framework is very robust to different contexts, makes no assumptions about the location of the recording equipment, and does not require labeled training data as it acquires the model parameters using the Expectation Maximization (EM) algorithm. We apply the proposed model to two meeting videos and a news broadcast video, all of which come from publicly available data sets. The results acquired in speaker diarization are in favor of the proposed multimodal framework, which outperforms the single modality analysis results and improves over the state-of-the-art audio-based speaker diarization.
Pages: 79-93
Page count: 15