Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model

Cited by: 9
Authors
Ahmad, Rehan [1 ]
Zubair, Syed [2 ]
Alquhayz, Hani [3 ]
Ditta, Allah [4 ]
Affiliations
[1] Int Islamic Univ, Dept Elect Engn, Islamabad 44000, Pakistan
[2] Analyt Camp, Islamabad 44000, Pakistan
[3] Majmaah Univ, Dept Comp Sci & Informat, Coll Sci Zulfi, Al Majmaah 11952, Saudi Arabia
[4] Univ Educ, Div Sci & Technol, Lahore 54770, Pakistan
Keywords
speaker diarization; SyncNet; Gaussian mixture model; diarization error rate; speech activity detection; MFCC; meetings
DOI
10.3390/s19235163
Chinese Library Classification
O65 [Analytical Chemistry]
Discipline codes
070302; 081704
Abstract
Speaker diarization systems aim to find 'who spoke when?' in multi-speaker recordings. Datasets typically consist of meetings, TV/talk shows, telephone calls, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to measure the synchronization between a visible person and the accompanying audio. For this purpose, short video segments comprising face-only regions are extracted using a face detection technique and fed to the pre-trained model. The model is a two-stream network that matches audio frames with their respective visual input segments. On the basis of the high-confidence video segments inferred by the model, the corresponding audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps generate speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. The proposed method yields a significant improvement in diarization error rate (DER) over conventional and fully supervised audio-based speaker diarization. Its results are very close to those of complex state-of-the-art multimodal diarization systems, which shows the significance of such a simple yet effective technique.
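The clustering step the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: synthetic 2-D features stand in for MFCC frames, single-component diagonal Gaussians stand in for the paper's GMMs, and the hypothetical `high_conf_a`/`high_conf_b` subsets stand in for the frames that the pre-trained synchronization model confidently ties to a visible speaker.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MFCC frames from two speakers (a real system
# would extract MFCC coefficients per short audio frame).
spk_a = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
spk_b = rng.normal(loc=5.0, scale=1.0, size=(200, 2))
frames = np.vstack([spk_a, spk_b])

# Hypothetical high-confidence subsets: frames where the audio-visual
# sync model confidently matched the audio to a visible speaker.
high_conf_a = spk_a[:50]
high_conf_b = spk_b[:50]

def fit_diag_gauss(x):
    """Fit a single diagonal-covariance Gaussian (a 1-component GMM)."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def log_lik(x, mean, var):
    """Per-frame log-likelihood under a diagonal Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var).sum(axis=1)

model_a = fit_diag_gauss(high_conf_a)
model_b = fit_diag_gauss(high_conf_b)

# Diarization decision: assign every frame to the speaker model that
# scores it higher.
labels = (log_lik(frames, *model_b) > log_lik(frames, *model_a)).astype(int)
truth = np.array([0] * 200 + [1] * 200)
error_rate = float((labels != truth).mean())
print(f"frame error rate: {error_rate:.3f}")
```

With well-separated synthetic speakers the frame error rate is near zero; the paper's actual systems use multi-component GMMs on real MFCCs and score DER rather than raw frame error.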
Pages: 14