Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model

Cited by: 8
Authors
Ahmad, Rehan [1 ]
Zubair, Syed [2 ]
Alquhayz, Hani [3 ]
Ditta, Allah [4 ]
Affiliations
[1] Int Islamic Univ, Dept Elect Engn, Islamabad 44000, Pakistan
[2] Analyt Camp, Islamabad 44000, Pakistan
[3] Majmaah Univ, Dept Comp Sci & Informat, Coll Sci Zulfi, Al Majmaah 11952, Saudi Arabia
[4] Univ Educ, Div Sci & Technol, Lahore 54770, Pakistan
Keywords
speaker diarization; SyncNet; Gaussian mixture model; diarization error rate; speech activity detection; MFCC; MEETINGS
DOI
10.3390/s19235163
Chinese Library Classification (CLC)
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Speaker diarization systems aim to answer the question 'who spoke when?' in multi-speaker recordings. The datasets involved usually consist of meetings, TV/talk shows, telephone calls, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to measure the synchronization between a visible person and the corresponding audio. For that purpose, short video segments consisting of face-only regions are extracted using a face detection technique and fed to the pre-trained model, a two-stream network that matches audio frames with their respective visual input segments. The audio frames belonging to the video segments for which the model reports high confidence are then used to train Gaussian mixture model (GMM)-based clusters; this yields speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio-only recordings and a different 5.8 h set of multimodal recordings. The proposed method shows a significant improvement in terms of diarization error rate (DER) over conventional and fully supervised audio-based speaker diarization. Its results are very close to those of complex state-of-the-art multimodal diarization systems, which demonstrates the significance of such a simple yet effective technique.
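To make the pipeline described in the abstract concrete, the sketch below shows one way such a system could be wired together. It is a minimal reading of the abstract, not the authors' implementation: `sync_confidence` is a hypothetical wrapper around a pre-trained SyncNet-style two-stream model, `face_tracks` stands in for the output of an off-the-shelf face detector/tracker, and the confidence threshold and GMM size are illustrative values, not taken from the paper.

```python
# Hedged sketch of the abstract's pipeline (not the authors' code).
# Hypothetical pieces: `sync_confidence` (pre-trained SyncNet-style
# two-stream model) and `face_tracks` (face detector/tracker output).
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

N_MFCC = 13           # MFCC features, per the paper's keyword list
CONF_THRESHOLD = 5.0  # assumed cut-off for "high confidence" segments

def mfcc_frames(audio, sr):
    """Frame-level MFCC features, one row per audio frame."""
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=N_MFCC).T

def diarize(audio, sr, face_tracks, sync_confidence):
    """face_tracks: list of (track_id, start_s, end_s, face_clip).
    sync_confidence(face_clip, audio_clip) -> float wraps the
    pre-trained audio-visual synchronization model (hypothetical)."""
    # 1) Score each face-only video segment against its audio and
    #    keep only the segments the sync model is confident about.
    labelled = []
    for track_id, start, end, clip in face_tracks:
        seg = audio[int(start * sr):int(end * sr)]
        if sync_confidence(clip, seg) >= CONF_THRESHOLD:
            labelled.append((track_id, seg))

    # 2) Train one GMM per speaker on MFCCs of the high-confidence
    #    audio, giving speaker-specific clusters.
    gmms = {}
    for spk in set(t for t, _ in labelled):
        feats = np.vstack([mfcc_frames(seg, sr)
                           for t, seg in labelled if t == spk])
        gmms[spk] = GaussianMixture(n_components=16,
                                    covariance_type='diag').fit(feats)

    # 3) Assign every audio frame to the most likely speaker GMM.
    feats = mfcc_frames(audio, sr)
    speakers = sorted(gmms)
    scores = np.stack([gmms[s].score_samples(feats)
                       for s in speakers], axis=1)
    return [speakers[i] for i in np.argmax(scores, axis=1)]
```

The structure mirrors the abstract's logic: the synchronization model supplies speaker labels for a subset of the audio, turning unsupervised clustering into lightly supervised per-speaker GMM training, after which the trained models label the remaining frames.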
Pages: 14