Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model

Cited by: 8
Authors
Ahmad, Rehan [1 ]
Zubair, Syed [2 ]
Alquhayz, Hani [3 ]
Ditta, Allah [4 ]
Affiliations
[1] Int Islamic Univ, Dept Elect Engn, Islamabad 44000, Pakistan
[2] Analyt Camp, Islamabad 44000, Pakistan
[3] Majmaah Univ, Dept Comp Sci & Informat, Coll Sci Zulfi, Al Majmaah 11952, Saudi Arabia
[4] Univ Educ, Div Sci & Technol, Lahore 54770, Pakistan
Keywords
speaker diarization; SyncNet; Gaussian mixture model; diarization error rate; speech activity detection; MFCC; meetings
DOI
10.3390/s19235163
CLC Classification
O65 [Analytical Chemistry]
Subject Classification
070302; 081704
Abstract
Speaker diarization systems aim to answer 'who spoke when?' in multi-speaker recordings. Such datasets usually consist of meetings, TV and talk shows, telephone calls, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to measure the synchronization between a visible person and the accompanying audio. For that purpose, short video segments comprising face-only regions are extracted using a face detection technique and are then fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. The audio frames corresponding to video segments inferred with high confidence by the model are used to train Gaussian mixture model (GMM)-based clusters. This method helps generate speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. A significant improvement in diarization error rate (DER) is observed with the proposed method when compared to conventional and fully supervised audio-based speaker diarization. The results of the proposed technique are very close to those of complex state-of-the-art multimodal diarization systems, which shows the significance of such a simple yet effective technique.
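As the abstract describes, audio frames tied to video segments that the synchronization model scores with high confidence are used to train per-speaker GMM clusters on MFCC features, and the remaining frames are then assigned to the best-scoring speaker model. The following minimal sketch (not the authors' code) illustrates only that GMM step, assuming librosa and scikit-learn; the sampling rate, MFCC dimensionality, mixture size, and the confident_segments dictionary are illustrative assumptions rather than the paper's configuration.

```python
# Sketch: per-speaker GMMs trained on MFCCs from audio segments whose speaker
# identity was inferred with high confidence by an audio-visual sync model
# (e.g., SyncNet). Speech-activity detection and the sync scoring itself are omitted.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

SR = 16000          # sampling rate in Hz (assumption)
N_MFCC = 19         # MFCC dimensionality (assumption)
N_COMPONENTS = 16   # mixture components per speaker GMM (assumption)

def mfcc_features(wav, sr=SR):
    """Frame-level MFCCs, shape (n_frames, N_MFCC); wav is a 1-D float array at sr."""
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=N_MFCC).T

def train_speaker_gmms(wav, confident_segments):
    """confident_segments: {speaker_id: [(start_sec, end_sec), ...]} taken from
    the video segments the sync model scored with high confidence (hypothetical input)."""
    gmms = {}
    for spk, segs in confident_segments.items():
        feats = np.vstack([mfcc_features(wav[int(s * SR):int(e * SR)]) for s, e in segs])
        gmms[spk] = GaussianMixture(N_COMPONENTS, covariance_type="diag").fit(feats)
    return gmms

def diarize(wav, gmms):
    """Label every MFCC frame with the speaker whose GMM gives the highest log-likelihood."""
    feats = mfcc_features(wav)
    scores = np.stack([g.score_samples(feats) for g in gmms.values()])
    speakers = list(gmms.keys())
    return [speakers[i] for i in scores.argmax(axis=0)]
```

In this sketch, frame-level labels would still need to be smoothed and merged into contiguous speaker segments before computing DER.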
Pages: 14