Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization

Cited by: 0
Authors
Xylogiannis, Paris [1]
Vryzas, Nikolaos [1]
Vrysis, Lazaros [1]
Dimoulas, Charalampos [1]
Affiliations
[1] Aristotle Univ Thessaloniki, Multidisciplinary Media & Mediated Commun Res Grp, Thessaloniki 54636, Greece
Keywords
speaker diarization; sound localization; AI-enabled systems; multimodal decision making; deep learning; smartphones;
DOI
10.3390/s24134229
Chinese Library Classification
O65 [Analytical Chemistry];
Subject classification codes
070302 ; 081704 ;
Abstract
Speaker diarization answers the question of "who spoke when" in audio recordings. In meeting scenarios, labeling audio with the corresponding speaker identities can be further assisted by exploiting spatial features. This work proposes a framework designed to assess the effectiveness of combining speaker embeddings with Time Difference of Arrival (TDOA) values from available microphone sensor arrays in meetings. We extract speaker embeddings using two popular and robust pre-trained models, ECAPA-TDNN and X-vectors, and calculate the TDOA values via the Generalized Cross-Correlation (GCC) method with Phase Transform (PHAT) weighting. Although ECAPA-TDNN outperforms the X-vectors model, we use both speaker embedding models to explore the potential of employing a computationally lighter model when spatial information is exploited. Various techniques for combining the spatial-temporal information are examined in order to determine the best clustering method. The proposed framework is evaluated on two multichannel datasets: the AVLab Speaker Localization dataset and a multichannel dataset (SpeaD-M3C) enriched in the context of the present work with supplementary information from smartphone recordings. Our results strongly indicate that integrating spatial information can significantly improve the performance of state-of-the-art deep learning diarization models, yielding a 2-3% reduction in Diarization Error Rate (DER) compared to the baseline approach on the evaluated datasets.
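The abstract names GCC with PHAT weighting as the TDOA estimator but gives no implementation details. The following is a minimal NumPy sketch of the textbook GCC-PHAT estimator for a single microphone pair, not the authors' code. The function name `gcc_phat`, its parameters, the 16 kHz sampling rate, and the synthetic test signal are all assumptions made for illustration.

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (in seconds) with GCC-PHAT.

    Hypothetical helper for illustration; a positive result means `sig` lags `ref`.
    """
    n = len(sig) + len(ref)  # zero-pad so circular correlation acts as linear
    sig_fft = np.fft.rfft(sig, n=n)
    ref_fft = np.fft.rfft(ref, n=n)

    # Cross-power spectrum, normalized to unit magnitude (the PHAT weighting).
    cross = sig_fft * np.conj(ref_fft)
    cross /= np.abs(cross) + 1e-12

    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if max_tau is given.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs


# Synthetic check: the second channel is the first one delayed by 5 samples.
fs = 16000  # assumed sampling rate
rng = np.random.default_rng(0)
ref = rng.standard_normal(4096)
sig = np.concatenate((np.zeros(5), ref[:-5]))
tau = gcc_phat(sig, ref, fs, max_tau=0.001)
print(f"estimated delay: {tau * fs:.0f} samples")
```

In a diarization setting, one such estimate per microphone pair and per analysis window would form the spatial feature vector that is combined with the speaker embedding; how the two are weighted and clustered is the subject of the paper and is not reproduced here.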
Pages: 14
Related papers
  • [1] Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion
    Gebru, Israel D.
    Ba, Sileye
    Li, Xiaofei
    Horaud, Radu
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2018, 40 (05) : 1086 - 1099
  • [2] SPEAKER DIARIZATION WITH UNSUPERVISED TRAINING FRAMEWORK
    Le Lan, Gael
    Meignier, Sylvain
    Charlet, Delphine
    Deleglise, Paul
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5560 - 5564
  • [3] Unsupervised deep feature embeddings for speaker diarization
    Ahmad, Rehan
    Zubair, Syed
    TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2019, 27 (04) : 3138 - 3149
  • [4] Unsupervised Speaker Diarization Using Riemannian Manifold Clustering
    Huang, Che-Wei
    Xiao, Bo
    Georgiou, Panayiotis G.
    Narayanan, Shrikanth S.
    15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4, 2014, : 567 - 571
  • [5] Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach
    Shum, Stephen H.
    Dehak, Najim
    Dehak, Reda
    Glass, James R.
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2013, 21 (10) : 2015 - 2028
  • [6] Unsupervised Compensation of Intra-Session Intra-Speaker Variability for Speaker Diarization
    Aronowitz, Hagai
    ODYSSEY 2010: THE SPEAKER AND LANGUAGE RECOGNITION WORKSHOP, 2010, : 138 - 145
  • [7] SPEAKER DIARIZATION WITH PLDA I-VECTOR SCORING AND UNSUPERVISED CALIBRATION
    Sell, Gregory
    Garcia-Romero, Daniel
    2014 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY SLT 2014, 2014, : 413 - 417
  • [8] Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning
    Bhuyan, Amit Kumar
    Dutta, Hrishikesh
    Biswas, Subir
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024
  • [9] Novel Architectures for Unsupervised Information Bottleneck Based Speaker Diarization of Meetings
    Dawalatabad, Nauman
    Madikeri, Srikanth
    Sekhar, C. Chandra
    Murthy, Hema A.
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 14 - 27
  • [10] SYSTEM FUSION AND SPEAKER LINKING FOR LONGITUDINAL DIARIZATION OF TV SHOWS
    Ferras, Marc
    Madikeri, Srikanth
    Motlicek, Petr
    Bourlard, Herve
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5495 - 5499