Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization

Authors
Xylogiannis, Paris [1]
Vryzas, Nikolaos [1]
Vrysis, Lazaros [1]
Dimoulas, Charalampos [1]
Affiliations
[1] Aristotle University of Thessaloniki, Multidisciplinary Media & Mediated Communication Research Group, Thessaloniki 54636, Greece
Keywords
speaker diarization; sound localization; AI-enabled systems; multimodal decision making; deep learning; smartphones
DOI
10.3390/s24134229
Chinese Library Classification (CLC) number
O65 [Analytical Chemistry]
Discipline classification codes
070302; 081704
Abstract
Speaker diarization answers the question of "who spoke when" in audio recordings. In meeting scenarios, labeling audio with the corresponding speaker identities can be further assisted by exploiting spatial features. This work proposes a framework for assessing the effectiveness of combining speaker embeddings with Time Difference of Arrival (TDOA) values obtained from the microphone sensor arrays available in meetings. We extract speaker embeddings using two popular and robust pre-trained models, ECAPA-TDNN and X-vectors, and calculate the TDOA values via the Generalized Cross-Correlation (GCC) method with Phase Transform (PHAT) weighting. Although ECAPA-TDNN outperforms the X-vectors model, we use both speaker embedding models to explore whether a computationally lighter model can be employed when spatial information is exploited. Various techniques for combining the spatiotemporal information are examined in order to determine the best clustering method. The proposed framework is evaluated on two multichannel datasets: the AVLab Speaker Localization dataset and a multichannel dataset (SpeaD-M3C) enriched, in the context of the present work, with supplementary information from smartphone recordings. Our results strongly indicate that integrating spatial information can significantly improve the performance of state-of-the-art deep learning diarization models, yielding a 2-3% reduction in Diarization Error Rate (DER) compared to the baseline approach on the evaluated datasets.
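The abstract names GCC with PHAT weighting as the way TDOA values are obtained. As a rough illustration of that general technique (not the authors' implementation), the sketch below estimates the delay between two microphone channels with NumPy. The function name `gcc_phat`, its parameters, and the small constant added to the denominator are assumptions made for this example only.

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref`, in seconds.

    Hypothetical helper for illustration: a positive result means
    `sig` arrives later than `ref`.
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(sig) + len(ref)
    sig_f = np.fft.rfft(sig, n=n)
    ref_f = np.fft.rfft(ref, n=n)

    # Cross-power spectrum, normalized to unit magnitude (PHAT weighting).
    cross = sig_f * np.conj(ref_f)
    cross /= np.abs(cross) + 1e-12  # small constant avoids division by zero

    # Back to the lag domain: the generalized cross-correlation.
    cc = np.fft.irfft(cross, n=n)

    # Optionally restrict the search to physically plausible lags.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so negative lags come first and lag 0 sits at index max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

In a meeting setup, a function like this would be called once per microphone pair and analysis window, and the resulting delays could then be combined with the speaker embeddings before clustering. How the two are combined is the subject of the paper and is not reproduced here.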
Pages: 14