Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization

被引:0
|
作者
Xylogiannis, Paris [1 ]
Vryzas, Nikolaos [1 ]
Vrysis, Lazaros [1 ]
Dimoulas, Charalampos [1 ]
机构
[1] Aristotle Univ Thessaloniki, Multidisciplinary Media & Mediated Commun Res Grp, Thessaloniki 54636, Greece
关键词
speaker diarization; sound localization; AI-enabled systems; multimodal decision making; deep learning; smartphones;
D O I
10.3390/s24134229
中图分类号
O65 [分析化学];
学科分类号
070302 ; 081704 ;
摘要
Speaker diarization consists of answering the question of "who spoke when" in audio recordings. In meeting scenarios, the task of labeling audio with the corresponding speaker identities can be further assisted by the exploitation of spatial features. This work proposes a framework designed to assess the effectiveness of combining speaker embeddings with Time Difference of Arrival (TDOA) values from available microphone sensor arrays in meetings. We extract speaker embeddings using two popular and robust pre-trained models, ECAPA-TDNN and X-vectors, and calculate the TDOA values via the Generalized Cross-Correlation (GCC) method with Phase Transform (PHAT) weighting. Although ECAPA-TDNN outperforms the Xvectors model, we utilize both speaker embedding models to explore the potential of employing a computationally lighter model when spatial information is exploited. Various techniques for combining the spatial-temporal information are examined in order to determine the best clustering method. The proposed framework is evaluated on two multichannel datasets: the AVLab Speaker Localization dataset and a multichannel dataset (SpeaD-M3C) enriched in the context of the present work with supplementary information from smartphone recordings. Our results strongly indicate that the integration of spatial information can significantly improve the performance of state-of-the-art deep learning diarization models, presenting a 2-3% reduction in DER compared to the baseline approach on the evaluated datasets.
引用
收藏
页数:14
相关论文
共 50 条
  • [41] Speaker Diarization and Linking of Meeting Data
    Ferras, Marc
    Madikeri, Srikanth
    Bourlard, Herve
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2016, 24 (11) : 1935 - 1945
  • [42] Phone Adaptive Training for Speaker Diarization
    Bozonnet, Simon
    Vipperla, Ravichander
    Evans, Nicholas
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 494 - 497
  • [43] Speaker Diarization Using Gesture and Speech
    Gebre, Binyam Gebrekidan
    Wittenburg, Peter
    Drude, Sebastian
    Huijbregts, Marijn
    Heskes, Tom
    15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4, 2014, : 582 - 586
  • [44] Iterative PLDA Adaptation for Speaker Diarization
    Le Lan, Gael
    Charlet, Delphine
    Larcher, Anthony
    Meignier, Sylvain
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2175 - 2179
  • [45] Group Delay Functions for Speaker Diarization
    Yadav, Mohit
    Sao, Anil Kumar
    Dileep, A. D.
    Rajan, Padmanabhan
    2016 TWENTY SECOND NATIONAL CONFERENCE ON COMMUNICATION (NCC), 2016,
  • [46] An overview of automatic speaker diarization systems
    Tranter, Sue E.
    Reynolds, Douglas A.
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2006, 14 (05): : 1557 - 1565
  • [47] Multistage speaker diarization of broadcast news
    Barras, Claude
    Zhu, Xuan
    Meignier, Sylvain
    Gauvain, Jean-Luc
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2006, 14 (05): : 1505 - 1512
  • [48] Speaker Diarization using Embedding Vectors
    Toruk, Mesut
    Bilgin, Gokhan
    Serbes, Ahmet
    2020 28TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2020,
  • [49] A review on speaker diarization systems and approaches
    Moattar, M. H.
    Homayounpour, M. M.
    SPEECH COMMUNICATION, 2012, 54 (10) : 1065 - 1103
  • [50] Speaker Diarization: A Review of Recent Research
    Anguera Miro, Xavier
    Bozonnet, Simon
    Evans, Nicholas
    Fredouille, Corinne
    Friedland, Gerald
    Vinyals, Oriol
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (02): : 356 - 370