MICROSOFT SPEAKER DIARIZATION SYSTEM FOR THE VOXCELEB SPEAKER RECOGNITION CHALLENGE 2020

Cited by: 26

Authors:

Xiao, Xiong [1]

Kanda, Naoyuki [1]

Chen, Zhuo [1]

Zhou, Tianyan [1]

Yoshioka, Takuya [1]

Chen, Sanyuan [1]

Zhao, Yong [1]

Liu, Gang [1]

Wu, Yu [1]

Wu, Jian [1]

Liu, Shujie [1]

Li, Jinyu [1]

Gong, Yifan [1]

Affiliations:

[1] Microsoft Corp, Redmond, WA 98052 USA

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021

Keywords:

speaker diarization; speaker recognition; speech separation; system fusion

DOI:

10.1109/ICASSP39728.2021.9413832

Chinese Library Classification number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated in the diarization track of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We first explain the system design, which addresses issues in handling real multi-talker recordings. We then present the details of the components, which include a Res2Net-based speaker embedding extractor, conformer-based continuous speech separation with leakage filtering, and a modified DOVER (Diarization Output Voting Error Reduction) method for system fusion. We evaluate the systems on the data set provided by the VoxSRC 2020 challenge, which contains real-life multi-talker audio collected from YouTube. Our best system achieves a diarization error rate (DER) of 3.71% on the development set and 6.23% on the evaluation set, ranking first in the diarization track of the challenge.


Pages: 5824-5828

Page count: 5
