MSDWILD: MULTI-MODAL SPEAKER DIARIZATION DATASET IN THE WILD

Cited by: 3

Authors:

Liu, Tao [1]
Fang, Shuai [2]
Xiang, Xu [2]
Song, Hongbo [2]
Lin, Shaoxiong [1]
Sun, Jiaqi [1]
Han, Tianyuan [1]
Chen, Siyuan [1]
Yao, Binwei [1]
Liu, Sen [1]
Wu, Yifei [1]
Qian, Yanmin [1]
Yu, Kai [1]

Affiliations:

[1] Shanghai Jiao Tong University, AI Institute, MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai, People's Republic of China

[2] AISpeech Ltd, Suzhou, People's Republic of China

Source:

INTERSPEECH 2022 | 2022

Funding:

National Natural Science Foundation of China

Keywords:

speaker diarization; multi-modality; audio-visual;

DOI:

10.21437/Interspeech.2022-10466

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Speaker diarization in real-world acoustic environments is a challenging task of increasing interest to both academia and industry. Although it is widely accepted that incorporating visual information benefits audio processing tasks such as speech recognition, there is currently no fully released dataset that can be used for benchmarking multi-modal speaker diarization performance in real-world environments. In this paper, we release MSDWild, a benchmark dataset for multi-modal speaker diarization in the wild. The dataset is collected from public videos, covering rich real-world scenarios and languages. All video clips are naturally shot videos without over-editing such as lens switching. Both audio and video are released. In particular, MSDWild contains a large portion of naturally overlapped speech, forming an excellent testbed for cocktail-party problem research. Furthermore, we conduct baseline experiments on the dataset using audio-only, visual-only, and audio-visual speaker diarization.


Pages: 1476-1480

Page count: 5
