MSDWILD: MULTI-MODAL SPEAKER DIARIZATION DATASET IN THE WILD

Cited by: 3
Authors
Liu, Tao [1 ]
Fang, Shuai [2 ]
Xiang, Xu [2 ]
Song, Hongbo [2 ]
Lin, Shaoxiong [1 ]
Sun, Jiaqi [1 ]
Han, Tianyuan [1 ]
Chen, Siyuan [1 ]
Yao, Binwei [1 ]
Liu, Sen [1 ]
Wu, Yifei [1 ]
Qian, Yanmin [1 ]
Yu, Kai [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai, China
[2] AISpeech Ltd, Suzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
speaker diarization; multi-modality; audio-visual;
DOI
10.21437/Interspeech.2022-10466
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Speaker diarization in real-world acoustic environments is a challenging task of increasing interest in both academia and industry. Although it is widely accepted that incorporating visual information benefits audio processing tasks such as speech recognition, there is currently no fully released dataset for benchmarking multi-modal speaker diarization performance in real-world environments. In this paper, we release MSDWild(*), a benchmark dataset for multi-modal speaker diarization in the wild. The dataset is collected from public videos, covering rich real-world scenarios and languages. All video clips are naturally shot, without over-editing such as lens switching. Both audio and video are released. In particular, MSDWild contains a large portion of naturally overlapped speech, forming an excellent testbed for cocktail-party problem research. Furthermore, we conduct baseline experiments on the dataset using audio-only, visual-only, and audio-visual speaker diarization.
Pages
1476-1480 (5 pages)