MAAS: Multi-modal Assignation for Active Speaker Detection

Cited by: 17
Authors
Leon Alcazar, Juan [1 ]
Heilbron, Fabian Caba [2 ]
Thabet, Ali K. [1 ]
Ghanem, Bernard [1 ]
Affiliations
[1] King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
[2] Adobe Research, San Jose, CA, USA
Keywords
Diarization
DOI
10.1109/ICCV48922.2021.00033
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Active speaker detection requires a mindful integration of multi-modal cues. Current methods focus on modeling and fusing short-term audio-visual features for individual speakers, often at frame level. We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem and provides a straightforward strategy in which independent visual features (speakers) in the scene are assigned to a previously detected speech event. Our experiments show that a small graph data structure built from local information can approximate an instantaneous audio-visual assignment problem. Moreover, the temporal extension of this initial graph achieves new state-of-the-art performance on the AVA-ActiveSpeaker dataset, with an mAP of 88.8%.
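The abstract's core idea, linking a detected speech event to the visible speakers through a small local graph and then extending that graph through time, can be sketched concretely. The snippet below is only an illustrative sketch, not the authors' implementation: MAAS learns audio-visual embeddings and applies a graph neural network over a structure of this kind, whereas here the node naming scheme, the 128-dimensional random features, and the helper functions build_local_graph and add_temporal_edges are hypothetical, using networkx purely to show how such an assignment graph might be wired.

    import networkx as nx
    import numpy as np

    def build_local_graph(audio_feat, face_feats, frame_idx, graph=None):
        """Connect one audio (speech) node to every candidate face node in a frame.

        Illustrative only: audio_feat and face_feats are stand-ins for learned
        audio-visual embeddings.
        """
        g = graph if graph is not None else nx.Graph()
        a = ("audio", frame_idx)
        g.add_node(a, feat=audio_feat)
        for s, v in enumerate(face_feats):
            f = ("face", frame_idx, s)
            g.add_node(f, feat=v)
            # Audio-visual edge: candidate assignment of this face to the speech event.
            g.add_edge(a, f)
        return g

    def add_temporal_edges(g, frame_idx, num_speakers):
        """Extend the local graph through time by linking each node to its
        counterpart in the previous frame (audio-audio and face-face edges)."""
        prev = frame_idx - 1
        if ("audio", prev) in g:
            g.add_edge(("audio", prev), ("audio", frame_idx))
        for s in range(num_speakers):
            if ("face", prev, s) in g and ("face", frame_idx, s) in g:
                g.add_edge(("face", prev, s), ("face", frame_idx, s))

    # Toy usage: two frames, two visible speakers, 128-d random features.
    rng = np.random.default_rng(0)
    g = None
    for t in range(2):
        faces = [rng.normal(size=128) for _ in range(2)]
        g = build_local_graph(rng.normal(size=128), faces, t, g)
        add_temporal_edges(g, t, num_speakers=2)
    print(g.number_of_nodes(), g.number_of_edges())  # 6 nodes, 7 edges

Deciding which face is the active speaker then amounts to scoring the audio-visual edges of this graph; in the paper that scoring is done by a learned graph network rather than by any rule shown in this sketch.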
Pages: 265-274
Number of pages: 10
Related Papers (50 records)
  • [1] Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement. Xiong, Junwen; Zhou, Yu; Zhang, Peng; Xie, Lei; Huang, Wei; Zha, Yufei. IEEE Transactions on Multimedia, 2023, 25: 5800-5812.
  • [2] On-Line Multi-Modal Speaker Diarization. Noulas, Athanasios K.; Krose, Ben J. A. ICMI'07: Proceedings of the Ninth International Conference on Multimodal Interfaces, 2007: 350-357.
  • [3] LIMUSE: Lightweight Multi-Modal Speaker Extraction. Liu, Qinghua; Huang, Yating; Hao, Yunzhe; Xu, Jiaming; Xu, Bo. 2022 IEEE Spoken Language Technology Workshop (SLT), 2022: 488-495.
  • [4] Multi-Modal Front-End for Speaker Activity Detection in Small Meetings. Even, Jani; Heracleous, Panikos; Ishi, Carlos; Hagita, Norihiro. 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011: 536-541.
  • [5] MSDWILD: Multi-Modal Speaker Diarization Dataset in the Wild. Liu, Tao; Fang, Shuai; Xiang, Xu; Song, Hongbo; Lin, Shaoxiong; Sun, Jiaqi; Han, Tianyuan; Chen, Siyuan; Yao, Binwei; Liu, Sen; Wu, Yifei; Qian, Yanmin; Yu, Kai. Interspeech 2022: 1476-1480.
  • [6] Multi-modal Fusion Framework with Particle Filter for Speaker Tracking. Saeed, Anwar; Al-Hamadi, Ayoub; Heuer, Michael. International Journal of Future Generation Communication and Networking, 2012, 5(4): 65-76.
  • [7] Diarizing Large Corpora using Multi-modal Speaker Linking. Ferras, Marc; Masneri, Stefano; Schreer, Oliver; Bourlard, Herve. 15th Annual Conference of the International Speech Communication Association (Interspeech 2014): 602-606.
  • [8] MUSE: Multi-Modal Target Speaker Extraction with Visual Cues. Pan, Zexu; Tao, Ruijie; Xu, Chenglin; Li, Haizhou. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021): 6678-6682.
  • [9] Multi-modal Pedestrian Detection with Misalignment Based on Modal-Wise Regression and Multi-modal IoU. Wanchaitanawong, Napat; Tanaka, Masayuki; Shibata, Takashi; Okutomi, Masatoshi. Journal of Electronic Imaging, 2023, 32(1).
  • [10] Is Multi-Modal Necessarily Better? Robustness Evaluation of Multi-Modal Fake News Detection. Chen, Jinyin; Jia, Chengyu; Zheng, Haibin; Chen, Ruoxi; Fu, Chenbo. IEEE Transactions on Network Science and Engineering, 2023, 10(6): 3144-3158.