MAAS: Multi-modal Assignation for Active Speaker Detection

Cited by: 17

Authors:

Leon Alcazar, Juan [1]

Heilbron, Fabian Caba [2]

Thabet, Ali K. [1]

Ghanem, Bernard [1]

Affiliations:

[1] King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia

[2] Adobe Res, San Jose, CA USA

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021

Keywords:

DIARIZATION;

DOI:

10.1109/ICCV48922.2021.00033

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Active speaker detection requires a mindful integration of multi-modal cues. Current methods focus on modeling and fusing short-term audiovisual features for individual speakers, often at frame level. We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem and provides a straightforward strategy, where independent visual features (speakers) in the scene are assigned to a previously detected speech event. Our experiments show that a small graph data structure built from local information can approximate an instantaneous audio-visual assignment problem. Moreover, the temporal extension of this initial graph achieves a new state-of-the-art performance on the AVA-ActiveSpeaker dataset with a mAP of 88.8%.


Pages: 265-274

Page count: 10
