Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation

Cited by: 15
Authors
Rahimi, Akam [1 ]
Afouras, Triantafyllos [1 ]
Zisserman, Andrew [1 ]
Affiliation
[1] Univ Oxford, Dept Engn Sci, VGG, Oxford, England
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
DOI
10.1109/CVPR52688.2022.01024
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities to solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and, (iv) we obtain state-of-the-art performance on the well-established benchmark datasets LRS2 and LRS3.
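The abstract states that a Transformer-based architecture fuses the different modalities (lip movements, face identity, text) with the audio stream, but the record does not include the model itself. As a rough illustration only, the sketch below shows the generic cross-attention pattern commonly used for this kind of conditioning: audio frames act as queries and the conditioning tokens as keys/values. This is a minimal single-head NumPy sketch with no learned projections; all names are hypothetical and it is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(audio_feats, cond_feats):
    """Fuse conditioning features (e.g. lip or text embeddings) into
    audio features. audio_feats: (T_a, d), cond_feats: (T_c, d).
    Returns a (T_a, d) tensor: one conditioning summary per audio frame."""
    d = audio_feats.shape[-1]
    scores = audio_feats @ cond_feats.T / np.sqrt(d)   # (T_a, T_c)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    return weights @ cond_feats                        # (T_a, d)

# Toy example: 4 audio frames, 6 conditioning tokens, feature dim 8.
rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 8))
cond = rng.standard_normal((6, 8))
fused = cross_attention(audio, cond)
print(fused.shape)  # (4, 8)
```

In a full model this operation would be wrapped in multi-head attention with learned query/key/value projections and stacked into Transformer blocks; here it only conveys how asynchronous cues (such as a sentence transcript) can condition a per-frame audio representation without requiring temporal alignment.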
Pages: 10483 - 10492
Page count: 10
Related Papers
(50 in total)
  • [1] Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation
    Li, Chenda
    Qian, Yanmin
    INTERSPEECH 2020, 2020, : 1426 - 1430
  • [2] Multi-Modal Multi-Channel Target Speech Separation
    Gu, Rongzhi
    Zhang, Shi-Xiong
    Xu, Yong
    Chen, Lianwu
    Zou, Yuexian
    Yu, Dong
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2020, 14 (03) : 530 - 541
  • [3] Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
    Ephrat, Ariel
    Mosseri, Inbar
    Lang, Oran
    Dekel, Tali
    Wilson, Kevin
    Hassidim, Avinatan
    Freeman, William T.
    Rubinstein, Michael
    ACM TRANSACTIONS ON GRAPHICS, 2018, 37 (04):
  • [4] Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement
    Xiong, Junwen
    Zhou, Yu
    Zhang, Peng
    Xie, Lei
    Huang, Wei
    Zha, Yufei
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5800 - 5812
  • [5] Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation
    Wang, Xiaoyu
    Kong, Xiangyu
    Peng, Xiulian
    Lu, Yan
    INTERSPEECH 2022, 2022, : 886 - 890
  • [6] A review on speech separation in cocktail party environment: challenges and approaches
    Agrawal, Jharna
    Gupta, Manish
    Garg, Hitendra
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (20) : 31035 - 31067
  • [7] Multi-modal Attention for Speech Emotion Recognition
    Pan, Zexu
    Luo, Zhaojie
    Yang, Jichen
    Li, Haizhou
    INTERSPEECH 2020, 2020, : 364 - 368
  • [8] SEANet: A Multi-modal Speech Enhancement Network
    Tagliasacchi, Marco
    Li, Yunpeng
    Misiunas, Karolis
    Roblek, Dominik
    INTERSPEECH 2020, 2020, : 1126 - 1130
  • [9] Multi-text multi-modal reading processes and comprehension
    Cromley, Jennifer G.
    Kunze, Andrea J.
    Dane, Aygul Parpucu
    LEARNING AND INSTRUCTION, 2021, 71