Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation

Cited by: 15
Authors
Rahimi, Akam [1 ]
Afouras, Triantafyllos [1 ]
Zisserman, Andrew [1 ]
Affiliation
[1] Univ Oxford, Dept Engn Sci, VGG, Oxford, England
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
DOI
10.1109/CVPR52688.2022.01024
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities to solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and, (iv) we obtain state-of-the-art performance on the well-established benchmark datasets LRS2 and LRS3.
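The abstract states that a Transformer-based architecture fuses the different modalities (lip movements, face identity, text) with the audio stream, but the record does not include the model itself. As a rough illustration only, the sketch below shows the generic cross-attention pattern commonly used for this kind of conditioning: audio frames act as queries and the conditioning tokens as keys/values. This is a minimal single-head NumPy sketch with no learned projections; all names are hypothetical and it is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(audio_feats, cond_feats):
    """Fuse conditioning features (e.g. lip or text embeddings) into
    audio features. audio_feats: (T_a, d), cond_feats: (T_c, d).
    Returns a (T_a, d) tensor: one conditioning summary per audio frame."""
    d = audio_feats.shape[-1]
    scores = audio_feats @ cond_feats.T / np.sqrt(d)   # (T_a, T_c)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    return weights @ cond_feats                        # (T_a, d)

# Toy example: 4 audio frames, 6 conditioning tokens, feature dim 8.
rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 8))
cond = rng.standard_normal((6, 8))
fused = cross_attention(audio, cond)
print(fused.shape)  # (4, 8)
```

In a full model this operation would be wrapped in multi-head attention with learned query/key/value projections and stacked into Transformer blocks; here it only conveys how asynchronous cues (such as a sentence transcript) can condition a per-frame audio representation without requiring temporal alignment.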
Pages: 10483 - 10492
Page count: 10
Related Papers
(50 in total)
  • [1] Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation
    Li, Chenda
    Qian, Yanmin
    INTERSPEECH 2020, 2020, : 1426 - 1430
  • [2] Multi-Modal Multi-Channel Target Speech Separation
    Gu, Rongzhi
    Zhang, Shi-Xiong
    Xu, Yong
    Chen, Lianwu
    Zou, Yuexian
    Yu, Dong
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2020, 14 (03) : 530 - 541
  • [3] Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
    Ephrat, Ariel
    Mosseri, Inbar
    Lang, Oran
    Dekel, Tali
    Wilson, Kevin
    Hassidim, Avinatan
    Freeman, William T.
    Rubinstein, Michael
    ACM TRANSACTIONS ON GRAPHICS, 2018, 37 (04):
  • [4] Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement
    Xiong, Junwen
    Zhou, Yu
    Zhang, Peng
    Xie, Lei
    Huang, Wei
    Zha, Yufei
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5800 - 5812
  • [5] Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation
    Wang, Xiaoyu
    Kong, Xiangyu
    Peng, Xiulian
    Lu, Yan
    INTERSPEECH 2022, 2022, : 886 - 890
  • [6] A review on speech separation in cocktail party environment: challenges and approaches
    Agrawal, Jharna
    Gupta, Manish
    Garg, Hitendra
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (20) : 31035 - 31067
  • [7] Multi-modal Attention for Speech Emotion Recognition
    Pan, Zexu
    Luo, Zhaojie
    Yang, Jichen
    Li, Haizhou
    INTERSPEECH 2020, 2020, : 364 - 368
  • [8] SEANet: A Multi-modal Speech Enhancement Network
    Tagliasacchi, Marco
    Li, Yunpeng
    Misiunas, Karolis
    Roblek, Dominik
    INTERSPEECH 2020, 2020, : 1126 - 1130
  • [9] Multi-text multi-modal reading processes and comprehension
    Cromley, Jennifer G.
    Kunze, Andrea J.
    Dane, Aygul Parpucu
    LEARNING AND INSTRUCTION, 2021, 71