Robust Self-Supervised Audio-Visual Speech Recognition

被引:10
|
作者
Shi, Bowen [1 ]
Hsu, Wei-Ning [2 ]
Mohamed, Abdelrahman [2 ]
机构
[1] Toyota Technol Inst Chicago, Chicago, IL 61801 USA
[2] Meta AI, New York, NY USA
来源
关键词
audio-visual speech recognition; self-supervised learning; representation learning; robust speech recognition;
D O I
10.21437/Interspeech.2022-99
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by similar to 50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average (1).
引用
收藏
页码:2118 / 2122
页数:5
相关论文
共 50 条
  • [31] Audio-visual integration for speech recognition
    Kober, R
    Harz, U
    NEUROLOGY PSYCHIATRY AND BRAIN RESEARCH, 1996, 4 (04) : 179 - 184
  • [32] Audio-visual speech recognition by speechreading
    Zhang, XZ
    Mersereau, RM
    Clements, MA
    DSP 2002: 14TH INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING PROCEEDINGS, VOLS 1 AND 2, 2002, : 1069 - 1072
  • [33] Toward Robust Mispronunciation Detection via Audio-Visual Speech Recognition
    Karbasi, Mahdie
    Zeiler, Steffen
    Freiwald, Jan
    Kolossa, Dorothea
    ADVANCES IN COMPUTATIONAL INTELLIGENCE, IWANN 2019, PT II, 2019, 11507 : 655 - 666
  • [34] Self-Supervised Learning of Audio Representations From Audio-Visual Data Using Spatial Alignment
    Wang, Shanshan
    Politis, Archontis
    Mesaros, Annamaria
    Virtanen, Tuomas
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1467 - 1479
  • [35] Using Twin-HMM-Based Audio-Visual Speech Enhancement as a Front-End for Robust Audio-Visual Speech Recognition
    Abdelaziz, Ahmed Hussen
    Zeiler, Steffen
    Kolossa, Dorothea
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 867 - 871
  • [36] An audio-visual speech recognition system for testing new audio-visual databases
    Pao, Tsang-Long
    Liao, Wen-Yuan
    VISAPP 2006: PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON COMPUTER VISION THEORY AND APPLICATIONS, VOL 2, 2006, : 192 - +
  • [37] Robust Audio-Visual Speech Recognition Under Noisy Audio-Video Conditions
    Stewart, Darryl
    Seymour, Rowan
    Pass, Adrian
    Ming, Ji
    IEEE TRANSACTIONS ON CYBERNETICS, 2014, 44 (02) : 175 - 184
  • [38] LEARNING CONTEXTUALLY FUSED AUDIO-VISUAL REPRESENTATIONS FOR AUDIO-VISUAL SPEECH RECOGNITION
    Zhang, Zi-Qiang
    Zhang, Jie
    Zhang, Jian-Shu
    Wu, Ming-Hui
    Fang, Xin
    Dai, Li-Rong
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1346 - 1350
  • [39] Audio-Visual Speech Recognition in Noisy Audio Environments
    Palecek, Karel
    Chaloupka, Josef
    2013 36TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING (TSP), 2013, : 484 - 487
  • [40] Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos
    Xuan, Hanyu
    Wu, Zhiliang
    Yang, Jian
    Jiang, Bo
    Luo, Lei
    Alameda-Pineda, Xavier
    Yan, Yan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (07) : 4896 - 4907