Robust Self-Supervised Audio-Visual Speech Recognition

Cited by: 10

Authors:

Shi, Bowen [1]

Hsu, Wei-Ning [2]

Mohamed, Abdelrahman [2]

Affiliations:

[1] Toyota Technol Inst Chicago, Chicago, IL 61801 USA

[2] Meta AI, New York, NY USA

Source:

INTERSPEECH 2022 | 2022

Keywords:

audio-visual speech recognition; self-supervised learning; representation learning; robust speech recognition

DOI:

10.21437/Interspeech.2022-99

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On LRS3, the largest available AVSR benchmark dataset, our approach outperforms the prior state of the art by approximately 50% (28.0% vs. 14.1%) using less than 10% of the labeled data (433 hr vs. 30 hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
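
The percentages in the abstract can be checked with a short calculation. The sketch below is not from the paper; it assumes the quoted figures are word error rates read as (baseline vs. proposed) and labeled hours read as (prior work vs. this work), and the helper name relative_reduction is hypothetical.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# WER under babble noise: prior state of the art vs. the proposed approach.
babble = relative_reduction(28.0, 14.1)

# Average WER: audio-only model vs. the audio-visual model.
audio_only = relative_reduction(25.8, 5.8)

# Share of labeled data used: 30 hr out of 433 hr.
label_fraction = 30 / 433

print(f"babble-noise WER reduction: {babble:.1%}")      # 49.6%
print(f"audio-only WER reduction:   {audio_only:.1%}")  # 77.5%
print(f"labeled data used:          {label_fraction:.1%}")  # 6.9%
```

Under that reading, the three results line up with the abstract's "approximately 50%", "over 75%", and "less than 10%".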

Pages: 2118-2122

Page count: 5
