MODALITY ATTENTION FOR END-TO-END AUDIO-VISUAL SPEECH RECOGNITION

Cited by: 0

Authors:

Zhou, Pan [1]

Yang, Wenwen [2]

Chen, Wei [2]

Wang, Yanfeng [2]

Jia, Jia [1]

Affiliations:

[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China

[2] Sogou Inc, Voice Interact Technol Ctr, Beijing, Peoples R China

Source:

2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019

Keywords:

multimodal attention; audio-visual speech recognition; lipreading; sequence-to-sequence

DOI:

Not available

Chinese Library Classification number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Audio-visual speech recognition (AVSR) is regarded as one of the most promising solutions for robust speech recognition, especially in noisy environments. In this paper, we propose a novel multimodal attention-based method for audio-visual speech recognition that automatically learns a fused representation from both modalities based on their importance. The method is implemented with state-of-the-art sequence-to-sequence (Seq2seq) architectures. Experimental results show relative improvements of 2% to 36% over the auditory modality alone, depending on the signal-to-noise ratio (SNR). Compared with traditional feature concatenation methods, the proposed approach achieves better recognition performance under both clean and noisy conditions. We believe the modality attention-based end-to-end method can be easily generalized to other multimodal tasks with correlated information.
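
The abstract does not give the fusion equations, so the following is only a generic sketch of attention over modalities, not the authors' implementation. It assumes each modality has already been encoded into a feature vector of the same size for one time step, and that a single scoring vector (here the hypothetical `score_weights`, which a real system would learn) yields one scalar score per modality. The scores are normalized with a softmax, and the fused representation is the weighted sum of the modality vectors. All names and numbers below are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_attention_fuse(
    features: Sequence[Sequence[float]],
    score_weights: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Fuse equal-sized per-modality feature vectors for one time step.

    features: one vector per modality, e.g. [audio, visual].
    score_weights: hypothetical scoring vector (learned in a real system).
    Returns (fused_vector, attention_weights).
    """
    # One scalar score per modality: dot product with the scoring vector.
    scores = [sum(w * x for w, x in zip(score_weights, f)) for f in features]
    # Normalize across modalities so the attention weights sum to 1.
    alphas = softmax(scores)
    dim = len(features[0])
    # Fused vector: attention-weighted sum of the modality vectors.
    fused = [sum(a * f[i] for a, f in zip(alphas, features)) for i in range(dim)]
    return fused, alphas


# Illustrative values only (not taken from the paper).
audio = [0.2, -0.1, 0.4]
visual = [0.5, 0.3, -0.2]
fused, alphas = modality_attention_fuse([audio, visual], [1.0, 0.5, -1.0])
print("attention weights:", alphas)
print("fused vector:", fused)
```

Unlike feature concatenation, which passes both vectors through unchanged, this kind of weighting lets the contribution of each modality vary per time step, for example leaning on the visual stream when the audio is noisy.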


Pages: 6565-6569

Page count: 5

