END-TO-END VISUAL SPEECH RECOGNITION WITH LSTMS

Citations: 0

Authors:

Petridis, Stavros [1]

Li, Zuwei [1]

Pantic, Maja [1,2]

Affiliations:

[1] Imperial Coll London, Dept Comp, London, England

[2] Univ Twente, EEMCS, Enschede, Netherlands

Source:

2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2017

Keywords:

Visual Speech Recognition; Lipreading; End-to-End Training; Long-Short Term Recurrent Neural Networks; Deep Networks;

DOI:

Not available

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Traditional visual speech recognition systems consist of two stages: feature extraction and classification. Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage. However, research on joint learning of features and classification is very limited. In this work, we present an end-to-end visual speech recognition system based on Long Short-Term Memory (LSTM) networks. To the best of our knowledge, this is the first model which simultaneously learns to extract features directly from the pixels and perform classification, and which also achieves state-of-the-art performance in visual speech classification. The model consists of two streams which extract features directly from the mouth images and the difference images, respectively. The temporal dynamics in each stream are modelled by an LSTM, and the fusion of the two streams takes place via a Bidirectional LSTM (BLSTM). An absolute improvement of 9.7% over the baseline is reported on the OuluVS2 database, and of 1.5% on the CUAVE database, when compared with other methods which use a similar visual front-end.
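The abstract says the second stream operates on difference images but does not define them. A minimal sketch follows, assuming a difference image is the pixel-wise difference between consecutive mouth frames; that definition, the `Frame` alias, and the `difference_images` helper are illustrative assumptions, not taken from the record.

```python
from typing import List

# Assumption: one grayscale mouth image stored as rows of pixel values.
Frame = List[List[float]]


def difference_images(frames: List[Frame]) -> List[Frame]:
    """Return pixel-wise differences between consecutive frames.

    Assumed definition: out[t] = frames[t + 1] - frames[t], so a sequence
    of T frames yields T - 1 difference images.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([
            [c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return diffs


# Toy sequence of three 2x2 frames (hypothetical values).
frames = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 2.0]],
    [[1.0, 3.0], [0.0, 2.0]],
]
diffs = difference_images(frames)
```

Under this assumption, the raw frames would feed one stream and `diffs` the other, with each stream's sequence then passed to its own LSTM before the BLSTM fusion described in the abstract.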

Pages: 2592-2596

Page count: 5
