END-TO-END VISUAL SPEECH RECOGNITION WITH LSTMS

Cited by: 0
Authors
Petridis, Stavros [1]
Li, Zuwei [1]
Pantic, Maja [1, 2]
Affiliations
[1] Imperial Coll London, Dept Comp, London, England
[2] Univ Twente, EEMCS, Enschede, Netherlands
Keywords
Visual Speech Recognition; Lipreading; End-to-End Training; Long-Short Term Recurrent Neural Networks; Deep Networks;
DOI
Not available
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Traditional visual speech recognition systems consist of two stages: feature extraction and classification. Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage. However, research on joint learning of features and classification is very limited. In this work, we present an end-to-end visual speech recognition system based on Long Short-Term Memory (LSTM) networks. To the best of our knowledge, this is the first model which simultaneously learns to extract features directly from the pixels and perform classification, and which also achieves state-of-the-art performance in visual speech classification. The model consists of two streams which extract features directly from the mouth images and the difference images, respectively. The temporal dynamics in each stream are modelled by an LSTM, and the fusion of the two streams takes place via a Bidirectional LSTM (BLSTM). An absolute improvement of 9.7% over the baseline is reported on the OuluVS2 database, and of 1.5% on the CUAVE database, when compared with other methods which use a similar visual front-end.
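The abstract describes two input streams, one fed with mouth images and one with difference images, but does not say how the difference images are computed. The sketch below is only an illustration under the assumption that a difference image is the pixel-wise subtraction of consecutive frames; the names `difference_images` and `two_stream_inputs` are hypothetical, and dropping the first raw frame to align the two streams is likewise an assumption, not something stated in the record.

```python
from typing import List, Tuple

# One grayscale mouth region: rows of pixel intensities.
Frame = List[List[float]]


def difference_images(frames: List[Frame]) -> List[Frame]:
    """Assumed definition: diff[t] = frames[t + 1] - frames[t], pixel-wise."""
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([
            [c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return diffs


def two_stream_inputs(frames: List[Frame]) -> Tuple[List[Frame], List[Frame]]:
    """Return (raw stream, difference stream) with equal lengths.

    The first raw frame is dropped (an assumption) so that each raw frame
    is paired with the difference image that ends at that frame.
    """
    return frames[1:], difference_images(frames)


# Toy sequence of three 2x2 frames.
frames = [
    [[0.0, 1.0], [2.0, 3.0]],
    [[1.0, 1.0], [2.0, 5.0]],
    [[1.0, 0.0], [4.0, 5.0]],
]
raw_stream, diff_stream = two_stream_inputs(frames)
print(len(raw_stream), len(diff_stream))
print(diff_stream[0])
```

In the model described above, each of these two sequences would feed its own feature extractor and LSTM, with a BLSTM fusing the per-stream outputs; those learned components are not reproduced here.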
Pages: 2592-2596
Number of pages: 5