Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

Cited: 0
Authors
Tamura, Satoshi [1 ]
Ninomiya, Hiroshi [2 ]
Kitaoka, Norihide [3 ]
Osuga, Shin [4 ]
Iribe, Yurie [5 ]
Takeda, Kazuya [2 ]
Hayamizu, Satoru [1 ]
Affiliations
[1] Gifu Univ, Gifu, Japan
[2] Nagoya Univ, Nagoya, Aichi 4648601, Japan
[3] Tokushima Univ, Tokushima, Japan
[4] Aisin Seiki Co Ltd, Kariya, Aichi, Japan
[5] Aichi Prefectural Univ, Nagakute, Aichi, Japan
Keywords
DOI: none
Chinese Library Classification: TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline codes: 0808; 0809
Abstract
This paper develops an Audio-Visual Speech Recognition (AVSR) method by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating the effectiveness of voice activity detection (VAD) in the visual modality. In our approach, many kinds of visual features are incorporated and subsequently converted into bottleneck features using deep learning. Using the proposed features, we achieved 73.66% lipreading accuracy in a speaker-independent open condition, and about 90% AVSR accuracy on average in noisy environments. In addition, we extracted speech segments from the visual features, resulting in 77.80% lipreading accuracy. We find that VAD is useful in both the audio and visual modalities for better lipreading and AVSR.
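The deep bottleneck feature (DBNF) idea in the abstract can be sketched as follows: a feed-forward network is trained with one deliberately narrow hidden layer, and after training the activations of that layer serve as compact features for the recognizer. This is a minimal illustration only, not the authors' implementation; the layer widths (a 39-unit bottleneck between wide 512-unit hidden layers, 120-dimensional input) are assumed for demonstration, and training is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Illustrative layer widths: input -> hidden -> hidden -> bottleneck -> hidden -> output.
sizes = [120, 512, 512, 39, 512, 120]
weights = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]

def bottleneck_features(x, bottleneck_index=3):
    """Forward-propagate only up to the bottleneck layer and return its activations."""
    h = x
    for i, w in enumerate(weights):
        h = relu(h @ w)
        if i + 1 == bottleneck_index:  # hidden layers counted from 1 after the input
            return h
    return h

frame = rng.standard_normal(120)   # one (hypothetical) concatenated visual-feature frame
dbnf = bottleneck_features(frame)
print(dbnf.shape)                  # (39,) -- the compact bottleneck feature
```

In practice the network would be trained to predict phonetic targets before the layers after the bottleneck are discarded; here random weights merely show the data flow.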
Pages: 575-582 (8 pages)
Related papers (50 total)
  • [1] Integration of Deep Bottleneck Features for Audio-Visual Speech Recognition
    Ninomiya, Hiroshi
    Kitaoka, Norihide
    Tamura, Satoshi
    Iribe, Yurie
    Takeda, Kazuya
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 563 - 567
  • [2] Multi-pose lipreading and audio-visual speech recognition
    Estellers, Virginia
    Thiran, Jean-Philippe
    EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2012, : 1 - 23
  • [3] Part-Based Lipreading for Audio-Visual Speech Recognition
    Miao, Ziling
    Liu, Hong
    Yang, Bing
    2020 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2020, : 2722 - 2726
  • [5] Deep Audio-Visual Speech Recognition
    Afouras, Triantafyllos
    Chung, Joon Son
    Senior, Andrew
    Vinyals, Oriol
    Zisserman, Andrew
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (12) : 8717 - 8727
  • [6] Audio-visual speech recognition using deep learning
    Noda, Kuniaki
    Yamaguchi, Yuki
    Nakadai, Kazuhiro
    Okuno, Hiroshi G.
    Ogata, Tetsuya
    APPLIED INTELLIGENCE, 2015, 42 (04) : 722 - 737
  • [8] Audio-visual speech recognition using MPEG-4 compliant visual features
    Aleksic, PS
    Williams, JJ
    Wu, ZL
    Katsaggelos, AK
    EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING, 2002, 2002 (11) : 1213 - 1227
  • [9] Audio-Visual Speech Recognition Using Bimodal-Trained Bottleneck Features for a Person with Severe Hearing Loss
    Takashima, Yuki
    Aihara, Ryo
    Takiguchi, Tetsuya
    Ariki, Yasuo
    Mitani, Nobuyuki
    Omori, Kiyohiro
    Nakazono, Kaoru
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 277 - 281
  • [10] DEEP MULTIMODAL LEARNING FOR AUDIO-VISUAL SPEECH RECOGNITION
    Mroueh, Youssef
    Marcheret, Etienne
    Goel, Vaibhava
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2130 - 2134