Audio-Visual Speech Recognition Using Bimodal-Trained Bottleneck Features for a Person with Severe Hearing Loss

Cited by: 14
Authors
Takashima, Yuki [1]
Aihara, Ryo [1]
Takiguchi, Tetsuya [1]
Ariki, Yasuo [1]
Mitani, Nobuyuki [2]
Omori, Kiyohiro [2]
Nakazono, Kaoru [2]
Affiliations
[1] Kobe Univ, Grad Sch Syst Informat, Kobe, Hyogo, Japan
[2] Hyogo Inst Assist Technol, Kobe, Hyogo, Japan
Keywords
multimodal; lip reading; deep learning; assistive technology
DOI
10.21437/Interspeech.2016-721
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
In this paper, we propose an audio-visual speech recognition system for a person with an articulation disorder resulting from severe hearing loss. The speech style of a person with this type of articulation disorder differs so much from that of people without hearing loss that a speaker-independent acoustic model built for unimpaired speakers is of little use for recognizing it. The system we present targets recognition of speech from a person with severe hearing loss in noisy environments. Although feature integration is an important factor in multimodal speech recognition, audio and visual features are intrinsically different, which makes them difficult to integrate efficiently. We propose a novel visual feature extraction approach that connects the lip image to audio features efficiently, and the use of convolutive bottleneck networks (CBNs) increases robustness to the speech fluctuations caused by hearing loss. The effectiveness of this approach was confirmed through word-recognition experiments in noisy environments, where the CBN-based feature extraction method outperformed conventional methods.
Pages: 277-281
Page count: 5