Audio-Visual Speech Recognition Using Bimodal-Trained Bottleneck Features for a Person with Severe Hearing Loss

Cited by: 14

Authors:

Takashima, Yuki [1]

Aihara, Ryo [1]

Takiguchi, Tetsuya [1]

Ariki, Yasuo [1]

Mitani, Nobuyuki [2]

Omori, Kiyohiro [2]

Nakazono, Kaoru [2]

Affiliations:

[1] Kobe Univ, Grad Sch Syst Informat, Kobe, Hyogo, Japan

[2] Hyogo Inst Assist Technol, Kobe, Hyogo, Japan

Source:

17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016

Keywords:

multimodal; lip reading; deep-learning; assistive technology;

DOI:

10.21437/Interspeech.2016-721

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

In this paper, we propose an audio-visual speech recognition system for a person with an articulation disorder resulting from severe hearing loss. The speech style of a person with this type of articulation disorder is so different from that of people without hearing loss that a speaker-independent acoustic model for unimpaired persons is hardly useful for recognizing it. The audio-visual speech recognition system we present in this paper is intended for a person with severe hearing loss in noisy environments. Although feature integration is an important factor in multimodal speech recognition, it is difficult to integrate audio and visual features efficiently because they are intrinsically different. We propose a novel visual feature extraction approach that connects the lip image to audio features efficiently, and the use of convolutive bottleneck networks (CBNs) increases robustness with respect to speech fluctuations caused by hearing loss. The effectiveness of this approach was confirmed through word-recognition experiments in noisy environments, where the CBN-based feature extraction method outperformed the conventional methods.


Pages: 277-281

Page count: 5

50 items in total

[1] Audio-visual speech recognition using convolutive bottleneck networks for a person with severe hearing loss
Takashima, Yuki
Kakihara, Yasuhiro
Aihara, Ryo
Takiguchi, Tetsuya
Ariki, Yasuo
Mitani, Nobuyuki
Omori, Kiyohiro
Nakazono, Kaoru
[J]. IPSJ Transactions on Computer Vision and Applications, 2015, 7 : 64 - 68
[2] Integration of Deep Bottleneck Features for Audio-Visual Speech Recognition
Ninomiya, Hiroshi
Kitaoka, Norihide
Tamura, Satoshi
Iribe, Yurie
Takeda, Kazuya
[J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 563 - 567
[3] Audio-visual modeling for bimodal speech recognition
Kaynak, MN
Zhi, Q
Cheok, AD
Sengupta, K
Chung, KC
[J]. 2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: E-SYSTEMS AND E-MAN FOR CYBERNETICS IN CYBERSPACE, 2002, : 181 - 186
[4] Bimodal fusion in audio-visual speech recognition
Zhang, XZ
Mersereau, RM
Clements, M
[J]. 2002 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL I, PROCEEDINGS, 2002, : 964 - 967
[5] Audio-visual speech recognition using deep bottleneck features and high-performance lipreading
Tamura, Satoshi
Ninomiya, Hiroshi
Kitaoka, Norihide
Osuga, Shin
Iribe, Yurie
Takeda, Kazuya
Hayamizu, Satoru
[J]. 2015 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2015, : 575 - 582
[6] Audio-visual speech recognition using MPEG-4 compliant visual features
Aleksic, PS
Williams, JJ
Wu, ZL
Katsaggelos, AK
[J]. EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING, 2002, 2002 (11) : 1213 - 1227
[7] Two-Level Bimodal Association for Audio-Visual Speech Recognition
Lee, Jong-Seok
Ebrahimi, Touradj
[J]. ADVANCED CONCEPTS FOR INTELLIGENT VISION SYSTEMS, PROCEEDINGS, 2009, 5807 : 133 - 144
[8] Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features
Petar S. Aleksic
Jay J. Williams
Zhilin Wu
Aggelos K. Katsaggelos
[J]. EURASIP Journal on Advances in Signal Processing, 2002
[9] A Robust Audio-visual Speech Recognition Using Audio-visual Voice Activity Detection
Tamura, Satoshi
Ishikawa, Masato
Hashiba, Takashi
Takeuchi, Shin'ichi
Hayamizu, Satoru
[J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2702 - +
[10] Depth-based Features in Audio-Visual Speech Recognition
Palecek, Karel
Chaloupka, Josef
[J]. 2016 39TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING (TSP), 2016, : 303 - 306
