LOOK, LISTEN, AND DECODE: MULTIMODAL SPEECH RECOGNITION WITH IMAGES

Cited by: 0

Authors:

Sun, Felix [1]

Harwath, David [1]

Glass, James [1]

Affiliations:

[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA

Source:

2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016) | 2016

Keywords:

Multimodal speech recognition; image captioning; CNN; lattices

DOI:

Not available

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

In this paper, we introduce a multimodal speech recognition scenario, in which an image provides contextual information for a spoken caption to be decoded. We investigate a lattice rescoring algorithm that integrates information from the image at two different points: the image is used to augment the language model with the most likely words, and to rescore the top hypotheses using a word-level RNN. This rescoring mechanism decreases the word error rate by 3 absolute percentage points, compared to a baseline speech recognizer operating with only the speech recording.

Pages: 573-578

Page count: 6
