LOOK, LISTEN, AND DECODE: MULTIMODAL SPEECH RECOGNITION WITH IMAGES

Cited by: 0
Authors
Sun, Felix [1]
Harwath, David [1]
Glass, James [1]
Affiliation
[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Keywords
Multimodal speech recognition; image captioning; CNN; lattices
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we introduce a multimodal speech recognition scenario, in which an image provides contextual information for a spoken caption to be decoded. We investigate a lattice rescoring algorithm that integrates information from the image at two different points: the image is used to augment the language model with the most likely words, and to rescore the top hypotheses using a word-level RNN. This rescoring mechanism decreases the word error rate by 3 absolute percentage points, compared to a baseline speech recognizer operating with only the speech recording.
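The two integration points described in the abstract can be illustrated with a minimal sketch: (1) interpolate image-predicted word probabilities into the language model used when decoding the lattice, and (2) rerank the top hypotheses with an image-conditioned word-level RNN score. This is not the authors' implementation; all names (image_word_logprobs, image_rnn_score, the mixing weights) are illustrative assumptions.

# Minimal sketch of image-aware two-stage rescoring (assumptions noted above).
import math

def boost_language_model(lm_logprobs, image_word_logprobs, weight=0.5):
    """Point 1: mix image-predicted word probabilities into the language
    model used during lattice decoding (hypothetical interpolation)."""
    boosted = {}
    for word, lm_lp in lm_logprobs.items():
        img_p = math.exp(image_word_logprobs.get(word, float("-inf")))
        boosted[word] = math.log((1 - weight) * math.exp(lm_lp) + weight * img_p)
    return boosted

def rescore_hypotheses(nbest, image_rnn_score, alpha=0.7):
    """Point 2: rerank the top hypotheses by mixing the recognizer's score
    with a score from an image-conditioned word-level RNN."""
    best_hyp, best_total = None, float("-inf")
    for hyp_words, asr_score in nbest:
        total = alpha * asr_score + (1 - alpha) * image_rnn_score(hyp_words)
        if total > best_total:
            best_hyp, best_total = hyp_words, total
    return best_hyp

# Toy usage: the image makes "dog" more likely than the acoustically
# preferred "fog", so the reranked output flips to the "dog" hypothesis.
if __name__ == "__main__":
    lm = {"dog": math.log(0.2), "fog": math.log(0.3)}
    img = {"dog": math.log(0.9)}
    print(boost_language_model(lm, img))
    nbest = [(["a", "fog", "runs"], -12.0), (["a", "dog", "runs"], -12.5)]
    print(rescore_hypotheses(nbest, lambda words: -1.0 if "dog" in words else -5.0))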
Pages: 573-578
Page count: 6