LOOK, LISTEN, AND DECODE: MULTIMODAL SPEECH RECOGNITION WITH IMAGES

Cited by: 0
Authors
Sun, Felix [1]
Harwath, David [1]
Glass, James [1]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Keywords
Multimodal speech recognition; image captioning; CNN; lattices
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we introduce a multimodal speech recognition scenario, in which an image provides contextual information for a spoken caption to be decoded. We investigate a lattice rescoring algorithm that integrates information from the image at two different points: the image is used to augment the language model with the most likely words, and to rescore the top hypotheses using a word-level RNN. This rescoring mechanism decreases the word error rate by 3 absolute percentage points, compared to a baseline speech recognizer operating with only the speech recording.
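The two integration points named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it rescores a short n-best list rather than a lattice, uses a unigram table in place of a real language model, and replaces the word-level RNN with a hypothetical keyword-overlap scorer. All names, weights, and numbers below are assumptions chosen for illustration.

```python
import math

def boost_language_model(base_probs, image_words, boost=5.0):
    """Scale up words predicted from the image, then renormalize.

    Assumption: a unigram table stands in for the real language model.
    """
    scaled = {w: p * (boost if w in image_words else 1.0)
              for w, p in base_probs.items()}
    total = sum(scaled.values())
    return {w: p / total for w, p in scaled.items()}

def lm_logprob(words, probs, floor=1e-6):
    """Unigram log-probability; unseen words get a small floor probability."""
    return sum(math.log(probs.get(w, floor)) for w in words)

def rescore(nbest, probs, image_scorer, lm_weight=1.0, image_weight=1.0):
    """Return the hypothesis with the highest combined score.

    nbest: list of (hypothesis text, acoustic log score) pairs.
    image_scorer: callable mapping a word list to a score; here a
    hypothetical stand-in for an image-conditioned word-level RNN.
    """
    scored = []
    for text, acoustic in nbest:
        words = text.split()
        total = (acoustic
                 + lm_weight * lm_logprob(words, probs)
                 + image_weight * image_scorer(words))
        scored.append((total, text))
    return max(scored, key=lambda pair: pair[0])[1]

# Toy data (all values invented for illustration).
base_probs = {"a": 0.3, "dog": 0.05, "dock": 0.05, "on": 0.2,
              "the": 0.3, "grass": 0.05, "glass": 0.05}
image_words = {"dog", "grass"}  # words assumed to be predicted from the image
nbest = [("a dock on the glass", -10.0), ("a dog on the grass", -10.5)]

boosted = boost_language_model(base_probs, image_words)
overlap = lambda words: sum(1.0 for w in words if w in image_words)

speech_only = rescore(nbest, base_probs, lambda words: 0.0)
with_image = rescore(nbest, boosted, overlap)
print(speech_only)
print(with_image)
```

In this toy setting the speech-only pass keeps the acoustically preferred hypothesis, while the image-informed pass selects the one whose words match the image content. The relative weights would need tuning on held-out data in any real system.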
Pages: 573-578
Page count: 6