LOOK, LISTEN, AND DECODE: MULTIMODAL SPEECH RECOGNITION WITH IMAGES

Cited by: 0
Authors
Sun, Felix [1]
Harwath, David [1]
Glass, James [1]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Keywords
Multimodal speech recognition; image captioning; CNN; lattices
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we introduce a multimodal speech recognition scenario, in which an image provides contextual information for a spoken caption to be decoded. We investigate a lattice rescoring algorithm that integrates information from the image at two different points: the image is used to augment the language model with the most likely words, and to rescore the top hypotheses using a word-level RNN. This rescoring mechanism decreases the word error rate by 3 absolute percentage points, compared to a baseline speech recognizer operating with only the speech recording.
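The second rescoring step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps the word-level RNN for a simple bag-of-words score, and the function name `rescore_nbest`, the `lm_weight` interpolation weight, and the `floor` probability are all hypothetical choices made for illustration. It assumes each hypothesis already carries a log score from the baseline recognizer and that an image model has produced per-word probabilities.

```python
import math

def rescore_nbest(hypotheses, image_word_probs, lm_weight=0.5, floor=1e-4):
    """Re-rank (words, base_log_score) pairs with image-derived word probabilities.

    Hypothetical sketch: a unigram stand-in for an image-conditioned
    language model, not the word-level RNN named in the abstract.
    """
    rescored = []
    for words, base_score in hypotheses:
        # Words the image model did not predict fall back to a small floor probability.
        image_score = sum(math.log(image_word_probs.get(w, floor)) for w in words)
        rescored.append((words, base_score + lm_weight * image_score))
    # Highest combined log score first.
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored

# Two competing hypotheses; the recognizer alone slightly prefers "fog".
hypotheses = [
    (["a", "dog", "on", "a", "beach"], -10.0),
    (["a", "fog", "on", "a", "beach"], -9.5),
]
# Assumed image-derived word probabilities (illustrative values only).
image_word_probs = {"a": 0.2, "dog": 0.1, "on": 0.1, "beach": 0.1}

best_words, best_score = rescore_nbest(hypotheses, image_word_probs)[0]
print(" ".join(best_words))
```

With these illustrative numbers the image evidence outweighs the small acoustic preference, so the "dog" hypothesis ranks first. Because the sketch sums unnormalized unigram log probabilities, it also penalizes longer hypotheses, which a real system would need to account for.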
Pages: 573-578
Number of pages: 6
Related papers
50 in total
  • [21] Look, Listen, and Attack: Backdoor Attacks Against Video Action Recognition
    Hammoud, Hasan Abed Al Kader
    Liu, Shuming
    Alkhrashi, Mohammed
    AlBalawi, Fahad
    Ghanem, Bernard
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW, 2024, : 3439 - 3450
  • [22] Listen and Look: Audio-Visual Matching Assisted Speech Source Separation
    Lu, Rui
    Duan, Zhiyao
    Zhang, Changshui
    IEEE SIGNAL PROCESSING LETTERS, 2018, 25 (09) : 1315 - 1319
  • [23] Speech recognition for command entry in multimodal interaction
    Tyfa, DA
    Howes, M
    INTERNATIONAL JOURNAL OF HUMAN-COMPUTER STUDIES, 2000, 52 (04) : 637 - 667
  • [24] AVAS: Speech Database for Multimodal Recognition Applications
    Antar, Samar
    Sagheer, Alaa
    Aly, Saleh
    Tolba, Mohamed F.
    2013 13TH INTERNATIONAL CONFERENCE ON HYBRID INTELLIGENT SYSTEMS (HIS), 2013, : 123 - 128
  • [25] Multimodal speech recognition for unmanned aerial vehicles
    Oneață, Dan
    Cucu, Horia
    Computers and Electrical Engineering, 2021, 90
  • [26] Multimodal English corpus for automatic speech recognition
    Kunka, Bartosz
    Kupryjanow, Adam
    Dalka, Piotr
    Bratoszewski, Piotr
    Szczodrak, Maciej
    Spaleniak, Pawel
    Szykulski, Marcin
    Czyzewski, Andrzej
    2013 SIGNAL PROCESSING: ALGORITHMS, ARCHITECTURES, ARRANGEMENTS, AND APPLICATIONS (SPA), 2013, : 106 - 111
  • [27] Towards the explainability of Multimodal Speech Emotion Recognition
    Kumar, Puneet
    Kaushik, Vishesh
    Raman, Balasubramanian
    INTERSPEECH 2021, 2021, : 1748 - 1752
  • [28] Temporal Multimodal Learning in Audiovisual Speech Recognition
    Hu, Di
    Li, Xuelong
    Lu, Xiaoqiang
    2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 3574 - 3582
  • [29] CONTINUOUS VISUAL SPEECH RECOGNITION FOR MULTIMODAL FUSION
    Benhaim, Eric
    Sahbi, Hichem
    Vitte, Guillaume
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [30] Multimodal speech recognition for unmanned aerial vehicles
    Oneata, Dan
    Cucu, Horia
    COMPUTERS & ELECTRICAL ENGINEERING, 2021, 90