LOOK, LISTEN, AND DECODE: MULTIMODAL SPEECH RECOGNITION WITH IMAGES

Cited by: 0

Authors:

Sun, Felix [1]

Harwath, David [1]

Glass, James [1]

Affiliation:

[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA

Source:

2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016) | 2016

Keywords:

Multimodal speech recognition; image captioning; CNN; lattices

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In this paper, we introduce a multimodal speech recognition scenario, in which an image provides contextual information for a spoken caption to be decoded. We investigate a lattice rescoring algorithm that integrates information from the image at two different points: the image is used to augment the language model with the most likely words, and to rescore the top hypotheses using a word-level RNN. This rescoring mechanism decreases the word error rate by 3 absolute percentage points, compared to a baseline speech recognizer operating with only the speech recording.
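
The second integration point in the abstract, rescoring the top hypotheses with image information, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the paper uses a word-level RNN, whereas the sketch substitutes a toy word-overlap bonus. The names `Hypothesis`, `image_word_bonus`, `rescore`, and the `bonus` and `image_weight` parameters are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: list              # candidate transcription as a list of words
    recognizer_score: float  # log-score from the speech-only recognizer


def image_word_bonus(words, likely_words, bonus=0.5):
    """Toy stand-in for an image-conditioned score (assumption, not the
    paper's RNN): add a fixed bonus for each word in the hypothesis that
    the image model considers likely."""
    return sum(bonus for w in words if w in likely_words)


def rescore(hypotheses, likely_words, image_weight=1.0):
    """Return the hypothesis with the best combined score, where the
    combined score is the recognizer score plus a weighted image score."""
    def combined(h):
        return h.recognizer_score + image_weight * image_word_bonus(
            h.words, likely_words
        )
    return max(hypotheses, key=combined)


# Example: the speech-only recognizer slightly prefers "dock",
# but words predicted from the image favor "dog".
nbest = [
    Hypothesis("a dog on the beach".split(), -10.0),
    Hypothesis("a dock on the beach".split(), -9.8),
]
likely = {"dog", "beach"}
best = rescore(nbest, likely)
print(" ".join(best.words))
```

With `image_weight=0.0` the function falls back to the speech-only ranking, which makes the contribution of the image term easy to isolate.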


Pages: 573-578

Page count: 6
