Attention-Based Keyword Localisation in Speech using Visual Grounding

Cited by: 6
Authors
Olaleye, Kayode [1]
Kamper, Herman [1]
Affiliation
[1] Stellenbosch Univ, Dept & Engn, Stellenbosch, South Africa
Source
INTERSPEECH 2021
Funding
National Research Foundation, Singapore;
Keywords
multimodal modelling; keyword localisation; visual grounding; attention; word discovery;
DOI
10.21437/Interspeech.2021-435
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104 ; 100213 ;
Abstract
Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do keyword localisation: predicting where, within an utterance, a given textual keyword occurs without any explicit text-based or alignment supervision. We specifically consider whether incorporating attention into a convolutional model is beneficial for localisation. Although absolute localisation performance with visually supervised models is still modest (compared to using unordered bag-of-word text labels for supervision), we show that attention provides a large gain in performance over previous visually grounded models. As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions, e.g. locating the word 'backstroke' for the query keyword 'swimming'.
Pages: 2991-2995
Number of pages: 5