Attention-Based Keyword Localisation in Speech using Visual Grounding

Cited by: 6
Authors: Olaleye, Kayode [1]; Kamper, Herman [1]
Affiliations: [1] Stellenbosch Univ, Dept & Engn, Stellenbosch, South Africa
Funding: National Research Foundation of Singapore
Keywords: multimodal modelling; keyword localisation; visual grounding; attention; word discovery
DOI: 10.21437/Interspeech.2021-435
CLC classification: R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes: 100104; 100213
Abstract
Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do keyword localisation: predicting where, within an utterance, a given textual keyword occurs without any explicit text-based or alignment supervision. We specifically consider whether incorporating attention into a convolutional model is beneficial for localisation. Although absolute localisation performance with visually supervised models is still modest (compared to using unordered bag-of-word text labels for supervision), we show that attention provides a large gain in performance over previous visually grounded models. As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions, e.g. locating the word 'backstroke' for the query keyword 'swimming'.
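The localisation mechanism the abstract describes can be illustrated with a minimal sketch: per-frame speech features are compared against a keyword embedding, a softmax over time yields attention weights, and the frame with the highest weight is taken as the keyword's location. This is only an illustrative reconstruction, not the paper's actual architecture; the function name, feature shapes, and dot-product scoring are all assumptions.

```python
import numpy as np

def attend_and_localise(frames, query):
    """Score a keyword against an utterance with dot-product attention.

    frames: (T, d) array of frame-level speech features
            (e.g. outputs of a convolutional encoder -- hypothetical here).
    query:  (d,) embedding of the text keyword.

    Returns (detection_score, best_frame): the sigmoid keyword-presence
    score and the frame index where attention peaks, which serves as
    the predicted localisation.
    """
    scores = frames @ query                    # (T,) per-frame similarity
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over time
    pooled = weights @ scores                  # attention-weighted utterance score
    detection = 1.0 / (1.0 + np.exp(-pooled))  # sigmoid -> keyword-presence score
    return detection, int(weights.argmax())
```

Under this scheme, only an utterance-level presence label (here, the soft tag from a visual classifier) supervises training, yet the attention weights expose a frame-level alignment for free, which is the sense in which localisation needs no explicit alignment supervision.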
Pages: 2991-2995 (5 pages)