Semantic Speech Retrieval With a Visually Grounded Model of Untranscribed Speech

Cited by: 27
Authors
Kamper, Herman [1 ]
Shakhnarovich, Gregory [2 ]
Livescu, Karen [2 ]
Affiliations
[1] Stellenbosch Univ, Dept Elect & Elect Engn, ZA-7599 Stellenbosch, South Africa
[2] TTI Chicago, Chicago, IL 60637 USA
Funding
US National Science Foundation
Keywords
Visual grounding; multimodal modelling; speech retrieval; semantic retrieval; keyword spotting; SPOKEN CONTENT; WORD SEGMENTATION; RECOGNITION; INFORMATION; ACQUISITION; DISCOVERY;
DOI
10.1109/TASLP.2018.2872106
Chinese Library Classification
O42 [Acoustics]
Subject Classification
070206; 082403
Abstract
There is a growing interest in models that can learn from unlabelled speech paired with visual context. This setting is relevant for low-resource speech processing, robotics, and human language acquisition research. Here, we study how a visually grounded speech model, trained on images of scenes paired with spoken captions, captures aspects of semantics. We use an external image tagger to generate soft text labels from images, which serve as targets for a neural model that maps untranscribed speech to (semantic) keyword labels. We introduce a newly collected data set of human semantic relevance judgements and an associated task, semantic speech retrieval, where the goal is to search for spoken utterances that are semantically relevant to a given text query. Without seeing any text, the model trained on parallel speech and images achieves a precision of almost 60% on its top ten semantic retrievals. Compared to a supervised model trained on transcriptions, our model matches human judgements better by some measures, especially in retrieving non-verbatim semantic matches. We perform an extensive analysis of the model and its resulting representations.
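The pipeline the abstract describes (an external image tagger yields soft keyword targets, a speech network is trained against them, and retrieval ranks utterances by the predicted probability of the query word) can be sketched as follows. This is a minimal NumPy illustration for exposition only: the function names are invented here, and the per-keyword sigmoid cross-entropy against soft targets is an assumed stand-in for the paper's exact training objective.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_label_loss(logits, soft_targets):
    """Cross-entropy between the model's per-keyword probabilities and the
    soft tag targets produced by an image tagger (assumed multi-label form)."""
    p = sigmoid(logits)
    eps = 1e-12  # numerical guard for log(0)
    return -np.sum(soft_targets * np.log(p + eps)
                   + (1.0 - soft_targets) * np.log(1.0 - p + eps))

def retrieve(utterance_logits, vocab, query, top_k=10):
    """Semantic speech retrieval: rank utterances by the trained model's
    predicted probability for the text query's keyword; no transcription used."""
    q = vocab.index(query)                    # column for the query keyword
    probs = sigmoid(utterance_logits[:, q])   # one score per utterance
    return np.argsort(-probs)[:top_k].tolist()
```

At training time each utterance's logits would come from a speech encoder; at test time `retrieve` only needs those logits and the text query, which is what lets the model answer queries without ever seeing transcriptions.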
Pages: 89-98 (10 pages)
Related Papers (50 total)
  • [1] Keyword Localisation in Untranscribed Speech Using Visually Grounded Speech Models
    Olaleye, Kayode; Oneata, Dan; Kamper, Herman
    IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1454-1466
  • [2] Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
    Kamper, Herman; Settle, Shane; Shakhnarovich, Gregory; Livescu, Karen
    18th Annual Conference of the International Speech Communication Association (Interspeech 2017), 2017: 3677-3681
  • [3] Representations of Language in a Model of Visually Grounded Speech Signal
    Chrupala, Grzegorz; Gelderloos, Lieke; Alishahi, Afra
    Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vol. 1, 2017: 613-622
  • [4] Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms
    Ohishi, Yasunori; Kimura, Akisato; Kawanishi, Takahito; Kashino, Kunio; Harwath, David; Glass, James
    2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020: 4352-4356
  • [5] Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
    Harwath, David; Chuang, Galen; Glass, James
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 4969-4973
  • [6] Speech-to-Speech Translation Between Untranscribed Unknown Languages
    Tjandra, Andros; Sakti, Sakriani; Nakamura, Satoshi
    2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), 2019: 593-600
  • [7] Learning to Recognise Words Using Visually Grounded Speech
    Scholten, Sebastiaan; Merkx, Danny; Scharenborg, Odette
    2021 IEEE International Symposium on Circuits and Systems (ISCAS), 2021
  • [8] Visually Grounded Speech Models Have a Mutual Exclusivity Bias
    Nortje, Leanne; Oneata, Dan; Matusevych, Yevgen; Kamper, Herman
    Transactions of the Association for Computational Linguistics, 2024, 12: 755-770
  • [9] Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets
    Ohishi, Yasunori; Kimura, Akisato; Kawanishi, Takahito; Kashino, Kunio; Harwath, David; Glass, James
    Interspeech 2020, 2020: 1486-1490
  • [10] Modelling Human Word Learning and Recognition Using Visually Grounded Speech
    Merkx, Danny; Scholten, Sebastiaan; Frank, Stefan L.; Ernestus, Mirjam; Scharenborg, Odette
    Cognitive Computation, 2023, 15(1): 272-288