Semantic Speech Retrieval With a Visually Grounded Model of Untranscribed Speech

Cited by: 27
Authors
Kamper, Herman [1 ]
Shakhnarovich, Gregory [2 ]
Livescu, Karen [2 ]
Affiliations
[1] Stellenbosch Univ, Dept Elect & Elect Engn, ZA-7599 Stellenbosch, South Africa
[2] TTI Chicago, Chicago, IL 60637 USA
Funding
US National Science Foundation
Keywords
Visual grounding; multimodal modelling; speech retrieval; semantic retrieval; keyword spotting; SPOKEN CONTENT; WORD SEGMENTATION; RECOGNITION; INFORMATION; ACQUISITION; DISCOVERY;
DOI
10.1109/TASLP.2018.2872106
Chinese Library Classification
O42 [Acoustics]
Subject Classification
070206; 082403
Abstract
There is a growing interest in models that can learn from unlabelled speech paired with visual context. This setting is relevant for low-resource speech processing, robotics, and human language acquisition research. Here, we study how a visually grounded speech model, trained on images of scenes paired with spoken captions, captures aspects of semantics. We use an external image tagger to generate soft text labels from images, which serve as targets for a neural model that maps untranscribed speech to (semantic) keyword labels. We introduce a newly collected data set of human semantic relevance judgements and an associated task, semantic speech retrieval, where the goal is to search for spoken utterances that are semantically relevant to a given text query. Without seeing any text, the model trained on parallel speech and images achieves a precision of almost 60% on its top ten semantic retrievals. Compared to a supervised model trained on transcriptions, our model matches human judgements better by some measures, especially in retrieving non-verbatim semantic matches. We perform an extensive analysis of the model and its resulting representations.
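The pipeline the abstract describes (an external image tagger yields soft keyword targets, a speech network is trained against them, and retrieval ranks utterances by the predicted probability of the query word) can be sketched as follows. This is a minimal NumPy illustration for exposition only: the function names are invented here, and the per-keyword sigmoid cross-entropy against soft targets is an assumed stand-in for the paper's exact training objective.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_label_loss(logits, soft_targets):
    """Cross-entropy between the model's per-keyword probabilities and the
    soft tag targets produced by an image tagger (assumed multi-label form)."""
    p = sigmoid(logits)
    eps = 1e-12  # numerical guard for log(0)
    return -np.sum(soft_targets * np.log(p + eps)
                   + (1.0 - soft_targets) * np.log(1.0 - p + eps))

def retrieve(utterance_logits, vocab, query, top_k=10):
    """Semantic speech retrieval: rank utterances by the trained model's
    predicted probability for the text query's keyword; no transcription used."""
    q = vocab.index(query)                    # column for the query keyword
    probs = sigmoid(utterance_logits[:, q])   # one score per utterance
    return np.argsort(-probs)[:top_k].tolist()
```

At training time each utterance's logits would come from a speech encoder; at test time `retrieve` only needs those logits and the text query, which is what lets the model answer queries without ever seeing transcriptions.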
Pages: 89-98 (10 pages)
Related Papers (50 total)
  • [1] Keyword Localisation in Untranscribed Speech Using Visually Grounded Speech Models
    Olaleye, Kayode; Oneata, Dan; Kamper, Herman
    IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1454-1466
  • [2] Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
    Kamper, Herman; Settle, Shane; Shakhnarovich, Gregory; Livescu, Karen
    18th Annual Conference of the International Speech Communication Association (Interspeech 2017), 2017: 3677-3681
  • [3] Representations of Language in a Model of Visually Grounded Speech Signal
    Chrupala, Grzegorz; Gelderloos, Lieke; Alishahi, Afra
    Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vol. 1, 2017: 613-622
  • [4] Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms
    Ohishi, Yasunori; Kimura, Akisato; Kawanishi, Takahito; Kashino, Kunio; Harwath, David; Glass, James
    2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020: 4352-4356
  • [5] Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
    Harwath, David; Chuang, Galen; Glass, James
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 4969-4973
  • [6] Speech-to-Speech Translation Between Untranscribed Unknown Languages
    Tjandra, Andros; Sakti, Sakriani; Nakamura, Satoshi
    2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), 2019: 593-600
  • [7] Learning to Recognise Words Using Visually Grounded Speech
    Scholten, Sebastiaan; Merkx, Danny; Scharenborg, Odette
    2021 IEEE International Symposium on Circuits and Systems (ISCAS), 2021
  • [8] Visually Grounded Speech Models Have a Mutual Exclusivity Bias
    Nortje, Leanne; Oneata, Dan; Matusevych, Yevgen; Kamper, Herman
    Transactions of the Association for Computational Linguistics, 2024, 12: 755-770
  • [9] Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets
    Ohishi, Yasunori; Kimura, Akisato; Kawanishi, Takahito; Kashino, Kunio; Harwath, David; Glass, James
    Interspeech 2020, 2020: 1486-1490
  • [10] Modelling Human Word Learning and Recognition Using Visually Grounded Speech
    Merkx, Danny; Scholten, Sebastiaan; Frank, Stefan L.; Ernestus, Mirjam; Scharenborg, Odette
    Cognitive Computation, 2023, 15(1): 272-288