SPEECHCLIP: INTEGRATING SPEECH WITH PRE-TRAINED VISION AND LANGUAGE MODEL

Cited by: 6

Authors:

Shih, Yi-Jen [1]

Wang, Hsuan-Fu [1]

Chang, Heng-Jui [1,2]

Berry, Layne [3]

Lee, Hung-yi [1]

Harwath, David [3]

Affiliations:

[1] Natl Taiwan Univ, Taipei, Taiwan

[2] MIT CSAIL, Cambridge, MA USA

[3] Univ Texas Austin, Austin, TX 78712 USA

Source:

2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT | 2022

Keywords:

Visual grounding; vision and language; self-supervised learning; representation

DOI:

10.1109/SLT54892.2023.10022954

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms prior state-of-the-art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.

Pages: 715-722

Number of pages: 8
