SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

Cited by: 6
Authors
Shih, Yi-Jen [1 ]
Wang, Hsuan-Fu [1 ]
Chang, Heng-Jui [1 ,2 ]
Berry, Layne [3 ]
Lee, Hung-yi [1 ]
Harwath, David [3 ]
Affiliations
[1] Natl Taiwan Univ, Taipei, Taiwan
[2] MIT CSAIL, Cambridge, MA USA
[3] Univ Texas Austin, Austin, TX 78712 USA
Keywords
Visual grounding; vision and language; self-supervised learning; representation
DOI
10.1109/SLT54892.2023.10022954
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. We therefore propose SpeechCLIP, a novel framework that bridges speech and text through images to enhance speech models without transcriptions. We leverage the state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms the prior state of the art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.
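The abstract describes the approach only at a high level. The sketch below illustrates one plausible reading of it: a frozen speech encoder (standing in for HuBERT) and a frozen image encoder (standing in for CLIP's vision tower) are bridged by small trainable projection heads under a CLIP-style symmetric contrastive loss over paired images and spoken captions. All class names, dimensions, and the mean-pooling aggregation here are illustrative assumptions, not the paper's exact architecture.

# Minimal sketch of CLIP-style speech-image alignment, assuming
# stand-in encoders; hypothetical shapes and names throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechImageAligner(nn.Module):
    """Align a speech encoder with an image encoder in a shared space.

    Both encoders stay frozen; only the two projection heads and the
    temperature are trained, mirroring the "minimal fine-tuning" idea.
    """

    def __init__(self, speech_encoder, image_encoder,
                 speech_dim=768, image_dim=512, shared_dim=512):
        super().__init__()
        self.speech_encoder = speech_encoder  # e.g. a pre-trained HuBERT
        self.image_encoder = image_encoder    # e.g. CLIP's vision tower
        for enc in (self.speech_encoder, self.image_encoder):
            for p in enc.parameters():
                p.requires_grad = False      # keep pre-trained weights fixed
        self.speech_proj = nn.Linear(speech_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        # Learnable temperature, initialised like CLIP's log(1/0.07).
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def forward(self, waveforms, images):
        # (B, T, speech_dim) frame features -> one vector per utterance.
        s = self.speech_encoder(waveforms).mean(dim=1)
        v = self.image_encoder(images)                 # (B, image_dim)
        s = F.normalize(self.speech_proj(s), dim=-1)
        v = F.normalize(self.image_proj(v), dim=-1)
        # Symmetric InfoNCE: matched speech/image pairs sit on the diagonal.
        logits = self.logit_scale.exp() * s @ v.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

# Smoke test with dummy stand-in encoders (shapes are illustrative only).
class DummySpeech(nn.Module):
    def forward(self, wav):                       # wav: (B, samples)
        return torch.randn(wav.size(0), 50, 768)  # (B, frames, dim)

class DummyImage(nn.Module):
    def forward(self, img):                       # img: (B, 3, 224, 224)
        return torch.randn(img.size(0), 512)

model = SpeechImageAligner(DummySpeech(), DummyImage())
loss = model(torch.randn(4, 16000), torch.randn(4, 3, 224, 224))
loss.backward()

Because the image embeddings in such a setup live in CLIP's shared image-text space, a speech encoder aligned this way can also be compared against CLIP text embeddings, which is what makes zero-shot speech-text retrieval possible without any transcription supervision.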
Pages: 715-722
Page count: 8
Related Papers
50 items in total
  • [1] Integrating Pre-Trained Language Model With Physical Layer Communications
    Lee, Ju-Hyung
    Lee, Dong-Ho
    Lee, Joohan
    Pujara, Jay
    IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, 2024, 23 (11) : 17266 - 17278
  • [2] Idiom Cloze Algorithm Integrating with Pre-trained Language Model
    Ju S.-G.
    Huang F.-Y.
    Sun J.-P.
    Ruan Jian Xue Bao/Journal of Software, 2022, 33 (10) : 3793 - 3805
  • [3] Leveraging Pre-trained Language Model for Speech Sentiment Analysis
    Shon, Suwon
    Brusco, Pablo
    Pan, Jing
    Han, Kyu J.
    Watanabe, Shinji
    INTERSPEECH 2021, 2021, : 3420 - 3424
  • [4] CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model
    Zhao, Xiaoqing
    Xu, Miaomiao
    Silamu, Wushour
    Li, Yanbing
    SENSORS, 2024, 24 (22)
  • [5] Comparing Pre-Trained Language Model for Arabic Hate Speech Detection
    Daouadi, Kheir Eddine
    Boualleg, Yaakoub
    Guehairia, Oussama
    COMPUTACION Y SISTEMAS, 2024, 28 (02) : 681 - 693
  • [6] Hyperbolic Pre-Trained Language Model
    Chen, Weize
    Han, Xu
    Lin, Yankai
    He, Kaichen
    Xie, Ruobing
    Zhou, Jie
    Liu, Zhiyuan
    Sun, Maosong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3101 - 3112
  • [7] Constraint embedding for prompt tuning in vision-language pre-trained model
    Cheng, Keyang
    Wei, Liutao
    Tang, Jingfeng
    Zhan, Yongzhao
    MULTIMEDIA SYSTEMS, 2025, 31 (01)
  • [8] Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
    Xing, Yinghui
    Wu, Qirui
    Cheng, De
    Zhang, Shizhou
    Liang, Guoqiang
    Wang, Peng
    Zhang, Yanning
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2056 - 2068
  • [9] Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization
    Jing, Liqiang
    Li, Yiren
    Xu, Junhao
    Yu, Yongcan
    Shen, Pei
    Song, Xuemeng
    MACHINE INTELLIGENCE RESEARCH, 2023, 20 (02) : 289 - 298
  • [10] Pre-trained Language Model Representations for Language Generation
    Edunov, Sergey
    Baevski, Alexei
    Auli, Michael
    2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 4052 - 4059