SPEECHCLIP: INTEGRATING SPEECH WITH PRE-TRAINED VISION AND LANGUAGE MODEL

被引:6
|
作者
Shih, Yi-Jen [1 ]
Wang, Hsuan-Fu [1 ]
Chang, Heng-Jui [1 ,2 ]
Berry, Layne [3 ]
Lee, Hung-yi [1 ]
Harwath, David [3 ]
机构
[1] Natl Taiwan Univ, Taipei, Taiwan
[2] MIT CSAIL, Cambridge, MA USA
[3] Univ Texas Austin, Austin, TX 78712 USA
关键词
Visual grounding; vision and language; self-supervised learning; REPRESENTATION;
D O I
10.1109/SLT54892.2023.10022954
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms prior stateof-the-art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.
引用
收藏
页码:715 / 722
页数:8
相关论文
共 50 条
  • [41] Learning and Evaluating a Differentially Private Pre-trained Language Model
    Hoory, Shlomo
    Feder, Amir
    Tendler, Avichai
    Cohen, Alon
    Erell, Sofia
    Laish, Itay
    Nakhost, Hootan
    Stemmer, Uri
    Benjamini, Ayelet
    Hassidim, Avinatan
    Matias, Yossi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 1178 - 1189
  • [42] TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models
    Yang, Ziqing
    Cui, Yiming
    Chen, Zhigang
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): PROCEEDINGS OF SYSTEM DEMONSTRATIONS, 2022, : 35 - 43
  • [43] Multimodal Search on Iconclass using Vision-Language Pre-Trained Models
    Santini, Cristian
    Posthumus, Etienne
    Tietz, Tabea
    Tan, Mary Ann
    Bruns, Oleksandra
    Sack, Harald
    2023 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, JCDL, 2023, : 285 - 287
  • [44] CPT: Colorful Prompt Tuning for pre-trained vision-language models
    Yao, Yuan
    Zhang, Ao
    Zhang, Zhengyan
    Liu, Zhiyuan
    Chua, Tat-Seng
    Sun, Maosong
    AI OPEN, 2024, 5 : 30 - 38
  • [45] Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
    Yu, Tiezheng
    Dai, Wenliang
    Liu, Zihan
    Fung, Pascale
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 3995 - 4007
  • [46] AraXLNet: pre-trained language model for sentiment analysis of Arabic
    Alhanouf Alduailej
    Abdulrahman Alothaim
    Journal of Big Data, 9
  • [47] Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners
    Park, Keon-Hee
    Song, Kyungwoo
    Park, Gyeong-Moon
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 23881 - 23890
  • [48] A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-Language Model
    Xu, Mengde
    Zhang, Zheng
    Wei, Fangyun
    Lin, Yutong
    Cao, Yue
    Hu, Han
    Bai, Xiang
    COMPUTER VISION, ECCV 2022, PT XXIX, 2022, 13689 : 736 - 753
  • [49] Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis
    An, Jieyu
    Zainon, Wan Mohd Nazmee Wan
    Ding, Binfen
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2023, 37 (02): : 1673 - 1689
  • [50] Constraint embedding for prompt tuning in vision-language pre-trained modelConstraint embedding for prompt tuning in vision-language pre-trained modelK. Cheng et al.
    Keyang Cheng
    Liutao Wei
    Jingfeng Tang
    Yongzhao Zhan
    Multimedia Systems, 2025, 31 (1)