SPEECHCLIP: INTEGRATING SPEECH WITH PRE-TRAINED VISION AND LANGUAGE MODEL

Cited by: 6
Authors
Shih, Yi-Jen [1 ]
Wang, Hsuan-Fu [1 ]
Chang, Heng-Jui [1 ,2 ]
Berry, Layne [3 ]
Lee, Hung-yi [1 ]
Harwath, David [3 ]
Affiliations
[1] Natl Taiwan Univ, Taipei, Taiwan
[2] MIT CSAIL, Cambridge, MA USA
[3] Univ Texas Austin, Austin, TX 78712 USA
Keywords
Visual grounding; vision and language; self-supervised learning; representation
DOI
10.1109/SLT54892.2023.10022954
CLC classification
TP18 [Theory of artificial intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms the prior state-of-the-art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.
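The alignment the abstract describes can be pictured with a short sketch. The PyTorch snippet below is a minimal illustration, not the paper's exact architecture: frame-level features from a frozen HuBERT-style speech encoder are mean-pooled, projected into the CLIP embedding space, and trained with a CLIP-style symmetric contrastive loss against frozen CLIP image embeddings. The class name, the dimensions (768 as for HuBERT-base, 512 as for CLIP ViT-B/32), and the mean pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechToCLIPAligner(nn.Module):
    """Minimal sketch (hypothetical module, not the paper's exact model):
    pool speech features and project them into the frozen CLIP space."""

    def __init__(self, speech_dim: int = 768, clip_dim: int = 512):
        super().__init__()
        # Learnable projection from the speech encoder's hidden size
        # to CLIP's joint embedding size (assumed 512 for ViT-B/32).
        self.proj = nn.Linear(speech_dim, clip_dim)
        # CLIP-style learnable temperature, initialized near ln(1/0.07).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, speech_feats: torch.Tensor, image_embeds: torch.Tensor):
        # speech_feats: (batch, frames, speech_dim) from a frozen HuBERT;
        # image_embeds: (batch, clip_dim) from the frozen CLIP image tower.
        pooled = speech_feats.mean(dim=1)  # simple mean pooling (assumption)
        speech_embeds = F.normalize(self.proj(pooled), dim=-1)
        image_embeds = F.normalize(image_embeds, dim=-1)
        # Symmetric InfoNCE over in-batch image-speech pairs, as in CLIP:
        # matched pairs sit on the diagonal of the similarity matrix.
        logits = self.logit_scale.exp() * speech_embeds @ image_embeds.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        loss = (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2
        return loss
```

At inference, image-speech retrieval ranks candidates by the same cosine similarities; and because CLIP's text encoder maps into the shared embedding space, speech-text retrieval falls out zero-shot, as the abstract claims.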
Pages: 715-722
Page count: 8
Related papers
50 records in total (entries [21]-[30] shown)
  • [21] Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding
    Dou, Zi-Yi
    Peng, Nanyun
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 6362 - 6371
  • [22] Comparing pre-trained language models for Spanish hate speech detection
    Plaza-del-Arco, Flor Miriam
    Molina-Gonzalez, M. Dolores
    Urena-Lopez, L. Alfonso
    Martin-Valdivia, M. Teresa
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 166
  • [23] Migratable urban street scene sensing method based on vision language pre-trained model
    Zhang, Yan
    Zhang, Fan
    Chen, Nengcheng
    INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2022, 113
  • [24] How to Estimate Model Transferability of Pre-Trained Speech Models?
    Chen, Zih-Ching
    Yang, Chao-Han Huck
    Li, Bo
    Zhang, Yu
    Chen, Nanxin
    Chang, Shuo-Yiin
    Prabhavalkar, Rohit
    Lee, Hung-yi
    Sainath, Tara N.
    INTERSPEECH 2023, 2023, : 456 - 460
  • [25] Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition
    Wang, Kuan-Chen
    Li, You-Jin
    Chen, Wei-Lun
    Chen, Yu-Wen
    Wang, Yi-Ching
    Yeh, Ping-Cheng
    Zhang, Chao
    Tsao, Yu
    32ND EUROPEAN SIGNAL PROCESSING CONFERENCE, EUSIPCO 2024, 2024, : 426 - 430
  • [26] Enhancing Language Generation with Effective Checkpoints of Pre-trained Language Model
    Park, Jeonghyeok
    Zhao, Hai
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 2686 - 2694
  • [27] Integrating Knowledge Graph Embeddings and Pre-trained Language Models in Hypercomplex Spaces
    Nayyeri, Mojtaba
    Wang, Zihao
    Akter, Mst. Mahfuja
    Alam, Mirza Mohtashim
    Rony, Md Rashad Al Hasan
    Lehmann, Jens
    Staab, Steffen
    SEMANTIC WEB, ISWC 2023, PART I, 2023, 14265 : 388 - 407
  • [28] SsciBERT: a pre-trained language model for social science texts
    Shen, Si
    Liu, Jiangfeng
    Lin, Litao
    Huang, Ying
    Zhang, Lin
    Liu, Chang
    Feng, Yutong
    Wang, Dongbo
    Scientometrics, 2023, 128 : 1241 - 1263
  • [29] A Pre-trained Clinical Language Model for Acute Kidney Injury
    Mao, Chengsheng
    Yao, Liang
    Luo, Yuan
    2020 8TH IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS (ICHI 2020), 2020, : 531 - 532
  • [30] Knowledge Enhanced Pre-trained Language Model for Product Summarization
    Yin, Wenbo
    Ren, Junxiang
    Wu, Yuejiao
    Song, Ruilin
    Liu, Lang
    Cheng, Zhen
    Wang, Sibo
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2022, PT II, 2022, 13552 : 263 - 273