SPEECHCLIP: INTEGRATING SPEECH WITH PRE-TRAINED VISION AND LANGUAGE MODEL

Cited by: 3
Authors
Shih, Yi-Jen [1 ]
Wang, Hsuan-Fu [1 ]
Chang, Heng-Jui [1 ,2 ]
Berry, Layne [3 ]
Lee, Hung-yi [1 ]
Harwath, David [3 ]
Affiliations
[1] Natl Taiwan Univ, Taipei, Taiwan
[2] MIT CSAIL, Cambridge, MA USA
[3] Univ Texas Austin, Austin, TX 78712 USA
Keywords
Visual grounding; vision and language; self-supervised learning; representation
DOI
10.1109/SLT54892.2023.10022954
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. We therefore propose SpeechCLIP, a novel framework that bridges speech and text through images to enhance speech models without transcriptions. We leverage the state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms the prior state-of-the-art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.
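The abstract describes aligning a pre-trained speech encoder (HuBERT) with CLIP's image embedding space using paired images and spoken captions. The following is a minimal PyTorch sketch of that idea, assuming mean-pooled HuBERT frame features, a single trainable projection head, and a CLIP-style symmetric contrastive (InfoNCE) loss; the module name, dimensions, and pooling choice are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechToCLIPAligner(nn.Module):
    """Trainable head mapping pooled speech features into CLIP's image space (illustrative)."""
    def __init__(self, speech_dim=768, clip_dim=512):
        super().__init__()
        # Assumed dims: 768 for HuBERT-base features, 512 for CLIP ViT-B embeddings.
        self.proj = nn.Linear(speech_dim, clip_dim)
        # Learnable temperature, initialized as in CLIP (ln(1/0.07)).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, speech_feats, image_embs):
        # speech_feats: (B, T, speech_dim) frame-level features from a frozen
        #               HuBERT-style encoder; image_embs: (B, clip_dim) from
        #               CLIP's frozen image encoder.
        speech_emb = F.normalize(self.proj(speech_feats.mean(dim=1)), dim=-1)
        image_emb = F.normalize(image_embs, dim=-1)
        logits = self.logit_scale.exp() * speech_emb @ image_emb.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE over the batch: speech-to-image and image-to-speech.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-ins for the frozen encoders' outputs.
aligner = SpeechToCLIPAligner()
speech_feats = torch.randn(4, 100, 768)  # 4 spoken captions, 100 frames each
image_embs = torch.randn(4, 512)         # 4 paired CLIP image embeddings
print(aligner(speech_feats, image_embs).item())

Freezing both encoders and training only a light projection head keeps fine-tuning minimal, matching the abstract's claim of aligning the two models "with minimal fine-tuning".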
Pages: 715-722
Number of pages: 8