SPEECHCLIP: INTEGRATING SPEECH WITH PRE-TRAINED VISION AND LANGUAGE MODEL

Cited by: 3
Authors
Shih, Yi-Jen [1]
Wang, Hsuan-Fu [1]
Chang, Heng-Jui [1, 2]
Berry, Layne [3]
Lee, Hung-yi [1]
Harwath, David [3]
Affiliations
[1] Natl Taiwan Univ, Taipei, Taiwan
[2] MIT CSAIL, Cambridge, MA USA
[3] Univ Texas Austin, Austin, TX 78712 USA
Keywords
Visual grounding; vision and language; self-supervised learning; REPRESENTATION
DOI
10.1109/SLT54892.2023.10022954
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms prior state-of-the-art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.
Pages: 715-722
Page count: 8
Related papers
50 in total
  • [21] Enhancing Language Generation with Effective Checkpoints of Pre-trained Language Model
    Park, Jeonghyeok
    Zhao, Hai
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 2686 - 2694
  • [22] Integrating Knowledge Graph Embeddings and Pre-trained Language Models in Hypercomplex Spaces
    Nayyeri, Mojtaba
    Wang, Zihao
    Akter, Mst. Mahfuja
    Alam, Mirza Mohtashim
    Rony, Md Rashad Al Hasan
    Lehmann, Jens
    Staab, Steffen
    [J]. SEMANTIC WEB, ISWC 2023, PART I, 2023, 14265 : 388 - 407
  • [23] Few-Shot NLG with Pre-Trained Language Model
    Chen, Zhiyu
    Eavani, Harini
    Chen, Wenhu
    Liu, Yinyin
    Wang, William Yang
    [J]. 58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 183 - 190
  • [24] SsciBERT: a pre-trained language model for social science texts
    Si Shen
    Jiangfeng Liu
    Litao Lin
    Ying Huang
    Lin Zhang
    Chang Liu
    Yutong Feng
    Dongbo Wang
    [J]. Scientometrics, 2023, 128 : 1241 - 1263
  • [25] A Pre-trained Clinical Language Model for Acute Kidney Injury
    Mao, Chengsheng
    Yao, Liang
    Luo, Yuan
    [J]. 2020 8TH IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS (ICHI 2020), 2020, : 531 - 532
  • [26] Knowledge Enhanced Pre-trained Language Model for Product Summarization
    Yin, Wenbo
    Ren, Junxiang
    Wu, Yuejiao
    Song, Ruilin
    Liu, Lang
    Cheng, Zhen
    Wang, Sibo
    [J]. NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2022, PT II, 2022, 13552 : 263 - 273
  • [27] ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence
    Hu, Yibo
    Hosseini, MohammadSaleh
    Parolin, Erick Skorupa
    Osorio, Javier
    Khan, Latifur
    Brandt, Patrick T.
    D'Orazio, Vito J.
    [J]. NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 5469 - 5482
  • [28] IndicBART: A Pre-trained Model for Indic Natural Language Generation
    Dabre, Raj
    Shrotriya, Himani
    Kunchukuttan, Anoop
    Puduppully, Ratish
    Khapra, Mitesh M.
    Kumar, Pratyush
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 1849 - 1863
  • [29] Pre-trained Language Model based Ranking in Baidu Search
    Zou, Lixin
    Zhang, Shengqiang
    Cai, Hengyi
    Ma, Dehong
    Cheng, Suqi
    Wang, Shuaiqiang
    Shi, Daiting
    Cheng, Zhicong
    Yin, Dawei
    [J]. KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 4014 - 4022
  • [30] Software Vulnerabilities Detection Based on a Pre-trained Language Model
    Xu, Wenlin
    Li, Tong
    Wang, Jinsong
    Duan, Haibo
    Tang, Yahui
    [J]. 2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 904 - 911