SPEECHCLIP: INTEGRATING SPEECH WITH PRE-TRAINED VISION AND LANGUAGE MODEL

Cited by: 6
Authors
Shih, Yi-Jen [1]
Wang, Hsuan-Fu [1]
Chang, Heng-Jui [1, 2]
Berry, Layne [3]
Lee, Hung-yi [1]
Harwath, David [3]
Affiliations
[1] Natl Taiwan Univ, Taipei, Taiwan
[2] MIT CSAIL, Cambridge, MA USA
[3] Univ Texas Austin, Austin, TX 78712 USA
Keywords
Visual grounding; vision and language; self-supervised learning; representation
DOI
10.1109/SLT54892.2023.10022954
Chinese Library Classification number
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms the prior state of the art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.
Pages: 715-722
Page count: 8