SPEECHCLIP: INTEGRATING SPEECH WITH PRE-TRAINED VISION AND LANGUAGE MODEL

Cited by: 3
Authors
Shih, Yi-Jen [1 ]
Wang, Hsuan-Fu [1 ]
Chang, Heng-Jui [1 ,2 ]
Berry, Layne [3 ]
Lee, Hung-yi [1 ]
Harwath, David [3 ]
Affiliations
[1] Natl Taiwan Univ, Taipei, Taiwan
[2] MIT CSAIL, Cambridge, MA USA
[3] Univ Texas Austin, Austin, TX 78712 USA
Keywords
Visual grounding; vision and language; self-supervised learning; representation
DOI
10.1109/SLT54892.2023.10022954
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms the prior state of the art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech.
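The abstract describes aligning a pre-trained speech encoder (HuBERT) with the CLIP embedding space using paired images and spoken captions. The following is a minimal sketch of that idea, not the authors' implementation: the class name, feature dimensions, mean-pooling, and the use of pre-computed encoder outputs are all illustrative assumptions; only plain PyTorch is used.

```python
# Minimal sketch (assumptions, not the paper's code): CLIP-style contrastive
# alignment between pooled speech features (e.g., from a frozen HuBERT) and
# image embeddings (e.g., from a frozen CLIP image encoder). Only the small
# projection layers are trained, mirroring the "minimal fine-tuning" idea.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechImageAligner(nn.Module):  # hypothetical name
    def __init__(self, speech_dim=768, image_dim=512, shared_dim=512):
        super().__init__()
        # Learned projections into a shared embedding space.
        self.speech_proj = nn.Linear(speech_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        # Learnable temperature, initialized to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, speech_feats, image_feats):
        # speech_feats: (B, T, speech_dim) frame-level features; mean-pool over time.
        s = F.normalize(self.speech_proj(speech_feats.mean(dim=1)), dim=-1)
        v = F.normalize(self.image_proj(image_feats), dim=-1)
        logits = self.logit_scale.exp() * s @ v.t()  # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE loss: speech-to-image and image-to-speech retrieval.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-ins for HuBERT frames and CLIP image embeddings.
model = SpeechImageAligner()
loss = model(torch.randn(8, 100, 768), torch.randn(8, 512))
loss.backward()
```

Once aligned this way, speech-image retrieval follows from ranking the similarity matrix, and, because CLIP's text encoder shares the same space, zero-shot speech-text retrieval becomes possible without transcriptions, as the abstract states.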
Pages: 715 - 722
Page count: 8
Related Papers
50 items in total
  • [41] Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
    Yu, Tiezheng
    Dai, Wenliang
    Liu, Zihan
    Fung, Pascale
    [J]. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 3995 - 4007
  • [42] AraXLNet: pre-trained language model for sentiment analysis of Arabic
    Alhanouf Alduailej
    Abdulrahman Alothaim
    [J]. Journal of Big Data, 9
  • [43] A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-Language Model
    Xu, Mengde
    Zhang, Zheng
    Wei, Fangyun
    Lin, Yutong
    Cao, Yue
    Hu, Han
    Bai, Xiang
    [J]. COMPUTER VISION, ECCV 2022, PT XXIX, 2022, 13689 : 736 - 753
  • [44] Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis
    An, Jieyu
    Zainon, Wan Mohd Nazmee Wan
    Ding, Binfen
    [J]. INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2023, 37 (02): : 1673 - 1689
  • [45] Integrating Pre-trained Model into Rule-based Dialogue Management
    Quan, Jun
    Yang, Meng
    Gan, Qiang
    Xiong, Deyi
    Liu, Yiming
    Dong, Yuchen
    Ouyang, Fangxin
    Tian, Jun
    Deng, Ruiling
    Li, Yongzhi
    Yang, Yang
    Jiang, Daxin
    [J]. THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 16097 - 16099
  • [46] ARoBERT: An ASR Robust Pre-Trained Language Model for Spoken Language Understanding
    Wang, Chengyu
    Dai, Suyang
    Wang, Yipeng
    Yang, Fei
    Qiu, Minghui
    Chen, Kehan
    Zhou, Wei
    Huang, Jun
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1207 - 1218
  • [47] Generalization of vision pre-trained models for histopathology
    Milad Sikaroudi
    Maryam Hosseini
    Ricardo Gonzalez
    Shahryar Rahnamayan
    H. R. Tizhoosh
    [J]. Scientific Reports, 13
  • [48] Generalization of vision pre-trained models for histopathology
    Sikaroudi, Milad
    Hosseini, Maryam
    Gonzalez, Ricardo
    Rahnamayan, Shahryar
    Tizhoosh, H. R.
    [J]. SCIENTIFIC REPORTS, 2023, 13 (01)
  • [49] Automatic Speech Recognition Dataset Augmentation with Pre-Trained Model and Script
    Kwon, Minsu
    Choi, Ho-Jin
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2019, : 649 - 651
  • [50] Automatic Prosody Annotation with Pre-Trained Text-Speech Model
    Dai, Ziqian
    Yu, Jianwei
    Wang, Yan
    Chen, Nuo
    Bian, Yanyao
    Li, Guangzhi
    Cai, Deng
    Yu, Dong
    [J]. INTERSPEECH 2022, 2022, : 5513 - 5517