SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding

Cited by: 0
Authors
Chung, Yu-An [1 ]
Zhu, Chenguang [2 ]
Zeng, Michael [2 ]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
[2] Microsoft Cognit Serv Grp, Redmond, WA USA
Keywords: (none listed)
DOI: (none)
CLC number: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Spoken language understanding (SLU) requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions. To boost model performance, various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. However, the inherent disparities between the two modalities necessitate a mutual analysis. In this paper, we propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules. Besides conducting a self-supervised masked language modeling task on the two individual modules using unpaired speech and text, SPLAT aligns representations from the two modules in a shared latent space using a small amount of paired speech and text. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and contextual semantic knowledge of an input acoustic signal. Experimental results verify the effectiveness of our approach on various SLU tasks. For example, SPLAT improves the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%.
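The abstract describes combining per-module self-supervised masked-modeling losses with an alignment term that pulls paired speech and text representations together in a shared latent space. As a rough illustration only (not the paper's actual implementation), the sketch below assumes sequence-level alignment via mean pooling and an L2 distance; the function names, the pooling choice, and the weighting scalar `lam` are all hypothetical.

```python
import numpy as np

def align_loss(speech_repr, text_repr):
    """Toy sequence-level alignment loss (an assumption, not SPLAT's exact form):
    mean-pool each (seq_len, dim) representation to one vector, then take the
    squared L2 distance between the pooled speech and text vectors."""
    s = speech_repr.mean(axis=0)  # (dim,) pooled speech representation
    t = text_repr.mean(axis=0)    # (dim,) pooled text representation
    return float(np.sum((s - t) ** 2))

def total_loss(mlm_speech, mlm_text, speech_repr, text_repr, lam=1.0):
    """Combine the two modules' masked-modeling losses (computed on unpaired
    data) with the alignment term (computed on paired data), weighted by a
    hypothetical scalar lam."""
    return mlm_speech + mlm_text + lam * align_loss(speech_repr, text_repr)

# Example: identical pooled representations incur zero alignment penalty,
# so the total reduces to the sum of the two masked-modeling losses.
speech = np.ones((5, 4))  # 5 speech frames, 4-dim features
text = np.ones((3, 4))    # 3 text tokens, 4-dim features
loss = total_loss(1.0, 2.0, speech, text)  # -> 3.0
```

The point of the sketch is the structure of the objective (unpaired self-supervision plus a small paired alignment term), which is what lets the speech module absorb contextual semantic knowledge from the text module.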
Pages: 1897-1907
Page count: 11
Related Papers
50 total
  • [31] Speech Characteristics in Female Students Training to Be Speech-Language Pathologists
    D'haeseleer, Evelien
    De Ley, Sophia
    Cosyns, Marjan
    Desomer, Els
    De Mesel, Jasmien
    Van Maele, George
    Van Lierde, Kristiane
    FOLIA PHONIATRICA ET LOGOPAEDICA, 2016, 68 (04) : 167 - 174
  • [32] Adaptive Training for Robust Spoken Language Understanding
    Garcia, Fernando
    Sanchis, Emilio
    Hurtado, Lluis-F.
    Segarra, Encarna
    PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2015, 2015, 9423 : 519 - 526
  • [33] Pre-training Language Models for Comparative Reasoning
    Yu, Mengxia
    Zhang, Zhihan
    Yu, Wenhao
    Jiang, Meng
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 12421 - 12433
  • [34] Sigmoid Loss for Language Image Pre-Training
    Zhai, Xiaohua
    Mustafa, Basil
    Kolesnikov, Alexander
    Beyer, Lucas
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11941 - 11952
  • [35] MarkupLM: Pre-training of Text and Markup Language for Visually Rich Document Understanding
    Li, Junlong
    Xu, Yiheng
    Cui, Lei
    Wei, Furu
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 6078 - 6087
  • [36] Grounded Language-Image Pre-training
    Li, Liunian Harold
    Zhang, Pengchuan
    Zhang, Haotian
    Yang, Jianwei
    Li, Chunyuan
    Zhong, Yiwu
    Wang, Lijuan
    Yuan, Lu
    Zhang, Lei
    Hwang, Jenq-Neng
    Chang, Kai-Wei
    Gao, Jianfeng
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10955 - 10965
  • [37] VILA: On Pre-training for Visual Language Models
    Lin, Ji
    Yin, Hongxu
    Ping, Wei
    Molchanov, Pavlo
    Shoeybi, Mohammad
    Han, Song
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 26679 - 26689
  • [38] RELATION ENHANCED VISION LANGUAGE PRE-TRAINING
    Lee, Ju-Hee
    Kang, Je-Won
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 2286 - 2290
  • [39] Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
    Jian, Yiren
    Gao, Chongyang
    Vosoughi, Soroush
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [40] Pre-Training Language Models for Identifying Patronizing and Condescending Language: An Analysis
    Perez-Almendros, Carla
    Espinosa-Anke, Luis
    Schockaert, Steven
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 3902 - 3911