Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding

Cited by: 0
Authors
Zhu, Yi [1 ,2 ]
Wang, Zexun [1 ]
Liu, Hang [1 ]
Wang, Peiying [1 ]
Feng, Mingchao [1 ]
Chen, Meng [1 ]
He, Xiaodong [1 ]
Affiliations
[1] JD AI, Beijing, Peoples R China
[2] Univ Cambridge, LTL, Cambridge, England
Source
INTERSPEECH 2022
Keywords
spoken language understanding; cross-modal transfer learning; cross attention; contrastive learning;
DOI
10.21437/Interspeech.2022-11378
CLC Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
End-to-end spoken language understanding (E2E-SLU) has witnessed impressive improvements through cross-modal (text-to-audio) transfer learning. However, current methods mostly focus on coarse-grained, sequence-level text-to-audio knowledge transfer with simple loss functions, neglecting the fine-grained temporal alignment between the two modalities. In this work, we propose a novel multi-grained cross-modal transfer learning framework for E2E-SLU. Specifically, we devise a cross-attention module to align text tokens with speech frame features, encouraging the model to focus on the salient acoustic features attended by each token when transferring semantic information. We also leverage contrastive learning to facilitate cross-modal representation learning at the sentence level. Finally, we explore various data augmentation methods to mitigate the shortage of labelled data for training E2E-SLU. Extensive experiments on both English and Chinese SLU datasets verify the effectiveness of the proposed approach. Experimental results and detailed analyses demonstrate the superiority and competitiveness of our model.
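As a rough illustration of the two alignment objectives described in the abstract, the PyTorch-style sketch below pairs token-to-frame cross attention (fine-grained alignment) with a sentence-level InfoNCE-style contrastive loss. All module names, dimensions, pooling choices, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: module names, dimensions, and hyperparameters are
# assumptions; this is not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGrainedAlignment(nn.Module):
    def __init__(self, text_dim=768, speech_dim=512, hidden=512, temperature=0.07):
        super().__init__()
        # Project both modalities into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.speech_proj = nn.Linear(speech_dim, hidden)
        # Cross attention: text tokens (queries) attend over speech frames
        # (keys/values), so each token picks out its most relevant acoustic frames.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.temperature = temperature

    def forward(self, text_hidden, speech_hidden):
        # text_hidden:   (B, T_tok, text_dim)   token-level text encoder outputs
        # speech_hidden: (B, T_frm, speech_dim) frame-level speech encoder outputs
        txt = self.text_proj(text_hidden)
        spc = self.speech_proj(speech_hidden)

        # Fine-grained (token-to-frame) alignment via cross attention.
        aligned, _ = self.cross_attn(query=txt, key=spc, value=spc)
        # Token-level transfer loss: pull each token's attended acoustic summary
        # toward that token's textual representation.
        token_loss = F.mse_loss(aligned, txt)

        # Coarse-grained (sentence-level) contrastive loss between pooled embeddings.
        txt_sent = F.normalize(txt.mean(dim=1), dim=-1)   # (B, hidden)
        spc_sent = F.normalize(spc.mean(dim=1), dim=-1)   # (B, hidden)
        logits = txt_sent @ spc_sent.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: matched text/speech pairs are positives,
        # all other pairs in the batch serve as negatives.
        contrastive_loss = 0.5 * (F.cross_entropy(logits, targets) +
                                  F.cross_entropy(logits.t(), targets))
        return token_loss, contrastive_loss
```

A minimal usage check, with random tensors standing in for encoder outputs (the weighting between the two loss terms is likewise an assumption):

```python
align = MultiGrainedAlignment()
tok_loss, con_loss = align(torch.randn(4, 20, 768), torch.randn(4, 300, 512))
loss = tok_loss + con_loss
```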
Pages: 1131-1135
Page count: 5
Related Papers
50 records in total
  • [1] Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning
    Denisov, Pavel
    Vu, Ngoc Thang
    [J]. INTERSPEECH 2020, 2020, : 881 - 885
  • [2] Exploring Transfer Learning For End-to-End Spoken Language Understanding
    Rongali, Subendhu
    Liu, Beiye
    Cai, Liwei
    Arkoudas, Konstantine
    Su, Chengwei
    Hamza, Wael
    [J]. THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 13754 - 13761
  • [3] TIE YOUR EMBEDDINGS DOWN: CROSS-MODAL LATENT SPACES FOR END-TO-END SPOKEN LANGUAGE UNDERSTANDING
    Agrawal, Bhuvan
    Muller, Markus
    Choudhary, Samridhi
    Radfar, Martin
    Mouchtaris, Athanasios
    McGowan, Ross
    Susanj, Nathan
    Kunzmann, Siegfried
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7157 - 7161
  • [4] ST-BERT: CROSS-MODAL LANGUAGE MODEL PRE-TRAINING FOR END-TO-END SPOKEN LANGUAGE UNDERSTANDING
    Kim, Minjeong
    Kim, Gyuwan
    Lee, Sang-Woo
    Ha, Jung-Woo
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7478 - 7482
  • [5] Multi-grained Representation Learning for Cross-modal Retrieval
    Zhao, Shengwei
    Xu, Linhai
    Liu, Yuying
    Du, Shaoyi
    [J]. PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 2194 - 2198
  • [6] Investigating Adaptation and Transfer Learning for End-to-End Spoken Language Understanding from Speech
    Tomashenko, Natalia
    Caubriere, Antoine
    Esteve, Yannick
    [J]. INTERSPEECH 2019, 2019, : 824 - 828
  • [7] Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning
    Cheng, Xuxin
    Xu, Wanshi
    Zhu, Zhihong
    Li, Hongxiang
    Zou, Yuexian
    [J]. PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 326 - 336
  • [8] TOWARDS END-TO-END SPOKEN LANGUAGE UNDERSTANDING
    Serdyuk, Dmitriy
    Wang, Yongqiang
    Fuegen, Christian
    Kumar, Anuj
    Liu, Baiyang
    Bengio, Yoshua
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5754 - 5758
  • [9] Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning
    Zhang, Hao
    Si, Nianwen
    Chen, Yaqi
    Zhang, Wenlin
    Yang, Xukui
    Qu, Dan
    Zhang, Wei-Qiang
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 1075 - 1086
  • [10] End-to-end Speech Translation via Cross-modal Progressive Training
    Ye, Rong
    Wang, Mingxuan
    Li, Lei
    [J]. INTERSPEECH 2021, 2021, : 2267 - 2271