Exploring Transfer Learning For End-to-End Spoken Language Understanding

Citations: 0
|
Authors
Rongali, Subendhu [1 ,2 ]
Liu, Beiye [1 ]
Cai, Liwei [1 ]
Arkoudas, Konstantine [1 ]
Su, Chengwei [1 ]
Hamza, Wael [1 ]
Affiliations
[1] Amazon Alexa AI, New York, NY 10003 USA
[2] Univ Massachusetts, Amherst, MA 01003 USA
Keywords
SPEECH;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Voice assistants such as Alexa, Siri, and Google Assistant typically use a two-stage Spoken Language Understanding (SLU) pipeline: first, an Automatic Speech Recognition (ASR) component processes customer speech and generates text transcriptions; then a Natural Language Understanding (NLU) component maps those transcriptions to an actionable hypothesis. An end-to-end (E2E) system that goes directly from speech to a hypothesis is a more attractive option. Such systems have been shown to be smaller, faster, and better optimized. However, they require massive amounts of end-to-end training data and, in addition, do not take advantage of the ASR and NLU training data that is already available. In this work, we propose an E2E system designed to jointly train on multiple speech-to-text tasks, such as ASR (speech-transcription) and SLU (speech-hypothesis), and text-to-text tasks, such as NLU (text-hypothesis). We call this the Audio-Text All-Task (AT-AT) model, and we show that it beats the performance of E2E models trained on individual tasks, especially ones trained on limited data. We demonstrate this result on an internal music dataset and two public datasets, FluentSpeech and SNIPS Audio, where we achieve state-of-the-art results. Since our model can process both speech and text input sequences and learn to predict a target sequence, it also allows us to perform zero-shot E2E SLU by training on only text-hypothesis data (without any speech) from a new domain. We evaluate this ability of our model on the Facebook TOP dataset and set a new benchmark for zero-shot E2E performance. We release the audio data collected for the TOP dataset for future research.
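The joint training setup described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the key idea is that every task, whether its input is audio (ASR, SLU) or text (NLU), is cast as prediction of a target token sequence, so a single encoder-decoder can train on a mixture of all tasks. All function names and the shuffled mixing scheme below are hypothetical.

```python
# Hypothetical sketch of AT-AT-style multi-task data mixing: each task
# becomes a uniform (source -> target sequence) record, so one model
# interface covers ASR, SLU, and NLU examples in the same batches.
import random

def make_example(task, source, target):
    """Wrap any source->target pair in a uniform record.
    `source` is audio features for ASR/SLU, text tokens for NLU."""
    return {"task": task, "source": source, "target": target}

def mixed_batches(datasets, batch_size=4, seed=0):
    """Shuffle examples from all task datasets together, so each
    gradient step can see a mixture of audio-input and text-input tasks."""
    rng = random.Random(seed)
    pool = [ex for data in datasets.values() for ex in data]
    rng.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]

# Toy instances of the three task types named in the abstract.
asr = [make_example("ASR", ["<audio:play_jazz>"], ["play", "jazz"])]
slu = [make_example("SLU", ["<audio:play_jazz>"],
                    ["PlayMusic", "genre=jazz"])]
nlu = [make_example("NLU", ["play", "jazz"],
                    ["PlayMusic", "genre=jazz"])]

batches = list(mixed_batches({"asr": asr, "slu": slu, "nlu": nlu}))
```

Because the NLU task needs no audio at all, text-hypothesis data from a new domain slots into the same mixture, which is what makes the zero-shot E2E SLU transfer described above possible.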
Pages: 13754-13761
Page count: 8