End-to-end ASR to jointly predict transcriptions and linguistic annotations

Cited by: 0
Authors
Omachi, Motoi [1 ]
Fujita, Yuya [1 ]
Watanabe, Shinji [2 ]
Wiesner, Matthew [2 ]
Affiliations
[1] Yahoo Japan Corp, Tokyo, Japan
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD USA
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
We propose a Transformer-based sequence-to-sequence model for automatic speech recognition (ASR) capable of simultaneously transcribing and annotating audio with linguistic information such as phonemic transcripts or part-of-speech (POS) tags. Since linguistic information is important in natural language processing (NLP), the proposed ASR is especially useful for speech interface applications, including spoken dialogue systems and speech translation, which combine ASR and NLP. To produce linguistic annotations, we train the ASR system using modified training targets: each grapheme or multi-grapheme unit in the target transcript is followed by an aligned phoneme sequence and/or POS tag. Since our method has access to the underlying audio data, we can estimate linguistic annotations more accurately than pipeline approaches in which NLP-based methods are applied to a hypothesized ASR transcript. Experimental results on Japanese and English datasets show that the proposed ASR system is capable of simultaneously producing high-quality transcriptions and linguistic annotations.
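To make the described target format concrete, the following is a minimal Python sketch (not code from the paper; the delimiter symbols, phoneme inventory, and POS labels are illustrative assumptions) showing how a transcript could be turned into an annotated training target by following each word unit with its aligned phoneme sequence and POS tag.

    # Minimal sketch of annotated-target construction, assuming a simple
    # delimiter convention: "/..." marks the aligned phoneme sequence and
    # "<...>" marks the POS tag of the preceding word unit.

    def build_annotated_target(words, phonemes_per_word, pos_per_word):
        """Interleave each word with its phoneme sequence and POS tag."""
        assert len(words) == len(phonemes_per_word) == len(pos_per_word)
        tokens = []
        for word, phonemes, pos in zip(words, phonemes_per_word, pos_per_word):
            tokens.append(word)                      # grapheme / multi-grapheme unit
            tokens.append("/" + " ".join(phonemes))  # aligned phoneme sequence
            tokens.append("<" + pos + ">")           # aligned POS tag
        return " ".join(tokens)

    # Hypothetical example with assumed phonemes and POS tags.
    words = ["read", "books"]
    phonemes = [["R", "IY", "D"], ["B", "UH", "K", "S"]]
    pos_tags = ["VERB", "NOUN"]

    print(build_annotated_target(words, phonemes, pos_tags))
    # -> read /R IY D <VERB> books /B UH K S <NOUN>

In this sketch the annotated string would serve as the decoder's training target, so the model learns to emit the phonemes and POS tag for each unit directly from the audio rather than from a hypothesized transcript.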
Pages: 1861-1871 (11 pages)
Related papers (50 in total)
  • [1] DOES SPEECH ENHANCEMENT WORK WITH END-TO-END ASR OBJECTIVES?: EXPERIMENTAL ANALYSIS OF MULTICHANNEL END-TO-END ASR
    Ochiai, Tsubasa
    Watanabe, Shinji
    Katagiri, Shigeru
    [J]. 2017 IEEE 27TH INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, 2017,
  • [2] Streaming End-to-End ASR Using CTC Decoder and DRA for Linguistic Information Substitution
    Takagi, Tatsunari
    Ogawa, Atsunori
    Kitaoka, Norihide
    Wakabayashi, Yukoh
    [J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1779 - 1783
  • [3] Towards Lifelong Learning of End-to-end ASR
    Chang, Heng-Jui
    Lee, Hung-yi
    Lee, Lin-shan
    [J]. INTERSPEECH 2021, 2021, : 2551 - 2555
  • [4] Contextual Biasing for End-to-End Chinese ASR
    Zhang, Kai
    Zhang, Qiuxia
    Wang, Chung-Che
    Jang, Jyh-Shing Roger
    [J]. IEEE ACCESS, 2024, 12 : 92960 - 92975
  • [5] Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
    Higuchi, Yosuke
    Watanabe, Shinji
    Chen, Nanxin
    Ogawa, Tetsuji
    Kobayashi, Tetsunori
    [J]. INTERSPEECH 2020, 2020, : 3655 - 3659
  • [6] UNSUPERVISED MODEL ADAPTATION FOR END-TO-END ASR
    Sivaraman, Ganesh
    Casal, Ricardo
    Garland, Matt
    Khoury, Elie
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6987 - 6991
  • [7] End-to-End Topic Classification without ASR
    Dong, Zexian
    Liu, Jia
    Zhang, Wei-Qiang
    [J]. 2019 IEEE 19TH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY (ISSPIT 2019), 2019,
  • [8] Phonemic competition in end-to-end ASR models
    ten Bosch, Louis
    Bentum, Martijn
    Boves, Lou
    [J]. INTERSPEECH 2023, 2023, : 586 - 590
  • [9] ASR-AWARE END-TO-END NEURAL DIARIZATION
    Khare, Aparna
    Han, Eunjung
    Yang, Yuguang
    Stolcke, Andreas
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8092 - 8096
  • [10] End-to-End Speaker-Attributed ASR with Transformer
    Kanda, Naoyuki
    Ye, Guoli
    Gaur, Yashesh
    Wang, Xiaofei
    Meng, Zhong
    Chen, Zhuo
    Yoshioka, Takuya
    [J]. INTERSPEECH 2021, 2021, : 4413 - 4417