Speaker voice normalization for end-to-end speech translation

Authors
Xue, Zhengshan [1]
Shi, Tingxun
Zhang, Xiaolei
Xiong, Deyi [1]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
Keywords
Machine translation; Speech translation; Speaker normalization
DOI
10.1016/j.eswa.2024.123317
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Speaker voices exhibit acoustic variation. Our preliminary experiments reveal that normalized voice can significantly improve end-to-end speech translation. To mitigate the negative impact of acoustic voice variation across speakers on speech translation, we propose SVN-ST, a Speaker-Voice-Normalized end-to-end Speech Translation framework. In SVN-ST, we use synthetic speech inputs generated by a Text-to-Speech system to complement raw speech inputs. To make use of the synthetic speech inputs, we introduce two essential components for SVN-ST: an alignment adapter at the encoder side and a normalized speech knowledge distillation module at the decoder side. The former forces the representations of raw speech inputs to be close to those of synthetic (normalized) speech inputs, while the latter guides the translations of raw speech inputs with those produced from synthetic speech inputs. Two additional losses are defined, one for each component. Experimental results on the MuST-C benchmark dataset demonstrate that SVN-ST outperforms previous state-of-the-art end-to-end non-normalized speech translation systems by 0.4 BLEU and cascaded speech translation systems by 2.3 BLEU. On the CoVoST 2 test set, SVN-ST is also more robust than other normalized speech methods. Further analyses suggest that our model effectively aligns speech representations from different speakers, enhances robustness, and significantly improves sentence-level translation quality.
Pages: 11
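
The two auxiliary losses mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give their exact form, so everything here is an assumption: the encoder-side alignment term is written as a mean squared error between raw and synthetic speech representations, the decoder-side distillation term as a KL divergence from the synthetic-input (teacher) distribution to the raw-input (student) distribution, and the function names and the weights `align_weight` and `kd_weight` are hypothetical. It is not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(raw_repr, synth_repr):
    """Assumed form: mean squared error between two equal-length vectors.

    raw_repr   -- encoder representation of the raw speech input
    synth_repr -- encoder representation of the synthetic (normalized) input
    """
    if len(raw_repr) != len(synth_repr):
        raise ValueError("representations must have the same length")
    squared = [(r - s) ** 2 for r, s in zip(raw_repr, synth_repr)]
    return sum(squared) / len(squared)


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Assumed form: KL(teacher || student) for one decoding step.

    student_logits -- decoder logits obtained from the raw speech input
    teacher_logits -- decoder logits obtained from the synthetic speech input
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


def total_loss(translation_loss, raw_repr, synth_repr,
               student_logits, teacher_logits,
               align_weight=1.0, kd_weight=1.0):
    """Hypothetical weighted sum of the translation loss and the two extras."""
    return (translation_loss
            + align_weight * alignment_loss(raw_repr, synth_repr)
            + kd_weight * distillation_loss(student_logits, teacher_logits))
```

In this sketch both extra terms vanish when the raw and synthetic inputs yield identical representations and output distributions, which matches the stated goal of pulling raw-speech behavior toward that of normalized speech.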