Speaker voice normalization for end-to-end speech translation

Cited by: 1
Authors
Xue, Zhengshan [1]
Shi, Tingxun
Zhang, Xiaolei
Xiong, Deyi [1]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
Keywords
Machine translation; Speech translation; Speaker normalization;
DOI
10.1016/j.eswa.2024.123317
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Speaker voices exhibit acoustic variation. Our preliminary experiments reveal that normalizing speaker voices can significantly improve end-to-end speech translation. To mitigate the negative impact of acoustic voice variation across speakers on speech translation, we propose SVN-ST, a Speaker-Voice-Normalized end-to-end Speech Translation framework. In SVN-ST, we use synthetic speech inputs generated by a Text-to-Speech system to complement raw speech inputs. To make use of the synthetic speech inputs, we introduce two essential components for SVN-ST: an alignment adapter on the encoder side and a normalized speech knowledge distillation module on the decoder side. The former forces the representations of raw speech inputs to be close to those of synthetic (normalized) speech inputs, while the latter guides the translations of raw speech inputs with those produced from synthetic speech inputs. Two additional losses are defined, one for each component. Experimental results on the MuST-C benchmark dataset demonstrate that SVN-ST outperforms previous state-of-the-art end-to-end non-normalized speech translation systems by 0.4 BLEU and cascaded speech translation systems by 2.3 BLEU. On the CoVoST 2 test set, SVN-ST also outperforms other normalized speech methods in robustness. Further analyses suggest that our model effectively aligns speech representations from different speakers, enhances robustness, and significantly improves sentence-level translation quality.
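The abstract names two auxiliary losses but does not give their form. The sketch below is one plausible reading, not the authors' implementation: it assumes a mean-squared-error alignment loss between raw and synthetic encoder representations, and a KL-divergence distillation loss that treats the synthetic-speech pass as teacher and the raw-speech pass as student. All function names and the weights `alpha` and `beta` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(raw_repr, synth_repr):
    """Assumed form: mean squared error between the encoder representation
    of raw speech and that of synthetic (normalized) speech."""
    if len(raw_repr) != len(synth_repr):
        raise ValueError("representations must have the same length")
    return sum((r - s) ** 2 for r, s in zip(raw_repr, synth_repr)) / len(raw_repr)


def distillation_loss(student_logits, teacher_logits):
    """Assumed form: KL(teacher || student) for one target-token distribution.
    Teacher = decoder output for synthetic speech; student = raw speech."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_loss(translation_loss, raw_repr, synth_repr,
               student_logits, teacher_logits, alpha=1.0, beta=1.0):
    """Translation loss plus the two auxiliary terms; alpha and beta are
    hypothetical weights, not values reported in the abstract."""
    return (translation_loss
            + alpha * alignment_loss(raw_repr, synth_repr)
            + beta * distillation_loss(student_logits, teacher_logits))
```

Under these assumptions both auxiliary terms vanish when the raw-speech pass matches the synthetic-speech pass exactly, so the objective reduces to the ordinary translation loss in that case.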
Pages: 11