ANALYSIS OF LAYER-WISE TRAINING IN DIRECT SPEECH TO SPEECH TRANSLATION USING BI-LSTM

Cited by: 4
Authors
Arya, Lalaram [1]
Agarwal, Ayush [1]
Mishra, Jagabandhu [1]
Prasanna, S. R. Mahadeva [1]
Affiliations
[1] Indian Inst Technol Dharwad, Dept Elect Engn, Dharwad 580011, India
Keywords
Speech to speech translation (S2ST); Voice Conversion (VC); Bidirectional long short term memory (Bi-LSTM)
DOI
10.1109/O-COCOSDA202257103.2022.9997945
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Speech-to-speech translation (S2ST) is the translation of speech in one language into speech in another. Traditional S2ST systems follow a cascaded approach, in which three modules, automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS), are chained to produce the final translated utterance. Because the system is cascaded, errors propagate from one module to the next, which degrades the overall performance of the S2ST task. With the evolution of deep learning approaches to speech processing, many attempts have been made to perform end-to-end and direct speech-to-speech translation (DS2ST), but most of these approaches rely on language transcripts in one way or another. In this work, we aim to perform the DS2ST task without using language transcripts. To this end, we performed three experiments. First, we investigated how direct learning of the mapping function from source to target language changes as the number of training utterances increases. Second, we analyzed how the learned mapping improves as the number of Bi-LSTM layers increases. Third, we observed how the system behaves at inference time with unknown speakers (speakers not used during training). The experiments show that translation quality improves as the number of utterances and layers increases. In addition, with speaker- and text-dependent training on approximately 4.4 hours of speech, the model can generate the target-language utterance even for unknown speakers. Although the quality of the generated utterances is limited, they are intelligible to some extent.
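The second experiment varies the number of Bi-LSTM layers. As a rough illustration of how model capacity grows with depth, the sketch below counts the trainable parameters of a stacked Bi-LSTM using the standard LSTM gate equations. The feature dimension (80) and hidden size (256) are illustrative assumptions, not values taken from the paper, and the helper names are hypothetical. The count assumes one bias vector per gate; some frameworks (PyTorch, for example) store two, which adds slightly more parameters.

```python
def lstm_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one unidirectional LSTM layer (one bias vector per gate)."""
    # Four gates (input, forget, cell, output); each has an input weight
    # matrix, a recurrent weight matrix, and a bias vector.
    per_gate = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return 4 * per_gate


def bilstm_stack_params(feature_dim: int, hidden_size: int, num_layers: int) -> int:
    """Parameters of a stack of bidirectional LSTM layers."""
    total = 0
    in_size = feature_dim
    for _ in range(num_layers):
        # One forward and one backward LSTM per layer.
        total += 2 * lstm_params(in_size, hidden_size)
        # The next layer sees the concatenated forward/backward outputs.
        in_size = 2 * hidden_size
    return total


if __name__ == "__main__":
    FEATURE_DIM = 80   # assumed acoustic feature size (hypothetical)
    HIDDEN_SIZE = 256  # assumed hidden size (hypothetical)
    for layers in (1, 2, 3, 4):
        count = bilstm_stack_params(FEATURE_DIM, HIDDEN_SIZE, layers)
        print(f"{layers} layer(s): {count} parameters")
```

Under these assumptions, every layer after the first adds the same number of parameters, because its input is always the concatenated output of the previous layer; the count therefore grows linearly with depth beyond the first layer.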
Pages: 6