Preserving Word-Level Emphasis in Speech-to-Speech Translation

Cited by: 18
Authors
Do, Quoc Truong [1]
Toda, Tomoki [2]
Neubig, Graham [1]
Sakti, Sakriani [1]
Nakamura, Satoshi [1]
Affiliations
[1] Nara Inst Sci & Technol, Grad Sch Informat Sci, Nara 6300192, Japan
[2] Nagoya Univ, Ctr Informat Technol, Nagoya, Aichi 4648601, Japan
Keywords
Emphasis estimation; word-level emphasis; intent; emphasis translation; speech-to-speech translation
DOI
10.1109/TASLP.2016.2643280
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Speech-to-speech translation (S2ST) is a technology that translates speech across languages, which can remove barriers in cross-lingual communication. In conventional S2ST systems, the linguistic meaning of speech is translated, but paralinguistic information conveying other features of the speech, such as emotion or emphasis, is ignored. In this paper, we propose a method to translate paralinguistic information, specifically focusing on emphasis. The method consists of a series of components that can accurately translate emphasis using all acoustic features of speech. First, linear-regression hidden semi-Markov models (LR-HSMMs) are used to estimate a real-numbered emphasis value for every word in an utterance, resulting in a sequence of values for the utterance. Next, the emphasis translation module translates the estimated emphasis sequence into a target-language emphasis sequence using a conditional random field model that takes emphasis levels, words, and part-of-speech tags as features. Finally, the speech synthesis module synthesizes emphasized speech with LR-HSMMs, taking into account the translated emphasis sequence and the transcription. The results indicate that our translation model can translate emphasis information, correctly emphasizing words in the target language with 91.6% F-measure in an objective evaluation. A listening test with human subjects further showed that they could identify the emphasized words with 87.8% F-measure, and that the naturalness of the audio was preserved.
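The abstract reports both results as F-measures over emphasized words. As a rough illustration only, the sketch below computes an F-measure from binary word-level emphasis labels. The abstract does not say how the real-valued emphasis estimates are mapped to emphasized/non-emphasized decisions, so the binary labels, the function name `emphasis_f_measure`, and the example sequences are all assumptions made for this sketch and are not taken from the paper.

```python
def emphasis_f_measure(reference, predicted):
    """F-measure for binary word-level emphasis labels (1 = emphasized).

    Hypothetical helper: assumes both sequences are aligned word by word.
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(reference, predicted))
    true_pos = sum(1 for r, p in pairs if r == 1 and p == 1)
    false_pos = sum(1 for r, p in pairs if r == 0 and p == 1)
    false_neg = sum(1 for r, p in pairs if r == 1 and p == 0)

    # Guard against empty denominators when nothing is predicted or labeled.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: six target-language words, two of them truly emphasized,
# with one extra word wrongly marked as emphasized by the system.
reference = [0, 1, 0, 0, 1, 0]
predicted = [0, 1, 0, 1, 1, 0]
score = emphasis_f_measure(reference, predicted)
print(f"F-measure: {score:.3f}")
```

In this made-up example the precision is 2/3 and the recall is 1, so the harmonic mean is 0.8. The figures in the abstract (91.6% and 87.8%) come from the authors' own evaluation setup, which this sketch does not reproduce.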
Pages: 544-556
Page count: 13