Customization of IBM Intu's Voice by Connecting Text-to-Speech Services with a Voice Conversion Network

Cited by: 0
Authors
Song, Jongyoon [1 ]
Lee, Jaekoo [1 ]
Kim, Hyunjae [1 ]
Choi, Euishin [2 ]
Kim, Minseok [3 ]
Yoon, Sungroh [1 ]
Affiliations
[1] Seoul Natl Univ, ECE, Seoul, South Korea
[2] IBM Korea, Client Innovat Lab, Seoul, South Korea
[3] IBM Korea, Developer Outreach Team, Seoul, South Korea
Funding
National Research Foundation of Singapore
Keywords
(none listed)
DOI
Not available
CLC number
TP301 [Theory and Methods]
Discipline code
081202
Abstract
IBM has recently launched Project Intu, which extends the existing web-based cognitive service Watson with the Internet of Things to provide an intelligent personal assistant service. We propose a voice customization service that allows a user to directly customize the voice of Intu. The method is based on IBM Watson's text-to-speech service and a voice conversion model. A user can train the voice conversion model by providing a minimum of approximately 100 speech samples in the preferred voice (the target voice); the output voice of Intu (the source voice) is then converted into the target voice. Furthermore, the user does not need to provide parallel data for the target voice, since the transcriptions of the source and target speech are the same. We also suggest methods to maximize the efficiency of voice conversion and to determine the proper amount of target speech, based on several experiments. Measuring the elapsed time of each process, we observed that feature extraction accounts for 59.7% of the voice conversion time, which implies that removing inefficiencies in feature extraction should be the first priority. Using the mel-cepstral distortion between the target speech and the reconstructed speech as an index of conversion accuracy, we found that the model's performance generally degrades when fewer than 100 target speech samples are used for training.
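The abstract uses mel-cepstral distortion (MCD) as its index of conversion accuracy. As an illustration only (not the authors' implementation), a minimal MCD computation over pre-aligned mel-cepstral frames might look like the sketch below; the frame data, the exclusion of the energy coefficient c_0, and the time alignment are assumptions on our part:

```python
import math

def mel_cepstral_distortion(frames_ref, frames_conv):
    """Average mel-cepstral distortion (in dB) over pre-aligned frame pairs.

    Each frame is a sequence of mel-cepstral coefficients c_1..c_D
    (the energy term c_0 is conventionally excluded).
    """
    if len(frames_ref) != len(frames_conv):
        raise ValueError("frame sequences must be time-aligned and equal in length")
    scale = 10.0 / math.log(10.0)  # converts the log-spectral difference to decibels
    total = 0.0
    for ref, conv in zip(frames_ref, frames_conv):
        sq_diff = sum((r - c) ** 2 for r, c in zip(ref, conv))
        total += scale * math.sqrt(2.0 * sq_diff)
    return total / len(frames_ref)

# Identical frames give zero distortion; any perturbation gives a positive value.
a = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
print(mel_cepstral_distortion(a, a))  # 0.0
```

Lower MCD indicates converted speech closer to the target; in practice the frames would come from a feature extractor and be aligned (e.g. by dynamic time warping) before this comparison.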
Pages: 830-839 (10 pages)