High-quality Voice Conversion Using Spectrogram-Based WaveNet Vocoder

Cited by: 0
Authors
Chen, Kuan [1]
Chen, Bo [1]
Lai, Jiahao [1]
Yu, Kai [1]
Affiliation
[1] Shanghai Jiao Tong Univ, Key Lab Shanghai Educ Commiss Intelligent Interac, Brain Sci & Technol Res Ctr, SpeechLab, Dept Comp Sci & Engn, Shanghai, Peoples R China
Keywords
voice conversion; WaveNet vocoder; mel-frequency spectrogram; LSTM-RNN; SYSTEM; TIME;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The waveform generator is a key component in voice conversion. Recently, a WaveNet waveform generator conditioned on Mel-cepstrum (Mcep) features has shown better quality than standard vocoders. In this paper, an enhanced WaveNet model based on the spectrogram is proposed to further improve voice conversion performance. Here, the Mel-frequency spectrogram is converted from the source speaker to the target speaker using an LSTM-RNN based frame-to-frame feature mapping. To evaluate its performance, the proposed approach is compared to an Mcep-based LSTM-RNN voice conversion system. Both the STRAIGHT vocoder and an Mcep-based WaveNet vocoder are selected to produce the converted speech for the Mcep conversion system. The fundamental frequency (F0) of the converted speech in the different systems is analyzed. Naturalness, similarity and intelligibility are evaluated with subjective measures. Results show that the spectrogram-based WaveNet waveform generator achieves better voice conversion quality than traditional WaveNet approaches. The Mel-spectrogram based voice conversion achieves significant improvement in speaker similarity and inherent F0 conversion.
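As a rough illustration of the Mel-frequency spectrogram feature named in the abstract, the sketch below computes one log-Mel frame from a power spectrum using only the Python standard library. It is not the authors' feature extraction: the HTK-style mel formula, the triangular filters, and every function and parameter name (`hz_to_mel`, `mel_band_edges`, `mel_frame`, `n_mels`, `floor`) are assumptions made for this example, and the abstract does not state the number of Mel bands or the sampling rate.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Convert Hz to mel (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Return n_mels + 2 edge frequencies (Hz), evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


def triangular_weight(f_hz: float, left: float, center: float, right: float) -> float:
    """Weight of frequency f_hz under a triangle peaking at `center`."""
    if f_hz <= left or f_hz >= right:
        return 0.0
    if f_hz <= center:
        return (f_hz - left) / (center - left)
    return (right - f_hz) / (right - center)


def mel_frame(power_spectrum: List[float], sample_rate: float,
              n_mels: int, floor: float = 1e-10) -> List[float]:
    """Map one power-spectrum frame (bins from 0 Hz to Nyquist) to log-mel energies."""
    n_bins = len(power_spectrum)
    nyquist = sample_rate / 2.0
    bin_hz = [k * nyquist / (n_bins - 1) for k in range(n_bins)]
    edges = mel_band_edges(n_mels, 0.0, nyquist)
    frame = []
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        energy = sum(triangular_weight(f, left, center, right) * p
                     for f, p in zip(bin_hz, power_spectrum))
        # Floor the energy so empty bands do not produce log(0).
        frame.append(math.log(max(energy, floor)))
    return frame
```

In a frame-to-frame conversion setup like the one the abstract describes, a vector of this kind would be computed for every analysis frame of the source speech and mapped to the corresponding target-speaker frame; the mapping model itself is outside the scope of this sketch.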
Pages: 1993-1997
Number of pages: 5