Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System

Cited by: 17
Authors
Hono, Yukiya [1]
Hashimoto, Kei [1, 2]
Oura, Keiichiro [2 ]
Nankaku, Yoshihiko [3 ]
Tokuda, Keiichi [4 ]
Affiliations
[1] Nagoya Inst Technol, Comp Sci, Nagoya, Aichi 4668555, Japan
[2] Nagoya Inst Technol, Comp Sci & Engn, Nagoya, Aichi 4668555, Japan
[3] Nagoya Inst Technol, Dept Elect & Elect Engn, Nagoya, Aichi 4668555, Japan
[4] Nagoya Inst Technol, Elect & Elect Engn, Nagoya, Aichi 4668555, Japan
Keywords
Acoustics; Hidden Markov models; Feature extraction; Training; Predictive models; Music; Training data; Automatic pitch correction; neural network; singing voice synthesis; timing modeling; vibrato modeling
DOI
10.1109/TASLP.2021.3104165
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
This paper presents Sinsy, a deep neural network (DNN)-based singing voice synthesis (SVS) system. In recent years, DNNs have been utilized in statistical parametric SVS systems, and DNN-based SVS systems have demonstrated better performance than conventional hidden Markov model-based ones. SVS systems are required to synthesize a singing voice with pitch and timing that strictly follow a given musical score. Additionally, singing expressions that are not described in the musical score, such as vibrato and timing fluctuations, should be reproduced. The proposed system is composed of four modules: a time-lag model, a duration model, an acoustic model, and a vocoder. Together, these modules synthesize singing voices that take these characteristics into account. To better model a singing voice, the proposed system incorporates improved approaches to modeling pitch and vibrato, as well as better training criteria, into the acoustic model. In addition, we incorporated PeriodNet, a non-autoregressive neural vocoder that is robust to pitch, into our system to generate a high-fidelity singing voice waveform. Moreover, we propose automatic pitch correction techniques for DNN-based SVS to synthesize singing voices with correct pitch even if the training data contains out-of-tune phrases. Experimental results show that our system can synthesize a singing voice with better timing, more natural vibrato, and correct pitch, and that it achieves better mean opinion scores in subjective evaluation tests.
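To make the notion of "out-of-tune" concrete, the sketch below measures how far a sung F0 track deviates from the pitch written in the score, in cents. It is only an illustration of the quantity involved, not the paper's pitch correction technique: the function names, the 440 Hz reference for A4, and the convention that unvoiced frames are stored as 0 are all assumptions made for this example.

```python
import math

A4_HZ = 440.0   # assumed reference tuning for A4
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(note):
    """Equal-tempered frequency in Hz for a MIDI note number."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def cents_off(f0_hz, note):
    """Signed deviation of f0_hz from the score note, in cents."""
    return 1200.0 * math.log2(f0_hz / midi_to_hz(note))


def mean_cents_off(f0_track, note):
    """Average deviation over voiced frames (hypothetical convention: 0 = unvoiced)."""
    voiced = [f for f in f0_track if f > 0]
    if not voiced:
        return 0.0
    return sum(cents_off(f, note) for f in voiced) / len(voiced)
```

A mean deviation near zero suggests the phrase is in tune with the score, while a consistently positive or negative value indicates a sharp or flat phrase.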
Pages: 2803 - 2815
Page count: 13
Related papers
50 in total
  • [11] Neural network-based voice quality measurement technique
    Tarraf, A
    Meyers, M
    IEEE INTERNATIONAL SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS, PROCEEDINGS, 1999, : 375 - 381
  • [12] A singing voice synthesis system based on sinusoidal modeling
    Macon, MW
    JensenLink, L
    Oliverio, J
    Clements, MA
    George, EB
    1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS - VOL V: STATISTICAL SIGNAL AND ARRAY PROCESSING, APPLICATIONS, 1997, : 435 - 438
  • [13] An HMM-based Singing Voice Synthesis System
    Saino, Keijiro
    Zen, Heiga
    Nankaku, Yoshihiko
    Lee, Akinobu
    Tokuda, Keiichi
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 2274 - 2277
  • [14] Deep Neural Network-Based Intrusion Detection System through PCA
    Alotaibi, Shoayee Dlaim
    Yadav, Kusum
    Aledaily, Arwa N.
    Alkwai, Lulwah M.
    Dafhalla, Alaa Kamal Yousef
    Almansour, Shahad
    Lingamuthu, Velmurugan
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2022, 2022
  • [15] Deep Neural Network-Based System for Autonomous Navigation in Paddy Field
    Adhikari, Shyam P.
    Kim, Gookhwan
    Kim, Hyongsuk
    IEEE ACCESS, 2020, 8 : 71272 - 71278
  • [16] Intrusion detection system: a deep neural network-based concatenated approach
    Sharma, Hidangmayum Satyajeet
    Singh, Khundrakpam Johnson
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (10): : 13918 - 13948
  • [17] Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling
    Yi, Yuan-Hao
    Ai, Yang
    Ling, Zhen-Hua
    Dai, Li-Rong
    INTERSPEECH 2019, 2019, : 2593 - 2597
  • [18] GENERATIVE MOMENT MATCHING NETWORK-BASED RANDOM MODULATION POST-FILTER FOR DNN-BASED SINGING VOICE SYNTHESIS AND NEURAL DOUBLE-TRACKING
    Tamaru, Hiroki
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7070 - 7074
  • [19] FAST AND HIGH-QUALITY SINGING VOICE SYNTHESIS SYSTEM BASED ON CONVOLUTIONAL NEURAL NETWORKS
    Nakamura, Kazuhiro
    Takaki, Shinji
    Hashimoto, Kei
    Oura, Keiichiro
    Nankaku, Yoshihiko
    Tokuda, Keiichi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7239 - 7243
  • [20] Multichannel Singing Voice Separation by Deep Neural Network Informed DOA Constrained CMNMF
    Munoz-Montoro, Antonio J.
    Politis, Archontis
    Drossos, Konstantinos
    Carabias-Orti, Julio J.
    2020 IEEE 22ND INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), 2020,