Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System

Cited by: 14
Authors
Hono, Yukiya [1]
Hashimoto, Kei [1, 2]
Oura, Keiichiro [2]
Nankaku, Yoshihiko [3]
Tokuda, Keiichi [4]
Affiliations
[1] Nagoya Inst Technol, Comp Sci, Nagoya, Aichi 4668555, Japan
[2] Nagoya Inst Technol, Comp Sci & Engn, Nagoya, Aichi 4668555, Japan
[3] Nagoya Inst Technol, Dept Elect & Elect Engn, Nagoya, Aichi 4668555, Japan
[4] Nagoya Inst Technol, Elect & Elect Engn, Nagoya, Aichi 4668555, Japan
Keywords
Acoustics; Hidden Markov models; Feature extraction; Training; Predictive models; Music; Training data; Automatic pitch correction; neural network; singing voice synthesis; timing modeling; vibrato modeling;
DOI
10.1109/TASLP.2021.3104165
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
This paper presents Sinsy, a deep neural network (DNN)-based singing voice synthesis (SVS) system. In recent years, DNNs have been utilized in statistical parametric SVS systems, and DNN-based SVS systems have demonstrated better performance than conventional hidden Markov model-based ones. SVS systems are required to synthesize a singing voice with pitch and timing that strictly follow a given musical score. Additionally, singing expressions that are not described on the musical score, such as vibrato and timing fluctuations, should be reproduced. The proposed system is composed of four modules: a time-lag model, a duration model, an acoustic model, and a vocoder, and singing voices can be synthesized taking these characteristics of singing voices into account. To better model a singing voice, the proposed system incorporates improved approaches to modeling pitch and vibrato and better training criteria into the acoustic model. In addition, we incorporated PeriodNet, a non-autoregressive neural vocoder with robustness for the pitch, into our systems to generate a high-fidelity singing voice waveform. Moreover, we propose automatic pitch correction techniques for DNN-based SVS to synthesize singing voices with correct pitch even if the training data has out-of-tune phrases. Experimental results show our system can synthesize a singing voice with better timing, more natural vibrato, and correct pitch, and it can achieve better mean opinion scores in subjective evaluation tests.
Pages: 2803-2815
Number of pages: 13
Related papers
50 in total
  • [1] Recent Development of the DNN-based Singing Voice Synthesis System - Sinsy
    Hono, Yukiya
    Murata, Shumma
    Nakamura, Kazuhiro
    Hashimoto, Kei
    Oura, Keiichiro
    Nankaku, Yoshihiko
    Tokuda, Keiichi
    [J]. 2018 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2018: 1003-1009
  • [2] Singing voice synthesis based on deep neural networks
    Nishimura, Masanari
    Hashimoto, Kei
    Oura, Keiichiro
    Nankaku, Yoshihiko
    Tokuda, Keiichi
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016: 2478-2482
  • [3] Singing Voice Separation Based on Deep Regression Neural Network
    Yang, Shuqian
    Zhang, Wei-Qiang
    [J]. 2019 IEEE 19TH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY (ISSPIT 2019), 2019
  • [4] Korean Singing Voice Synthesis System based on an LSTM Recurrent Neural Network
    Kim, Juntae
    Choi, Heejin
    Park, Jinuk
    Hahn, Minsoo
    Kim, Sangjin
    Kim, Jong-Jin
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018: 1551-1555
  • [5] PROXIMAL DEEP RECURRENT NEURAL NETWORK FOR MONAURAL SINGING VOICE SEPARATION
    Yuan, Weitao
    Wang, Shengbei
    Li, Xiangrui
    Unoki, Masashi
    Wang, Wenwu
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 286-290
  • [6] Deep Neural Network-based Machine Translation System Combination
    Zhou, Long
    Zhang, Jiajun
    Kang, Xiaomian
    Zong, Chengqing
    [J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2020, 19 (05)
  • [7] LINEAR-SCALE FILTERBANK FOR DEEP NEURAL NETWORK-BASED VOICE ACTIVITY DETECTION
    Jung, Youngmoon
    Kim, Younggwan
    Lim, Hyungjun
    Kim, Hoirin
    [J]. 2017 20TH CONFERENCE OF THE ORIENTAL CHAPTER OF THE INTERNATIONAL COORDINATING COMMITTEE ON SPEECH DATABASES AND SPEECH I/O SYSTEMS AND ASSESSMENT (O-COCOSDA), 2017: 43-47
  • [8] Adversarial Attack and Defense on Deep Neural Network-Based Voice Processing Systems: An Overview
    Chen, Xiaojiao
    Li, Sheng
    Huang, Hao
    [J]. APPLIED SCIENCES-BASEL, 2021, 11 (18)
  • [9] A wavelet- and neural network-based voice system for a smart wheelchair control
    AL-Rousan, M.
    Assaleh, K.
    [J]. JOURNAL OF THE FRANKLIN INSTITUTE-ENGINEERING AND APPLIED MATHEMATICS, 2011, 348 (01): 90-100
  • [10] Network Security Enhanced with Deep Neural Network-Based Intrusion Detection System
    Alrayes, Fatma S.
    Zakariah, Mohammed
    Amin, Syed Umar
    Khan, Zafar Iqbal
    Alqurni, Jehad Saad
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2024, 80 (01): 1457-1490