Improving Generative Adversarial Network-based Vocoding through Multi-scale Convolution

Authors
Li, Wanting [1]
Chen, Yiting [1]
Tang, Buzhou [2, 3]
Affiliations
[1] Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, People's Republic of China
[2] Harbin Institute of Technology (Shenzhen), Shenzhen, People's Republic of China
[3] Pengcheng Laboratory, Shenzhen, People's Republic of China
Keywords
Speech generation; neural vocoder; speech synthesis
DOI
10.1145/3610532
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Vocoding is a sub-process of the text-to-speech task that aims to generate audio from intermediate representations between text and audio. Several recent works have shown that generative adversarial network (GAN) based vocoders can generate high-quality audio. Although GAN-based neural vocoders generate audio faster than autoregressive vocoders, their audio fidelity still cannot compete with ground-truth samples. One major cause of the degraded audio quality and blurred spectrograms is the average pooling layers in the discriminator. Because the multi-scale discriminator commonly used by recent GAN-based vocoders applies several average pooling layers to capture different frequency bands, we believe it is crucial to prevent high-frequency information from being lost during average pooling. This article proposes MSCGAN, which addresses the above-mentioned problem and achieves higher-fidelity speech synthesis. We demonstrate that replacing average pooling with a multi-scale convolution architecture effectively retains high-frequency features and thus forces the generator to recover audio details in the time and frequency domains. Compared with other state-of-the-art GAN-based vocoders, MSCGAN produces competitive audio with higher spectrogram clarity and a higher mean opinion score (MOS) in subjective human evaluation.
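The abstract's claim that average pooling discards high-frequency information can be illustrated with a toy example. The sketch below is not the MSCGAN architecture, whose layer details are not given in the abstract; the function names and the hand-picked kernel `[0.5, -0.5]` are assumptions chosen purely for illustration. It shows that average pooling is a strided convolution with fixed, equal weights, and that such weights cancel a signal at the Nyquist frequency, whereas a strided convolution with different (in practice, learned) weights can keep it.

```python
def avg_pool1d(x, kernel=2, stride=2):
    """Average pooling over a 1-D sequence (no padding)."""
    return [sum(x[i:i + kernel]) / kernel
            for i in range(0, len(x) - kernel + 1, stride)]


def strided_conv1d(x, weights, stride=2):
    """Strided 1-D convolution (cross-correlation, no padding, no bias)."""
    k = len(weights)
    return [sum(w * v for w, v in zip(weights, x[i:i + k]))
            for i in range(0, len(x) - k + 1, stride)]


# Toy signal alternating at the Nyquist frequency, i.e. the highest
# frequency that a given sampling rate can represent.
high_freq = [1.0, -1.0] * 4

# Average pooling: every window averages +1 and -1, so the signal vanishes.
pooled = avg_pool1d(high_freq)

# Hypothetical kernel (an assumption, standing in for learned weights):
# it responds to the sample-to-sample difference, so the signal survives.
convolved = strided_conv1d(high_freq, weights=[0.5, -0.5])

print(pooled)     # [0.0, 0.0, 0.0, 0.0]
print(convolved)  # [1.0, 1.0, 1.0, 1.0]
```

Both operations halve the sequence length, so either could serve as the downsampling step between discriminator scales; the difference is that the convolution's weights are free parameters rather than a fixed low-pass average.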
Pages: 10