Improving Generative Adversarial Network-based Vocoding through Multi-scale Convolution

Cited by: 0

Authors:

Li, Wanting [1]

Chen, Yiting [1]

Tang, Buzhou [2,3]

Affiliations:

[1] Harbin Inst Technol Shenzhen, Shenzhen, Guangdong, Peoples R China

[2] Harbin Inst Technol Shenzhen, Shenzhen, Peoples R China

[3] Pengcheng Lab, Shenzhen, Peoples R China

Source:

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING | 2023, Vol. 22, No. 9

Keywords:

Speech generation; neural vocoder; speech synthesis

DOI:

10.1145/3610532

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Vocoding is a sub-process of the text-to-speech task that generates audio from intermediate representations between text and audio. Several recent works have shown that generative adversarial network (GAN) based vocoders can generate high-quality audio. Although GAN-based neural vocoders generate audio faster than autoregressive vocoders, their audio fidelity still cannot match that of ground-truth samples. One major cause of the degraded audio quality and blurred spectrograms is the average pooling layers in the discriminator. Because the multi-scale discriminator commonly used by recent GAN-based vocoders applies several average pooling layers to capture different frequency bands, we believe it is crucial to prevent high-frequency information from being lost in the average pooling process. This article proposes MSCGAN, which addresses this problem and achieves higher-fidelity speech synthesis. We demonstrate that substituting the average pooling process with a multi-scale convolution architecture effectively retains high-frequency features and thus forces the generator to recover audio details in the time and frequency domains. Compared with other state-of-the-art GAN-based vocoders, MSCGAN produces competitive audio with higher spectrogram clarity and a higher mean opinion score in subjective human evaluation.
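The motivation stated in the abstract, that average pooling can discard high-frequency information, can be illustrated with a toy example. The sketch below is not the MSCGAN architecture; the function names, the kernel size of 2, the stride of 2, and the hand-picked convolution weights are all assumptions chosen for illustration (in a real discriminator the convolution weights would be learned).

```python
def avg_pool1d(x, kernel=2, stride=2):
    """Average pooling: a strided filter with fixed, uniform weights."""
    return [sum(x[i:i + kernel]) / kernel
            for i in range(0, len(x) - kernel + 1, stride)]


def strided_conv1d(x, weights, stride=2):
    """Strided 1-D convolution (cross-correlation) with arbitrary weights."""
    k = len(weights)
    return [sum(w * v for w, v in zip(weights, x[i:i + k]))
            for i in range(0, len(x) - k + 1, stride)]


# Toy signals (assumed for illustration only).
high = [1.0, -1.0] * 4   # alternates every sample: highest representable frequency
low = [1.0] * 8          # constant: lowest frequency

pooled_high = avg_pool1d(high)   # neighbouring samples cancel out
pooled_low = avg_pool1d(low)     # the constant signal passes unchanged

# Hand-picked stand-in for learned weights: a difference filter
# that responds to the alternating pattern instead of cancelling it.
conv_high = strided_conv1d(high, [0.5, -0.5])

print(pooled_high)
print(pooled_low)
print(conv_high)
```

In this toy case the pooled high-frequency signal collapses to zeros while the low-frequency signal survives, whereas a strided convolution with suitable weights still produces a non-zero response to the high-frequency input at the same downsampling rate.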


Pages: 10

