Combining Vocal Tract Length Normalization With Hierarchical Linear Transformations

Cited by: 2

Authors:

Saheer, Lakshmi [1,2]

Yamagishi, Junichi [3,4]

Garner, Philip N. [1]

Dines, John [1]

Affiliations:

[1] Idiap Res Inst, CH-1920 Martigny, Switzerland

[2] Ecole Polytech Fed Lausanne, CH-1015 Lausanne, Switzerland

[3] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland

[4] Natl Inst Informat, Tokyo 1018430, Japan

Source:

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING | 2014 / Vol. 8 / No. 2

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC);

Keywords:

Constrained structural maximum a posteriori linear regression; hidden Markov models; speaker adaptation; statistical parametric speech synthesis; vocal tract length normalization; SPEAKER ADAPTATION; SPEECH;

DOI:

10.1109/JSTSP.2013.2295554

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Recent research has demonstrated the effectiveness of vocal tract length normalization (VTLN) as a rapid adaptation technique for statistical parametric speech synthesis. VTLN produces speech with naturalness preferable to that of MLLR-based adaptation techniques, being much closer in quality to that generated by the original average voice model. However, with only a single parameter, VTLN captures very few speaker-specific characteristics compared to linear-transform-based adaptation techniques. This paper shows that the merits of VTLN can be combined with those of linear-transform-based adaptation in a hierarchical Bayesian framework, where VTLN is used as the prior information. A novel technique for propagating the gender and age information captured by the VTLN transform into constrained structural maximum a posteriori linear regression (CSMAPLR) adaptation is presented. This paper also compares the proposed technique to other combination techniques. Experiments are performed on both matched and mismatched training and test conditions, including gender, age, and recording environments. Text-to-speech (TTS) synthesis experiments show that the resulting transformation produces improved speech quality with better naturalness and intelligibility (similar to the VTLN transformation) compared to the CSMAPLR transformation, especially when the quantity of adaptation data is very limited. With more parameters to capture speaker characteristics, the proposed method performs better in speaker similarity than VTLN in mismatched conditions. Hence, the proposed method combines the quality and intelligibility of VTLN with the speaker similarity of CSMAPLR, especially in mismatched training and test conditions. Experiments are also performed with an automatic speech recognition (ASR) system in the same unified framework as synthesis, to show that the techniques developed for TTS can be plugged into ASR to improve performance.

Pages: 262-272

Page count: 11
