WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Cited by: 2
Authors
Rekimoto, Jun [1 ,2 ]
Affiliations
[1] Univ Tokyo, 7-3-1 Hongo,Bunkyo Ku, Tokyo, Japan
[2] Sony Comp Sci Labs Kyoto, 13-1 Hontoro Cho,Shimogyo Ku, Kyoto, Kyoto, Japan
Keywords
speech interaction; whispered voice; whispered voice conversion; silent speech; artificial intelligence; neural networks; self-supervised learning;
DOI
10.1145/3544548.3580706
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline code
0812;
Abstract
Recognizing whispered speech and converting it to normal speech creates many possibilities for speech interaction. Because the sound pressure of whispered speech is significantly lower than that of normal speech, it can be used as a semi-silent speech interaction in public places without being audible to others. Converting whispers to normal speech also improves speech quality for people with speech or hearing impairments. However, conventional speech conversion techniques either do not provide sufficient conversion quality or require speaker-dependent datasets consisting of pairs of whispered and normal speech utterances. To address these problems, we propose WESPER, a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning. WESPER consists of a speech-to-unit (STU) encoder, which generates hidden speech units common to both whispered and normal speech, and a unit-to-speech (UTS) decoder, which reconstructs speech from the encoded speech units. Unlike existing methods, this conversion is user-independent and does not require a paired dataset of whispered and normal speech. The UTS decoder can reconstruct speech in any target speaker's voice from the speech units, requiring only unlabeled speech data from that speaker. We confirmed that the quality of speech converted from a whisper improved while its natural prosody was preserved. Additionally, we confirmed the effectiveness of the proposed approach for speech reconstruction for people with speech or hearing disabilities.
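The two-stage encoder/decoder design described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the actual system uses a self-supervised speech encoder and a neural vocoder, whereas here plain frame-feature vectors stand in for audio, a nearest-codebook lookup stands in for the STU encoder's discrete speech units, and a per-unit embedding table stands in for the UTS decoder's target-speaker synthesis. All names and shapes below are illustrative assumptions.

```python
import numpy as np

def speech_to_units(frames, codebook):
    """STU encoder stand-in: quantize each frame feature to its nearest
    codebook entry, yielding a sequence of discrete speech-unit IDs that
    are (ideally) shared between whispered and normal speech."""
    # frames: (T, D) per-frame features; codebook: (K, D) unit centroids
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)          # (T,) unit IDs

def units_to_speech(unit_ids, speaker_frames):
    """UTS decoder stand-in: map each unit ID to the target speaker's
    learned frame representation (a vocoder would synthesize audio)."""
    return speaker_frames[unit_ids]      # (T, D) target-voice frames

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))       # 8 hypothetical shared speech units
target_voice = rng.normal(size=(8, 4))   # per-unit target-speaker frames

# Simulated whisper: noisy versions of unit centroids 1, 1, 5, 2.
whisper_frames = codebook[[1, 1, 5, 2]] + 0.01 * rng.normal(size=(4, 4))
units = speech_to_units(whisper_frames, codebook)
out = units_to_speech(units, target_voice)
print(units.tolist())  # → [1, 1, 5, 2]: units recovered despite the noise
```

The key property mirrored here is zero-shot operation: the quantization step needs no whisper/normal paired data, and retargeting to a new voice only requires replacing the `target_voice` table, which in the real system is learned from unlabeled speech of the target speaker.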
Pages: 12