CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing

Cited by: 2

Authors:

Wang, Tao [1,2]

Yi, Jiangyan [1]

Fu, Ruibo [1]

Tao, Jianhua [1]

Wen, Zhengqi [1]

Affiliations:

[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China

[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100190, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2022 / Vol. 30

Funding:

National Natural Science Foundation of China;

Keywords:

Speech processing; Decoding; Predictive models; Acoustics; Transfer learning; Training; Task analysis; Coarse-to-fine decoding; mask prediction; one-shot learning; text-based speech editing; text-to-speech; VOCODER; GENERATION; STRAIGHT; NETWORKS;

DOI:

10.1109/TASLP.2022.3190717

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Text-based speech editors allow speech to be edited through intuitive cut, copy, and paste operations, which speeds up the editing process. However, the major drawback of current systems is that the edited speech often sounds unnatural because of the cut-copy-paste operation. In addition, it is not obvious how to synthesize speech for a new word that does not appear in the transcript. This paper first proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet), which can resolve unnatural prosody in the edited region and synthesize speech for words unseen in the transcript. Second, to cover the various situations of text-based speech editing, we design three text-based operations based on CampNet: deletion, insertion, and replacement. Third, to synthesize speech corresponding to long text, a word-level autoregressive generation method is proposed. Fourth, we propose a speaker adaptation method for CampNet that uses only one sentence, and we explore the few-shot learning ability of CampNet, which provides a new idea for speech forgery tasks. Subjective and objective experiments on the VCTK and LibriTTS datasets show that the speech editing results based on CampNet are better than those of TTS technology, manual editing, and the VoCo method. Examples of generated speech can be found at https://hairuo55.github.io/CampNet. We also conduct detailed ablation experiments to explore the effect of the CampNet structure on its performance. Finally, the experiments show that speaker adaptation with only one sentence can further improve the naturalness of speech editing for one-shot learning.


Pages: 2241-2254

Number of pages: 14
