Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data

Cited by: 0

Authors:

Begus, Gasper ^{[1]}

Zhou, Alan ^{[1]}

Affiliations:

[1] Univ Calif Berkeley, Berkeley, CA 94720 USA

Source:

INTERSPEECH 2022 | 2022

Keywords:

REPRESENTATIONS; GENERATION;

DOI:

10.21437/Interspeech.2022-11219

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206; 082403;

Abstract:

Human speakers encode information into raw speech which is then decoded by the listeners. This complex relationship between encoding (production) and decoding (perception) is often modeled separately. Here, we test how encoding and decoding of lexical semantic information can emerge automatically from raw speech in unsupervised generative deep convolutional networks that combine the production and perception principles of speech. We introduce, to our knowledge, the most challenging objective in unsupervised lexical learning: a network that must learn unique representations for lexical items with no direct access to training data. We train several models (ciwGAN and fiwGAN [1]) and test how the networks classify acoustic lexical items in unobserved test data. Strong evidence in favor of lexical learning and a causal relationship between latent codes and meaningful sublexical units emerge. The architecture that combines the production and perception principles is thus able to learn to decode unique information from raw acoustic data without accessing real training data directly. We propose a technique to explore lexical (holistic) and sublexical (featural) learned representations in the classifier network. The results bear implications for unsupervised speech technology, as well as for unsupervised semantic modeling as language models increasingly bypass text and operate from raw acoustics.
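The contrast between lexical (holistic) and sublexical (featural) representations can be illustrated with a minimal sketch. It assumes, following the description of ciwGAN and fiwGAN in [1], that ciwGAN conditions generation on one-hot codes while fiwGAN uses binary feature codes. The function names, the 0.5 threshold, and the readout rules are hypothetical illustrations, not the authors' implementation.

```python
from itertools import product


def one_hot_codes(num_classes):
    """ciwGAN-style codes (assumed): one one-hot vector per lexical class."""
    return [tuple(1 if i == k else 0 for i in range(num_classes))
            for k in range(num_classes)]


def binary_codes(num_bits):
    """fiwGAN-style codes (assumed): every combination of binary features."""
    return list(product((0, 1), repeat=num_bits))


def decode_one_hot(scores):
    """Holistic readout: index of the largest classifier output."""
    return max(range(len(scores)), key=lambda i: scores[i])


def decode_binary(scores, threshold=0.5):
    """Featural readout: threshold each classifier output independently.

    The 0.5 threshold is an illustrative assumption.
    """
    return tuple(1 if s > threshold else 0 for s in scores)
```

Under these assumptions, eight lexical classes need an 8-dimensional one-hot code but only 3 binary features, since each combination of feature values can name a distinct class. Individual features in the binary code may then align with sublexical properties shared across several items.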

Pages: 5298-5302

Page count: 5

