Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data

Authors
Begus, Gasper [1]
Zhou, Alan [1]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
Source
Keywords
REPRESENTATIONS; GENERATION;
DOI
10.21437/Interspeech.2022-11219
Chinese Library Classification (CLC) code
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
Human speakers encode information into raw speech which is then decoded by the listeners. This complex relationship between encoding (production) and decoding (perception) is often modeled separately. Here, we test how encoding and decoding of lexical semantic information can emerge automatically from raw speech in unsupervised generative deep convolutional networks that combine the production and perception principles of speech. We introduce, to our knowledge, the most challenging objective in unsupervised lexical learning: a network that must learn unique representations for lexical items with no direct access to training data. We train several models (ciwGAN and fiwGAN [1]) and test how the networks classify acoustic lexical items in unobserved test data. Strong evidence in favor of lexical learning and a causal relationship between latent codes and meaningful sublexical units emerge. The architecture that combines the production and perception principles is thus able to learn to decode unique information from raw acoustic data without accessing real training data directly. We propose a technique to explore lexical (holistic) and sublexical (featural) learned representations in the classifier network. The results bear implications for unsupervised speech technology, as well as for unsupervised semantic modeling as language models increasingly bypass text and operate from raw acoustics.
Pages: 5298-5302
Page count: 5