Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data

Authors
Begus, Gasper [1]
Zhou, Alan [1]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
Source
Keywords
REPRESENTATIONS; GENERATION
DOI
10.21437/Interspeech.2022-11219
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Human speakers encode information into raw speech, which listeners then decode. This complex relationship between encoding (production) and decoding (perception) is often studied by modeling the two sides separately. Here, we test how encoding and decoding of lexical semantic information can emerge automatically from raw speech in unsupervised generative deep convolutional networks that combine the production and perception principles of speech. We introduce what is, to our knowledge, the most challenging objective in unsupervised lexical learning: a network that must learn unique representations for lexical items with no direct access to training data. We train several models (ciwGAN and fiwGAN [1]) and test how the networks classify acoustic lexical items in unobserved test data. Strong evidence emerges both for lexical learning and for a causal relationship between latent codes and meaningful sublexical units. The architecture that combines the production and perception principles is thus able to learn to decode unique information from raw acoustic data without accessing real training data directly. We propose a technique to explore lexical (holistic) and sublexical (featural) learned representations in the classifier network. The results have implications for unsupervised speech technology, as well as for unsupervised semantic modeling as language models increasingly bypass text and operate from raw acoustics.
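As a toy illustration of the holistic versus featural distinction mentioned in the abstract, the sketch below contrasts two ways a latent code could label lexical items. It rests on assumptions not stated in this record: that a holistic code is a one-hot vector with one unit per item, and that a featural code is a binary vector whose bits can be shared across items. All function names are hypothetical, and this is not the authors' implementation.

```python
from itertools import product


def one_hot_codes(n_classes):
    """Holistic codes (assumed): one unit per lexical item, exactly one unit on."""
    return [tuple(1 if i == j else 0 for j in range(n_classes))
            for i in range(n_classes)]


def binary_codes(n_bits):
    """Featural codes (assumed): every combination of n_bits binary features."""
    return list(product((0, 1), repeat=n_bits))


def decode_one_hot(scores):
    """Pick the index of the highest-scoring unit as the predicted item."""
    return max(range(len(scores)), key=lambda i: scores[i])


def decode_binary(scores):
    """Threshold each unit at zero to recover a binary feature vector."""
    return tuple(1 if s > 0 else 0 for s in scores)


if __name__ == "__main__":
    # Eight items need eight one-hot units but only three binary features.
    print(len(one_hot_codes(8)), "one-hot codes of width", len(one_hot_codes(8)[0]))
    print(len(binary_codes(3)), "binary codes of width", len(binary_codes(3)[0]))
    # Hypothetical classifier outputs for a single audio clip.
    print(decode_one_hot([0.1, 2.0, -1.0]))
    print(decode_binary([0.5, -0.2, 1.3]))
```

Under these assumptions, a featural code covers the same number of items with fewer units, and items that share a bit value can be compared feature by feature, which is one way to probe sublexical structure in a classifier's output.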
Pages: 5298-5302
Number of pages: 5