Artificial vocal learning guided by speech recognition: What it may tell us about how children learn to speak

Cited by: 1
Authors
Xu, Anqi [1]
van Niekerk, Daniel R. [2]
Gerazov, Branislav [3]
Krug, Paul Konstantin [4]
Birkholz, Peter [4]
Prom-on, Santitham [5]
Halliday, Lorna F. [2, 6]
Xu, Yi [2]
Affiliations
[1] Harbin Inst Technol, Sch Humanities & Social Sci, Shenzhen 518055, Peoples R China
[2] UCL, Dept Speech Hearing & Phonet Sci, London WC1E 6BT, England
[3] Ss Cyril & Methodius Univ Skopje, Fac Elect Engn & Informat Technol, Skopje 1000, RN, North Macedonia
[4] Tech Univ Dresden, Inst Acoust & Speech Commun, D-01062 Dresden, Germany
[5] King Mongkuts Univ Technol Thonburi, Comp Engn Dept, Bangkok 10140, Thailand
[6] Univ Cambridge, Cognit & Brain Sci Unit, Cambridge CB2 1TN, England
Keywords
Computational modelling of vocal learning; Phonological perception; Coarticulation; Speech acquisition; Articulatory synthesis; SELF-ORGANIZATION; VOWEL ACQUISITION; MIRROR NEURONS; INFANTS VOWEL; MOTOR; IMITATION; MODEL; LANGUAGE; REPRESENTATIONS; COARTICULATION
DOI
10.1016/j.wocn.2024.101338
Chinese Library Classification (CLC) number
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
It has long been a mystery how children learn to speak without formal instruction. Previous research has used computational modelling to help solve the mystery by simulating vocal learning with direct imitation or caregiver feedback, but has encountered difficulty in overcoming the speaker normalisation problem, namely, discrepancies between children's vocalisations and those of adults due to age-related anatomical differences. Here we show that vocal learning can be successfully simulated via recognition-guided vocal exploration without explicit speaker normalisation. We trained an articulatory synthesiser with three-dimensional vocal tract models of an adult and two child configurations of different ages to learn monosyllabic English words consisting of CVC syllables, based on coarticulatory dynamics and two kinds of auditory feedback: (i) acoustic features to simulate universal phonetic perception (or direct imitation), and (ii) a deep-learning-based speech recogniser to simulate native-language phonological perception. Native listeners were invited to evaluate the learned synthetic speech, with natural speech as a baseline reference. Results show that the English words trained with the speech recogniser were more intelligible than those trained with acoustic features, sometimes close to natural speech. The successful simulation of vocal learning in this study suggests that a combination of coarticulatory dynamics and native-language phonological perception may also be critical for real-life vocal production learning. (c) 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Pages: 35