Double articulation analyzer with deep sparse autoencoder for unsupervised word discovery from speech signals

Cited by: 60
Authors
Taniguchi, Tadahiro [1]
Nakashima, Ryo [2]
Liu, Hailong [2]
Nagasaka, Shogo [2]
Affiliations
[1] Ritsumeikan Univ, Coll Informat Sci & Engn, Kusatsu, Japan
[2] Ritsumeikan Univ, Grad Sch Informat Sci & Engn, Kusatsu, Japan
Keywords
Bayesian nonparametrics; deep learning; speech recognition; unsupervised learning; word discovery; DRIVING BEHAVIOR; SEGMENTATION; ROBOTICS; MODEL;
DOI
10.1080/01691864.2016.1159981
Chinese Library Classification (CLC) number
TP24 [Robotics];
Discipline classification code
080202 ; 1405 ;
Abstract
Direct word discovery from audio speech signals is a challenging problem for a developmental robot. Human infants can discover words directly from speech signals. To understand this developmental capability through a constructive approach, it is important to build a machine learning system that can acquire knowledge about words and phonemes, i.e. a language model and an acoustic model, autonomously and without supervision. To this end, this paper proposes the nonparametric Bayesian double articulation analyzer (NPB-DAA) with the deep sparse autoencoder (DSAE). The NPB-DAA was previously proposed to achieve fully unsupervised direct word discovery from speech signals; although it outperformed pre-existing unsupervised learning methods, its performance was still unsatisfactory. In this paper, we integrate the NPB-DAA with the DSAE, a neural network model that can be trained in an unsupervised manner, and demonstrate its performance in an experiment on direct word discovery from auditory speech signals. The experiment shows that the combined method outperforms pre-existing unsupervised learning methods and achieves state-of-the-art performance. The proposed method also outperforms several standard speech recognizer-based methods that use true word dictionaries.
Pages: 770-783
Number of pages: 14
Related papers
50 in total
  • [31] From Speech Signals to Semantics - Tagging Performance at Acoustic, Phonetic and Word Levels
    Qian, Yao
    Ubale, Rutuja
    Lange, Patrick
    Evanini, Keelan
    Soong, Frank
    [J]. 2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 280 - 284
  • [32] Estimating Public Speaking Anxiety from Speech Signals Using Unsupervised Transfer Learning
    Feng, Kexin
    Yadav, Megha
    Sakib, Md Nazmus
    Behzadan, Amir
    Chaspari, Theodora
    [J]. 2019 7TH IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (IEEE GLOBALSIP), 2019,
  • [33] Unsupervised Pattern Discovery from Thematic Speech Archives Based on Multilingual Bottleneck Features
    Sung, Man-Ling
    Feng, Siyuan
    Lee, Tan
    [J]. 2018 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2018, : 1448 - 1455
  • [34] A FACTORIAL DEEP MARKOV MODEL FOR UNSUPERVISED DISENTANGLED REPRESENTATION LEARNING FROM SPEECH
    Khurana, Sameer
    Joty, Shafiq Rayhan
    Ali, Ahmed
    Glass, James
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6540 - 6544
  • [35] AN ITERATIVE DEEP LEARNING FRAMEWORK FOR UNSUPERVISED DISCOVERY OF SPEECH FEATURES AND LINGUISTIC UNITS WITH APPLICATIONS ON SPOKEN TERM DETECTION
    Chung, Cheng-Tao
    Tsai, Cheng-Yu
    Lu, Hsiang-Hung
    Liu, Chia-Hsiang
    Lee, Hung-yi
    Lee, Lin-Shan
    [J]. 2015 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2015, : 245 - 251
  • [36] Unsupervised deep learning of foreground objects from low-rank and sparse dataset
    Takeda, Keita
    Sakai, Tomoya
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 240
  • [37] Unsupervised phoneme and word acquisition from continuous speech based on a hierarchical probabilistic generative model
    Nagano M.
    Nakamura T.
    [J]. Advanced Robotics, 2023, 37 (19) : 1253 - 1265
  • [38] Semantic Retrieval of Personal Photos using a Deep Autoencoder Fusing Visual Features with Speech Annotations Represented as Word/Paragraph Vectors
    Lu, Hung-tsung
    Liou, Yuan-ming
    Lee, Hung-yi
    Lee, Lin-shan
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 140 - 144
  • [39] A multi-modal unsupervised fault detection system based on power signals and thermal imaging via deep AutoEncoder neural network
    Cordoni, Francesco
    Bacchiega, Gianluca
    Bondani, Giulio
    Radu, Robert
    Muradore, Riccardo
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2022, 110
  • [40] Random Deep Belief Networks for Recognizing Emotions from Speech Signals
    Wen, Guihua
    Li, Huihui
    Huang, Jubing
    Li, Danyang
    Xun, Eryang
    [J]. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2017, 2017