Double articulation analyzer with deep sparse autoencoder for unsupervised word discovery from speech signals

Cited by: 60

Authors:

Taniguchi, Tadahiro [1]

Nakashima, Ryo [2]

Liu, Hailong [2]

Nagasaka, Shogo [2]

Affiliations:

[1] Ritsumeikan Univ, Coll Informat Sci & Engn, Kusatsu, Japan

[2] Ritsumeikan Univ, Grad Sch Informat Sci & Engn, Kusatsu, Japan

Source:

ADVANCED ROBOTICS | 2016 / Vol. 30 / Issue 11-12

Keywords:

Bayesian nonparametrics; deep learning; speech recognition; unsupervised learning; word discovery; DRIVING BEHAVIOR; SEGMENTATION; ROBOTICS; MODEL;

DOI:

10.1080/01691864.2016.1159981

Chinese Library Classification (CLC) number:

TP24 [Robotics];

Discipline classification code:

080202 ; 1405 ;

Abstract:

Direct word discovery from audio speech signals is a very difficult and challenging problem for a developmental robot. Human infants are able to discover words directly from speech signals, and, to understand human infants' developmental capability using a constructive approach, it is very important to build a machine learning system that can acquire knowledge about words and phonemes, i.e. a language model and an acoustic model, autonomously in an unsupervised manner. To achieve this, the nonparametric Bayesian double articulation analyzer (NPB-DAA) with the deep sparse autoencoder (DSAE) is proposed in this paper. The NPB-DAA has been proposed to achieve totally unsupervised direct word discovery from speech signals. However, the performance was still unsatisfactory, although it outperformed pre-existing unsupervised learning methods. In this paper, we integrate the NPB-DAA with the DSAE, which is a neural network model that can be trained in an unsupervised manner, and demonstrate its performance through an experiment about direct word discovery from auditory speech signals. The experiment shows that the combined method, the NPB-DAA with the DSAE, outperforms pre-existing unsupervised learning methods, and shows state-of-the-art performance. It is also shown that the proposed method outperforms several standard speech recognizer-based methods with true word dictionaries.

Pages: 770-783

Number of pages: 14

50 records in total

[31] From Speech Signals to Semantics - Tagging Performance at Acoustic, Phonetic and Word Levels
Qian, Yao
Ubale, Rutuja
Lange, Patrick
Evanini, Keelan
Soong, Frank
[J]. 2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 280 - 284
[32] Estimating Public Speaking Anxiety from Speech Signals Using Unsupervised Transfer Learning
Feng, Kexin
Yadav, Megha
Sakib, Md Nazmus
Behzadan, Amir
Chaspari, Theodora
[J]. 2019 7TH IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (IEEE GLOBALSIP), 2019,
[33] Unsupervised Pattern Discovery from Thematic Speech Archives Based on Multilingual Bottleneck Features
Sung, Man-Ling
Feng, Siyuan
Lee, Tan
[J]. 2018 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2018, : 1448 - 1455
[34] A FACTORIAL DEEP MARKOV MODEL FOR UNSUPERVISED DISENTANGLED REPRESENTATION LEARNING FROM SPEECH
Khurana, Sameer
Joty, Shafiq Rayhan
Ali, Ahmed
Glass, James
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6540 - 6544
[35] AN ITERATIVE DEEP LEARNING FRAMEWORK FOR UNSUPERVISED DISCOVERY OF SPEECH FEATURES AND LINGUISTIC UNITS WITH APPLICATIONS ON SPOKEN TERM DETECTION
Chung, Cheng-Tao
Tsai, Cheng-Yu
Lu, Hsiang-Hung
Liu, Chia-Hsiang
Lee, Hung-yi
Lee, Lin-Shan
[J]. 2015 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2015, : 245 - 251
[36] Unsupervised deep learning of foreground objects from low-rank and sparse dataset
Takeda, Keita
Sakai, Tomoya
[J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 240
[37] Unsupervised phoneme and word acquisition from continuous speech based on a hierarchical probabilistic generative model
Nagano M.
Nakamura T.
[J]. Advanced Robotics, 2023, 37 (19) : 1253 - 1265
[38] Semantic Retrieval of Personal Photos using a Deep Autoencoder Fusing Visual Features with Speech Annotations Represented as Word/Paragraph Vectors
Lu, Hung-tsung
Liou, Yuan-ming
Lee, Hung-yi
Lee, Lin-shan
[J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 140 - 144
[39] A multi-modal unsupervised fault detection system based on power signals and thermal imaging via deep AutoEncoder neural network
Cordoni, Francesco
Bacchiega, Gianluca
Bondani, Giulio
Radu, Robert
Muradore, Riccardo
[J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2022, 110
[40] Random Deep Belief Networks for Recognizing Emotions from Speech Signals
Wen, Guihua
Li, Huihui
Huang, Jubing
Li, Danyang
Xun, Eryang
[J]. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2017, 2017
