Towards Understanding Attention-Based Speech Recognition Models

Cited by: 4
Authors
Qin, Chu-Xiong [1 ]
Qu, Dan [1 ]
Affiliations
[1] PLA Strateg Support Force Informat Engn Univ, Dept Informat & Syst Engn, Zhengzhou 450001, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Attention-based model; t-distributed stochastic neighbor embedding; canonical correlation analysis; neural networks
DOI
10.1109/ACCESS.2020.2970758
CLC classification
TP [Automation and computer technology]
Subject classification
0812
Abstract
Although attention-based speech recognition has achieved promising performance, the intermediate representations it learns remain a black box. In this paper, we visually show and explain the continuous encoder outputs. We propose a human-intervened forced-alignment method to obtain frame-level labels for t-distributed stochastic neighbor embedding (t-SNE), and use them to better understand the attention mechanism and the recurrent representations. In addition, we combine t-SNE and canonical correlation analysis (CCA) to analyze the training dynamics of phones in the attention-based model. Experiments are carried out on TIMIT and WSJ. The aligned embeddings of the encoder outputs form sequence manifolds of the ground-truth labels. The t-SNE embeddings visually show what representations the encoder has shaped and how the attention mechanism works for speech recognition. Comparisons across different models, layers, and utterance lengths show that the manifolds are more clearly shaped for outputs from deeper encoder layers, for shorter utterances, and for better-performing models. We also observe that the same symbols from different utterances tend to gather at similar positions, which supports the consistency of our method. Further comparisons between different training epochs of the model are made using t-SNE and CCA. The results show that plosive and nasal/flap phones converge quickly, while long vowel phones converge slowly.
Pages: 24358-24369
Number of pages: 12