Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

Cited by: 1
Authors
Kocour, Martin [1]
Zmolikova, Katerina [1]
Ondel, Lucas [1, 3]
Svec, Jan [1]
Delcroix, Marc [2]
Ochiai, Tsubasa [2]
Burget, Lukas [1]
Cernocky, Jan Honza [1]
Affiliations
[1] Brno Univ Technol, Fac Informat Technol, Speech FIT, Brno, Czech Republic
[2] NTT Corp, Tokyo, Japan
[3] Univ Paris Saclay, LISN, CNRS, Gif Sur Yvette, France
Source
Keywords
Multi-talker speech recognition; Permutation invariant training; Factorial hidden Markov models; Deep neural networks
DOI
10.21437/Interspeech.2022-10406
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In typical multi-talker speech recognition systems, a neural network-based acoustic model predicts senone state posteriors for each speaker. These are later used by a single-talker decoder, which is applied to each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers. We employ a joint decoder that can make use of this uncertainty together with higher-level language information. For this, we revisit decoding algorithms used in factorial generative models in early multi-talker speech recognition systems. In contrast with these early works, we replace the GMM acoustic model with a DNN, which provides greater modeling power and simplifies part of the inference. We demonstrate the advantage of joint decoding in proof-of-concept experiments on a mixed-TIDIGITS dataset.
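The joint decoding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' decoder: it assumes two speakers, each with a hypothetical HMM transition matrix (`log_A1`, `log_A2`), a uniform initial state distribution, and a hypothetical array `log_scores[t, i, j]` standing in for the joint acoustic score of speaker 1 being in state `i` and speaker 2 in state `j` at frame `t`. Under those assumptions, exact Viterbi decoding over the product state space finds the best pair of state sequences jointly.

```python
import numpy as np


def joint_viterbi(log_scores, log_A1, log_A2):
    """Exact Viterbi over the product state space of two speakers.

    log_scores: (T, S1, S2) joint log-scores per frame (assumed input).
    log_A1:     (S1, S1) log transition matrix of speaker 1 (assumed).
    log_A2:     (S2, S2) log transition matrix of speaker 2 (assumed).
    Returns (states_1, states_2, best_log_score).
    """
    T, S1, S2 = log_scores.shape
    n = S1 * S2

    # Joint transition: the two chains move independently, so the joint
    # log-probability is the sum of the per-speaker log-probabilities.
    # Axes before reshape: (prev_i, prev_j, cur_i, cur_j).
    log_A = (log_A1[:, None, :, None] + log_A2[None, :, None, :]).reshape(n, n)

    # Flatten each frame's (i, j) grid to a single joint-state index i * S2 + j.
    emit = log_scores.reshape(T, n)

    delta = emit[0].copy()              # uniform initial distribution assumed
    back = np.zeros((T, n), dtype=int)  # best predecessor per frame and state

    for t in range(1, T):
        cand = delta[:, None] + log_A   # cand[prev, cur]
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + emit[t]

    # Backtrace from the best final joint state.
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]

    # Split joint-state indices back into per-speaker state sequences.
    states_1, states_2 = np.unravel_index(path, (S1, S2))
    return states_1, states_2, float(delta.max())
```

The product space grows as `S1 * S2`, so this brute-force form is only practical for small state inventories such as a toy digit task; it is meant to show why a joint score table lets the decoder trade off the two speakers' hypotheses in a single search rather than decoding each output stream separately.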
Pages: 4955-4959
Number of pages: 5
Related papers
50 items in total
  • [1] Streaming Multi-talker Speech Recognition with Joint Speaker Identification
    Lu, Liang
    Kanda, Naoyuki
    Li, Jinyu
    Gong, Yifan
    [J]. INTERSPEECH 2021, 2021, : 1782 - 1786
  • [2] Real-Time Speech Recognition in a Multi-talker Reverberated Acoustic Scenario
    Rotili, Rudy
    Principi, Emanuele
    Squartini, Stefano
    Schuller, Bjoern
    [J]. ADVANCED INTELLIGENT COMPUTING THEORIES AND APPLICATIONS: WITH ASPECTS OF ARTIFICIAL INTELLIGENCE, 2012, 6839 : 379 - +
  • [3] A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition
    Tu, Yan-Hui
    Du, Jun
    Dai, Li-Rung
    Lee, Chin-Hui
    [J]. 2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [4] Modeling speech localization, talker identification, and word recognition in a multi-talker setting
    Josupeit, Angela
    Hohmann, Volker
    [J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2017, 142 (01): : 35 - 54
  • [5] ACOUSTIC MODELING FOR DISTANT MULTI-TALKER SPEECH RECOGNITION WITH SINGLE- AND MULTI-CHANNEL BRANCHES
    Kanda, Naoyuki
    Fujita, Yusuke
    Horiguchi, Shota
    Ikeshita, Rintaro
    Nagamatsu, Kenji
    Watanabe, Shinji
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6630 - 6634
  • [6] SURT 2.0: Advances in Transducer-Based Multi-Talker Speech Recognition
    Raj, Desh
    Povey, Daniel
    Khudanpur, Sanjeev
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 3800 - 3813
  • [7] Monaural multi-talker speech recognition using factorial speech processing models
    Khademian, Mahdi
    Homayounpour, Mohammad Mehdi
    [J]. SPEECH COMMUNICATION, 2018, 98 : 1 - 16
  • [8] END-TO-END MULTI-TALKER OVERLAPPING SPEECH RECOGNITION
    Tripathi, Anshuman
    Lu, Han
    Sak, Hasim
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6129 - 6133
  • [9] Streaming End-to-End Multi-Talker Speech Recognition
    Lu, Liang
    Kanda, Naoyuki
    Li, Jinyu
    Gong, Yifan
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 803 - 807
  • [10] Variational Loopy Belief Propagation for Multi-talker Speech Recognition
    Rennie, Steven J.
    Hershey, John R.
    Olsen, Peder A.
    [J]. INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 1367 - 1370