Directed Speech Separation for Automatic Speech Recognition of Long-form Conversational Speech

Cited by: 1
Authors
Paturi, Rohit [1]
Srinivasan, Sundararajan [1]
Kirchhoff, Katrin [1]
Romero, Daniel Garcia [1]
Affiliations
[1] Amazon AWS AI, Washington, DC 20052 USA
Source
Keywords
Speech Separation; Speaker embeddings; Spectral clustering; ASR; deep learning;
DOI
10.21437/Interspeech.2022-10843
Chinese Library Classification number
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
Many recent advances in speech separation target synthetic mixtures of short utterances with high degrees of overlap. Most of these approaches need an additional step to stitch the separated speech chunks together for long-form audio. Since most of them rely on permutation invariant training (PIT), the order of the separated speech chunks is nondeterministic, which makes it difficult to accurately stitch homogeneous speaker chunks for downstream tasks such as Automatic Speech Recognition (ASR). In addition, most of these models are trained on synthetic mixtures and do not generalize to real conversational data. In this paper, we propose a speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal using an over-clustering-based approach. This model naturally regulates the order of the separated chunks without the need for an additional stitching step. We also introduce a data sampling strategy with real and synthetic mixtures that generalizes well to real conversational speech. With this model and data sampling technique, we show significant improvements in speaker-attributed word error rate (SA-WER) on Hub5 data.
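The ordering problem the abstract attributes to PIT can be illustrated with a minimal sketch. This is not the authors' model or loss: the function name `pit_mse`, the mean-squared-error criterion, and the random two-speaker signals are all assumptions chosen for illustration. The point is only that a permutation-invariant loss scores two channel orderings identically, so nothing ties an output channel to a fixed speaker from one chunk to the next.

```python
from itertools import permutations

import numpy as np


def pit_mse(estimates, references):
    """Hypothetical helper: minimum MSE over all speaker orderings.

    Returns (best_loss, best_perm), where best_perm[i] is the index of the
    reference matched to output channel i.
    """
    n = len(references)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        loss = float(np.mean([
            np.mean((estimates[i] - references[p]) ** 2)
            for i, p in enumerate(perm)
        ]))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two stand-in "speaker" signals (random data, for illustration only).
rng = np.random.default_rng(0)
refs = rng.standard_normal((2, 160))

est_a = refs.copy()        # outputs in the same order as the references
est_b = refs[::-1].copy()  # identical outputs with the channels swapped

loss_a, perm_a = pit_mse(est_a, refs)
loss_b, perm_b = pit_mse(est_b, refs)
print(loss_a, perm_a)
print(loss_b, perm_b)
```

Both orderings reach the same loss under different permutations, which is why chunk-wise outputs of a PIT-trained separator need either a stitching step or, as the abstract proposes, speaker conditioning to fix the order.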
Pages: 5388-5392
Number of pages: 5