Deep Speaker Embedding for Speaker-Targeted Automatic Speech Recognition

Authors
Chao, Guan-Lin [1]
Shen, John Paul [1]
Lane, Ian [2]
Affiliations
[1] Carnegie Mellon Univ, Elect & Comp Engn, 5000 Forbes Ave, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Elect & Comp Engn, Language Technol Inst, 5000 Forbes Ave, Pittsburgh, PA 15213 USA
Keywords
speaker-targeted speech recognition; robust speaker embeddings; acoustic modeling
DOI
10.1145/3342827.3342847
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this work, we investigate three types of deep speaker embeddings as text-independent features for speaker-targeted speech recognition in cocktail-party environments. The text-independent speaker embedding is extracted from the target speaker's existing speech segment (i-vector and x-vector) or face image (f-vector) and is concatenated with the acoustic features of any new speech utterance to form the input features. Because the proposed model extracts the target speaker's embedding once and for all, it is computationally more efficient than many prior approaches, which estimate the target speaker's characteristics on the fly. Empirical evaluation shows that using the speaker embedding along with acoustic features reduces the Word Error Rate from 65.7% for the audio-only model to 29.5%. Among the three types of speaker embedding, the x-vector and f-vector are robust to environment variations, while the i-vector tends to overfit to the specific speaker and environment condition.
Pages: 39-43
Page count: 5
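The abstract describes concatenating a fixed, precomputed speaker embedding with the acoustic features of each new utterance. The following is a minimal sketch of one way that concatenation could look, assuming the embedding is appended to every frame; it is not the authors' implementation. The function name `append_speaker_embedding`, the frame-wise layout, and the toy dimensions are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def append_speaker_embedding(
    frames: Sequence[Sequence[float]],
    embedding: Sequence[float],
) -> List[List[float]]:
    """Append the same speaker embedding to every acoustic frame.

    Assumption: the embedding (e.g. an i-vector, x-vector, or f-vector)
    was extracted once beforehand and is reused for every frame.
    """
    if not embedding:
        raise ValueError("embedding must be non-empty")
    return [list(frame) + list(embedding) for frame in frames]


# Toy data (hypothetical sizes): 2 frames of 3 acoustic features each,
# and a 2-dimensional speaker embedding.
acoustic_frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
speaker_embedding = [1.0, -1.0]

augmented = append_speaker_embedding(acoustic_frames, speaker_embedding)
# Each augmented frame now has 3 + 2 = 5 values.
```

Because the embedding is computed once and only copied per frame, the per-utterance cost in this sketch is a simple concatenation, which is in line with the efficiency argument made in the abstract.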