Deep Speaker Embedding for Speaker-Targeted Automatic Speech Recognition

Cited by: 0

Authors:

Chao, Guan-Lin [1]

Shen, John Paul [1]

Lane, Ian [2]

Affiliations:

[1] Carnegie Mellon Univ, Elect & Comp Engn, 5000 Forbes Ave, Pittsburgh, PA 15213 USA

[2] Carnegie Mellon Univ, Elect & Comp Engn, Language Technol Inst, 5000 Forbes Ave, Pittsburgh, PA 15213 USA

Source:

NLPIR 2019: 2019 3RD INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL | 2019

Keywords:

speaker-targeted speech recognition; robust speaker embeddings; acoustic modeling;

DOI:

10.1145/3342827.3342847

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this work, we investigate three types of deep speaker embeddings as text-independent features for speaker-targeted speech recognition in cocktail-party environments. The text-independent speaker embedding is extracted from the target speaker's existing speech segment (i-vector and x-vector) or face image (f-vector) and is concatenated with the acoustic features of each new speech utterance to form the input features. Because the proposed model extracts the target speaker's embedding only once, it is computationally more efficient than many prior approaches that estimate the target speaker's characteristics on the fly. Empirical evaluation shows that using the speaker embedding along with acoustic features reduces the word error rate (WER) from 65.7% for the audio-only model to 29.5%. Among the three types of speaker embedding, the x-vector and f-vector are robust to environment variations, while the i-vector tends to overfit to the specific speaker and environment condition.
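The concatenation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which this record does not show; the function name `append_speaker_embedding`, the frame count, and the feature and embedding dimensions are all hypothetical. The sketch only demonstrates the general idea of computing one fixed embedding per target speaker and appending it to every frame of acoustic features.

```python
from typing import List, Sequence


def append_speaker_embedding(
    frames: Sequence[Sequence[float]],
    embedding: Sequence[float],
) -> List[List[float]]:
    """Append the same fixed speaker embedding to every acoustic frame.

    frames:    one feature vector per time step (hypothetical layout).
    embedding: a single vector for the target speaker, computed once
               (for example an i-vector, x-vector, or f-vector).
    """
    emb = list(embedding)
    # The embedding does not change over time, so it is simply repeated
    # at the end of each frame's feature vector.
    return [list(frame) + emb for frame in frames]


# Hypothetical sizes: 3 frames of 4-dim acoustic features, 2-dim embedding.
frames = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
speaker_embedding = [-1.0, 2.0]

model_input = append_speaker_embedding(frames, speaker_embedding)
print(len(model_input), len(model_input[0]))  # 3 6
```

Because the embedding is computed once per speaker and reused for every new utterance, the per-utterance cost in this sketch is only the concatenation itself, which matches the efficiency argument made in the abstract.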

Pages: 39-43

Page count: 5
