COMPACT NETWORK FOR SPEAKERBEAM TARGET SPEAKER EXTRACTION

Cited by: 0
Authors
Delcroix, Marc [1 ]
Zmolikova, Katerina [2 ]
Ochiai, Tsubasa [1 ]
Kinoshita, Keisuke [1 ]
Araki, Shoko [1 ]
Nakatani, Tomohiro [1 ]
Affiliations
[1] NTT Corp, NTT Commun Sci Labs, Kyoto, Japan
[2] Brno Univ Technol, Speech FIT & IT4I Ctr Excellence, Brno, Czech Republic
Keywords
Target speech extraction; Neural network; Adaptation; Auxiliary feature; Speech enhancement; Speech separation
DOI
10.1109/icassp.2019.8683087
Chinese Library Classification: O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
Speech separation, which separates a mixture of speech signals into each of its sources, has been an active research topic for a long time and has seen recent progress with the advent of deep learning. A related problem is target speaker extraction, i.e., extraction of only the speech of a target speaker from a mixture, given characteristics of his/her voice. We have recently proposed SpeakerBeam, a neural network-based target speaker extraction method. SpeakerBeam uses a speech extraction network that is adapted to the target speaker using auxiliary features derived from an adaptation utterance of that speaker. Initially, we implemented SpeakerBeam with a factorized adaptation layer, which consists of several parallel linear transformations combined with weights derived from the auxiliary features. The factorized layer is effective for target speech extraction, but it requires a large number of parameters. In this paper, we propose to simply scale the activations of a hidden layer of the speech extraction network with weights derived from the auxiliary features. This simpler approach greatly reduces the number of model parameters, by up to 60%, making it much more practical while maintaining a similar level of performance. We tested our approach on simulated and real noisy and reverberant mixtures, showing the potential of SpeakerBeam for real-life applications. Moreover, we showed that the speech extraction performance of SpeakerBeam compares favorably with that of a state-of-the-art speech separation method with a similar network configuration.
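To make the two adaptation schemes described in the abstract concrete, the sketch below (PyTorch, not taken from the authors' code) contrasts a factorized adaptation layer, i.e., several parallel linear transformations combined with weights derived from the auxiliary speaker features, with the proposed element-wise scaling of hidden activations. All class names, layer sizes, and the auxiliary-to-weight mappings are illustrative assumptions; printing the parameter counts merely illustrates why the scaling variant is much smaller.

```python
# Minimal sketch of the two adaptation schemes (illustrative, not the paper's code).
import torch
import torch.nn as nn


class FactorizedAdaptationLayer(nn.Module):
    """Several parallel linear transformations, combined with weights
    derived from the auxiliary (target-speaker) features."""

    def __init__(self, in_dim, out_dim, num_bases, aux_dim):
        super().__init__()
        self.bases = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_bases)]
        )
        # Maps the auxiliary features to one combination weight per basis.
        self.aux_to_weights = nn.Linear(aux_dim, num_bases)

    def forward(self, h, aux):
        alpha = torch.softmax(self.aux_to_weights(aux), dim=-1)           # (B, K)
        outs = torch.stack([basis(h) for basis in self.bases], dim=-1)    # (B, T, D, K)
        return torch.einsum("btdk,bk->btd", outs, alpha)


class ScalingAdaptationLayer(nn.Module):
    """Simpler scheme: element-wise scaling of hidden-layer activations
    with weights derived from the auxiliary features."""

    def __init__(self, hidden_dim, aux_dim):
        super().__init__()
        self.aux_to_scale = nn.Linear(aux_dim, hidden_dim)

    def forward(self, h, aux):
        scale = self.aux_to_scale(aux)      # (B, D)
        return h * scale.unsqueeze(1)       # broadcast over time frames: (B, T, D)


if __name__ == "__main__":
    B, T, D, K, A = 2, 100, 256, 10, 128    # batch, frames, hidden dim, bases, aux dim
    h = torch.randn(B, T, D)                # hidden activations of the extraction network
    aux = torch.randn(B, A)                 # speaker features from an adaptation utterance

    fact = FactorizedAdaptationLayer(D, D, K, A)
    scal = ScalingAdaptationLayer(D, A)
    print("factorized params:", sum(p.numel() for p in fact.parameters()))
    print("scaling params:   ", sum(p.numel() for p in scal.parameters()))
    print(fact(h, aux).shape, scal(h, aux).shape)
```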
Pages: 6965-6969
Number of pages: 5