Self-Attention Encoding and Pooling for Speaker Recognition

Cited by: 35

Authors:

Safari, Pooyan [1]

India, Miquel [1]

Hernando, Javier [1]

Affiliations:

[1] Univ Politecn Cataluna, TALP Res Ctr, Barcelona, Spain

Source:

INTERSPEECH 2020 | 2020

Keywords:

Self-Attention Encoding; Self-Attention Pooling; Speaker Verification; Speaker Embedding;

DOI:

10.21437/Interspeech.2020-1446

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104 ; 100213 ;

Abstract:

The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory, and energy consumption. These limitations motivate researchers to design more efficient deep models. Meanwhile, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capabilities and strong performance on a variety of Natural Language Processing (NLP) applications. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from speech utterances of non-fixed length. SAEP is a stack of identical blocks that rely solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification. We have evaluated this approach on both the VoxCeleb1 and VoxCeleb2 datasets. The proposed architecture outperforms the baseline x-vector and shows performance competitive with some other convolution-based benchmarks, with a significant reduction in model size. It employs 94%, 95%, and 73% fewer parameters than ResNet-34, ResNet-50, and x-vector, respectively. This indicates that the proposed fully attention-based architecture is more efficient in extracting time-invariant features from speaker utterances.


Pages: 941-945

Number of pages: 5

50 records in total

[31] Point cloud upsampling network based on pyramid pooling and self-attention mechanism
Yang, Xiaoping
Chen, Fei
Li, Zhenhua
Liu, Guanghui
ADVANCES IN CONTINUOUS AND DISCRETE MODELS, 2024, 2024 (01):
[32] Multi-Stride Self-Attention for Speech Recognition
Han, Kyu J.
Huang, Jing
Tang, Yun
He, Xiaodong
Zhou, Bowen
INTERSPEECH 2019, 2019, : 2788 - 2792
[33] SELF-ATTENTION GUIDED DEEP FEATURES FOR ACTION RECOGNITION
Xiao, Renyi
Hou, Yonghong
Guo, Zihui
Li, Chuankun
Wang, Pichao
Li, Wanqing
2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1060 - 1065
[34] Context Matters: Self-Attention for Sign Language Recognition
Slimane, Fares Ben
Bouguessa, Mohamed
2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 7884 - 7891
[35] ESAformer: Enhanced Self-Attention for Automatic Speech Recognition
Li, Junhua
Duan, Zhikui
Li, Shiren
Yu, Xinmei
Yang, Guangguang
IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 471 - 475
[36] A lightweight transformer with linear self-attention for defect recognition
Zhai, Yuwen
Li, Xinyu
Gao, Liang
Gao, Yiping
ELECTRONICS LETTERS, 2024, 60 (17)
[37] Residential load forecasting based on LSTM fusing self-attention mechanism with pooling
Zang, Haixiang
Xu, Ruiqi
Cheng, Lilin
Ding, Tao
Liu, Ling
Wei, Zhinong
Sun, Guoqiang
ENERGY, 2021, 229
[38] NEPALI SPEECH RECOGNITION USING SELF-ATTENTION NETWORKS
Joshi, Basanta
Shrestha, Rupesh
INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2023, 19 (06) : 1769 - 1784
[39] Finger Vein Recognition Based on ResNet With Self-Attention
Zhang, Zhibo
Chen, Guanghua
Zhang, Weifeng
Wang, Huiyang
IEEE ACCESS, 2024, 12 : 1943 - 1951
[40] Multimodal cooperative self-attention network for action recognition
Zhong, Zhuokun
Hou, Zhenjie
Liang, Jiuzhen
Lin, En
Shi, Haiyong
IET IMAGE PROCESSING, 2023, 17 (06) : 1775 - 1783
