Conformer-Based Speaker Recognition Model for Real-Time Multi-Scenarios

Cited by: 0
Authors
Xuan, Xi [1]
Han, Runping [2]
Gao, Jingxin [1]
Affiliations
[1] School of Arts and Sciences, Beijing Institute of Fashion Technology, Beijing 100029, China
[2] School of Fashion, Beijing Institute of Fashion Technology, Beijing 100029, China
Keywords
Real time systems - Speech recognition;
DOI
Not available
Abstract
Speaker verification systems perform poorly in several scenarios: cross-domain utterances, long-duration utterances, and noisy utterances. To address this, the paper designs PMS-Conformer, a real-time, robust speaker recognition model based on Conformer. Its architecture is inspired by the state-of-the-art MFA-Conformer, and it improves on MFA-Conformer's acoustic feature extractor, network components, and loss calculation module, yielding a novel and effective acoustic feature extractor and a robust speaker embedding extractor with high generalization capability. PMS-Conformer is trained on the VoxCeleb1&2 datasets and compared with the baseline MFA-Conformer and ECAPA-TDNN in extensive speaker verification experiments. The results show that the automatic speaker verification (ASV) system built with PMS-Conformer is more competitive than those built with MFA-Conformer and ECAPA-TDNN on VoxMovies (cross-domain utterances), SITW (long-duration utterances), and VoxCeleb-O with noise added to its utterances. Moreover, the speaker embedding extractor of PMS-Conformer has significantly fewer trainable parameters and a significantly lower real-time factor (RTF) than that of ECAPA-TDNN. Overall, the evaluation results demonstrate that PMS-Conformer performs well in real-time, multi-scenario settings. © 2024 Journal of Computer Engineering and Applications Beijing Co., Ltd.; Science Press. All rights reserved.
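The abstract reports RTF for the speaker embedding extractor but does not say how it was measured. As a rough illustration only, RTF is commonly taken as processing time divided by audio duration, so values below 1 mean faster than real time. The sketch below assumes that definition; `real_time_factor`, `extract`, and `clock` are hypothetical names introduced here, not part of the paper or of any library.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    extract: Callable[[Sequence[float]], object],
    waveform: Sequence[float],
    sample_rate: int,
    repeats: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Mean processing time of `extract` divided by the audio duration.

    `extract` stands in for any embedding extractor (an assumption here);
    `clock` is injectable so the measurement can be checked deterministically.
    """
    if sample_rate <= 0 or repeats <= 0 or len(waveform) == 0:
        raise ValueError("need a non-empty waveform and positive sample_rate/repeats")

    audio_seconds = len(waveform) / sample_rate

    total_elapsed = 0.0
    for _ in range(repeats):
        start = clock()
        extract(waveform)  # the call being timed
        total_elapsed += clock() - start

    return (total_elapsed / repeats) / audio_seconds


if __name__ == "__main__":
    # Placeholder "extractor": averages the samples. A real model would go here.
    def dummy_extract(samples: Sequence[float]) -> float:
        return sum(samples) / len(samples)

    one_second = [0.0] * 16000  # 1 s of silence at 16 kHz
    print(real_time_factor(dummy_extract, one_second, sample_rate=16000))
```

Averaging over several repeats reduces timer noise. A real measurement would also fix the hardware and batch size, neither of which the abstract specifies.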
Pages: 147-156