Investigating Raw Wave Deep Neural Networks for End-to-End Speaker Spoofing Detection

Cited by: 41
Authors
Dinkel, Heinrich [1]
Qian, Yanmin [1]
Yu, Kai [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
Keywords
Deep learning; end-to-end; speaker verification; spoofing detection; VERIFICATION; COUNTERMEASURES; BIOMETRICS; FEATURES;
DOI
10.1109/TASLP.2018.2851155
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Recent advances in automatic speaker verification (ASV) have led to increased interest in securing these systems for real-world applications. Malicious spoofing attempts against ASV systems can lead to serious security breaches. A spoofing attack in the context of ASV is a condition in which a (potentially harmful) person successfully masquerades as another person already known to the ASV system by falsifying or manipulating data. While most previous work focuses on enhanced, spoof-aware features, end-to-end models are a potential alternative. In this paper, we investigate the training of raw wave front-ends for deep convolutional, long short-term memory (LSTM), and vanilla neural networks, which are analyzed for their suitability for spoofing detection with regard to the influence of frame size, number of output neurons, and sequence length. A joint convolutional LSTM neural network (CLDNN) is proposed, which outperforms previous attempts on the BTAS2016 dataset (0.82% -> 0.19% HTER), making it the current state-of-the-art model for the dataset. We show that end-to-end approaches are appropriate for the important replay detection task and that the proposed model is capable of distinguishing device-invariant spoofing attempts. On the ASVspoof2015 dataset, the end-to-end solution achieves an equal error rate (EER) of 0.00% for the S1-S9 conditions. We show that the end-to-end approach based on a raw waveform input can outperform common cepstral features without the use of context-dependent frame extensions. In addition, a cross-database (domain mismatch) scenario is evaluated, which shows that the proposed CLDNN model trained on the BTAS2016 dataset achieves an EER of 25.7% on the ASVspoof2015 dataset.
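The abstract reports results as HTER (half total error rate) and EER (equal error rate). As a rough illustration of how these two metrics relate, the following is a minimal sketch, not the authors' evaluation code. It assumes that higher scores mean "more likely genuine", that a trial is accepted when its score is at or above the threshold, and that the EER can be approximated by scanning the observed scores as candidate thresholds. The function names and the toy scores are hypothetical.

```python
def far_frr(genuine, spoof, threshold):
    """Return (FAR, FRR), accepting any score >= threshold as genuine."""
    far = sum(s >= threshold for s in spoof) / len(spoof)      # spoofs accepted
    frr = sum(g < threshold for g in genuine) / len(genuine)   # genuine rejected
    return far, frr


def hter(genuine, spoof, threshold):
    """Half total error rate: the mean of FAR and FRR at a fixed threshold."""
    far, frr = far_frr(genuine, spoof, threshold)
    return 0.5 * (far + frr)


def eer(genuine, spoof):
    """Approximate the EER: pick the observed score where |FAR - FRR| is smallest.

    Returns (approximate EER, threshold at which it was found).
    """
    best = None
    for t in sorted(set(genuine) | set(spoof)):
        far, frr = far_frr(genuine, spoof, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, 0.5 * (far + frr), t)
    return best[1], best[2]


# Toy scores (hypothetical): one genuine trial scores below one spoofed trial.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
spoof_scores = [0.6, 0.3, 0.2, 0.1]

eer_value, eer_threshold = eer(genuine_scores, spoof_scores)
hter_value = hter(genuine_scores, spoof_scores, threshold=0.65)
```

The EER is threshold-free because the threshold is chosen where the two error rates meet, whereas the HTER depends on a threshold fixed beforehand, typically on held-out development data. When the two score sets are perfectly separable, some threshold gives FAR = FRR = 0, which is what an EER of 0.00% means.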
Pages: 2002-2014
Number of pages: 13
Related papers
50 in total
  • [1] END-TO-END SPOOFING DETECTION WITH RAW WAVEFORM CLDNNS
    Dinkel, Heinrich
    Chen, Nanxin
    Qian, Yanmin
    Yu, Kai
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 4860 - 4864
  • [2] A COMPLETE END-TO-END SPEAKER VERIFICATION SYSTEM USING DEEP NEURAL NETWORKS: FROM RAW SIGNALS TO VERIFICATION RESULT
    Jung, Jee-Weon
    Heo, Hee-Soo
    Yang, Il-Ho
    Shim, Hye-Jin
    Yu, Ha-Jin
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5349 - 5353
  • [3] END-TO-END OVERLAPPED SPEECH DETECTION AND SPEAKER COUNTING WITH RAW WAVEFORM
    Zhang, Wangyou
    Sun, Man
    Wang, Lan
    Qian, Yanmin
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 660 - 666
  • [4] Towards End-to-End ECG Classification With Raw Signal Extraction and Deep Neural Networks
    Xu, Sean Shensheng
    Mak, Man-Wai
    Cheung, Chi-Chung
    [J]. IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2019, 23 (04) : 1574 - 1584
  • [5] End-to-End Speaker Identification in Noisy and Reverberant Environments Using Raw Waveform Convolutional Neural Networks
    Salvati, Daniele
    Drioli, Carlo
    Foresti, Gian Luca
    [J]. INTERSPEECH 2019, 2019, : 4335 - 4339
  • [6] DEEP NEURAL NETWORK-BASED SPEAKER EMBEDDINGS FOR END-TO-END SPEAKER VERIFICATION
    Snyder, David
    Ghahremani, Pegah
    Povey, Daniel
    Garcia-Romero, Daniel
    Carmiel, Yishay
    Khudanpur, Sanjeev
    [J]. 2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016), 2016, : 165 - 170
  • [7] End-to-End Premature Ventricular Contraction Detection Using Deep Neural Networks
    Kraft, Dimitri
    Bieber, Gerald
    Jokisch, Peter
    Rumm, Peter
    [J]. SENSORS, 2023, 23 (20)
  • [8] An End-to-End Approach for Seam Carving Detection Using Deep Neural Networks
    Moreira, Thierry P.
    Santana, Marcos Cleison S.
    Passos, Leandro A.
    Papa, Joao Paulo
    da Costa, Kelton Augusto P.
    [J]. PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2022), 2022, 13256 : 447 - 457
  • [9] Tied Hidden Factors in Neural Networks for End-to-End Speaker Recognition
    Miguel, Antonio
    Llombart, Jorge
    Ortega, Alfonso
    Lleida, Eduardo
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2819 - 2823
  • [10] End-to-End Active Speaker Detection
    Alcazar, Juan Leon
    Cordes, Moritz
    Zhao, Chen
    Ghanem, Bernard
    [J]. COMPUTER VISION, ECCV 2022, PT XXXVII, 2022, 13697 : 126 - 143