Investigating Raw Wave Deep Neural Networks for End-to-End Speaker Spoofing Detection

Cited: 41
Authors
Dinkel, Heinrich [1 ]
Qian, Yanmin [1 ]
Yu, Kai [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
Keywords
Deep learning; end-to-end; speaker verification; spoofing detection; VERIFICATION; COUNTERMEASURES; BIOMETRICS; FEATURES;
DOI
10.1109/TASLP.2018.2851155
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Recent advances in automatic speaker verification (ASV) have led to increased interest in securing these systems for real-world applications. Malicious spoofing attempts against ASV systems can lead to serious security breaches. A spoofing attack, in the context of ASV, is a condition in which a (potentially harmful) person successfully masquerades as another person already known to the ASV system by falsifying or manipulating data. While most previous work focuses on enhanced, spoof-aware features, end-to-end models are a potential alternative. In this paper, we investigate the training of raw wave front-ends for deep convolutional, long short-term memory (LSTM), and vanilla neural networks, which are analyzed for their suitability for spoofing detection with regard to the influence of frame size, number of output neurons, and sequence length. A joint convolutional LSTM deep neural network (CLDNN) is proposed, which outperforms previous attempts on the BTAS2016 dataset (0.82% -> 0.19% HTER), making it the current state-of-the-art model for that dataset. We show that end-to-end approaches are appropriate for the important replay detection task, and that the proposed model is capable of distinguishing device-invariant spoofing attempts. On the ASVspoof2015 dataset, the end-to-end solution achieves an equal error rate (EER) of 0.00% for the S1-S9 conditions. We show that the end-to-end approach based on raw waveform input can outperform common cepstral features without the use of context-dependent frame extensions. In addition, a cross-database (domain mismatch) scenario is evaluated, which shows that the proposed CLDNN model trained on the BTAS2016 dataset achieves an EER of 25.7% on the ASVspoof2015 dataset.
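The abstract describes a front-end that learns features directly from the raw waveform rather than from cepstral features. A minimal, untrained sketch of that idea in plain Python is shown below; the frame size, hop, filter length, and max-pooling choice are illustrative assumptions, not the paper's actual settings:

```python
import random

def frame_signal(wave, frame_size=400, hop=160):
    """Slice a raw waveform into overlapping frames.
    400 samples / 160-sample hop (25 ms / 10 ms at 16 kHz) are
    illustrative values only."""
    return [wave[i:i + frame_size]
            for i in range(0, len(wave) - frame_size + 1, hop)]

def conv1d_valid(frame, filt):
    """Valid-mode 1-D convolution (correlation) of one frame with one filter."""
    k = len(filt)
    return [sum(frame[i + j] * filt[j] for j in range(k))
            for i in range(len(frame) - k + 1)]

def frontend(frames, filters):
    """Filter each frame and max-pool the absolute response over time --
    a crude, randomly initialized stand-in for a learned convolutional
    raw-wave front-end."""
    return [[max(abs(v) for v in conv1d_valid(fr, f)) for f in filters]
            for fr in frames]

random.seed(0)
wave = [random.gauss(0, 1) for _ in range(4000)]   # 0.25 s of noise at 16 kHz
filters = [[random.gauss(0, 1) for _ in range(32)] for _ in range(4)]
frames = frame_signal(wave)
feats = frontend(frames, filters)
print(len(frames), len(feats[0]))   # frames x filter channels
```

In the actual model the filters would be learned by backpropagation and the pooled responses fed into the convolutional-LSTM (CLDNN) stack; this sketch only illustrates the framing-plus-convolution shape of a raw-wave front-end.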
Pages: 2002-2014 (13 pages)
Related Papers
(50 items total)
  • [21] Neural PLDA Modeling for End-to-End Speaker Verification
    Ramoji, Shreyas
    Krishnan, Prashant
    Ganapathy, Sriram
    [J]. INTERSPEECH 2020, 2020, : 4333 - 4337
  • [22] END-TO-END DETECTION OF ATTACKS TO AUTOMATIC SPEAKER RECOGNIZERS WITH TIME-ATTENTIVE LIGHT CONVOLUTIONAL NEURAL NETWORKS
    Monteiro, Joao
    Alam, Jahangir
    Falk, Tiago H.
    [J]. 2019 IEEE 29TH INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2019,
  • [23] Remote Sensing Airport Detection Based on End-to-End Deep Transferable Convolutional Neural Networks
    Li, Shuai
    Xu, Yuelei
    Zhu, Mingming
    Ma, Shiping
    Tang, Hong
    [J]. IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2019, 16 (10) : 1640 - 1644
  • [24] Automating detection and localization of myocardial infarction using shallow and end-to-end deep neural networks
    Jafarian, Kamal
    Vahdat, Vahab
    Salehi, Seyedmohammad
    Mobin, Mohammadsadegh
    [J]. APPLIED SOFT COMPUTING, 2020, 93
  • [25] End-to-end deep speaker embedding learning using multi-scale attentional fusion and graph neural networks
    Kashani, Hamidreza Baradaran
    Jazmi, Siyavash
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 222
  • [26] End-to-end Stereo Audio Coding Using Deep Neural Networks
    Lim, Wootaek
    Jang, Inseon
    Beack, Seungkwon
    Sung, Jongmo
    Lee, Taejin
    [J]. PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 860 - 864
  • [27] End-to-End Multimodal Emotion Recognition Using Deep Neural Networks
    Tzirakis, Panagiotis
    Trigeorgis, George
    Nicolaou, Mihalis A.
    Schuller, Bjorn W.
    Zafeiriou, Stefanos
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1301 - 1309
  • [28] MODELING NONLINEAR AUDIO EFFECTS WITH END-TO-END DEEP NEURAL NETWORKS
    Ramirez, Marco A. Martinez
    Reiss, Joshua D.
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 171 - 175
  • [29] Deep Neural Networks Based End-to-End DOA Estimation System
    Ando, Daniel Akira
    Kase, Yuya
    Nishimura, Toshihiko
    Sato, Takanori
    Ohgane, Takeo
    Ogawa, Yasutaka
    Hagiwara, Junichiro
    [J]. IEICE TRANSACTIONS ON COMMUNICATIONS, 2023, E106B (12) : 1350 - 1362
  • [30] Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
    Zhang, Ying
    Pezeshki, Mohammad
    Brakel, Philemon
    Zhang, Saizheng
    Laurent, Cesar
    Bengio, Yoshua
    Courville, Aaron
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 410 - 414