Audio Spectrogram Transformer for Synthetic Speech Detection via Speech Formant Analysis

Cited by: 4
Authors
Cuccovillo, Luca [1]
Gerhardt, Milica [1]
Aichroth, Patrick [1]
Affiliations
[1] Fraunhofer Inst Digital Media Technol IDMT, Ehrenbergstr 31, Ilmenau, Germany
Keywords
synthetic speech detection; audio deepfakes; spectrogram transformer; voice formants
DOI
10.1109/WIFS58808.2023.10374615
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In this paper, we address the challenge of synthetic speech detection, which has become increasingly important due to the latest advancements in text-to-speech and voice conversion technologies. We propose a novel multi-task neural network architecture, designed to be interpretable and specifically tailored for audio signals. The architecture includes a feature bottleneck, used to autoencode the input spectrogram, predict the fundamental frequency (f0) trajectory, and classify the speech as synthetic or natural. Hence, the synthesis detection can be considered a byproduct of attending to the energy distribution among vocal formants, providing a clear understanding of which characteristics of the input signal influence the final outcome. Our evaluation on the ASVspoof 2019 LA partition indicates better performance than the current state of the art, with an AUC score of 0.900.
Pages: 6