PULMO: Precise utterance-level modeling for speech anti-spoofing

Citations: 0
Authors
Yoon, Sunghyun [1 ]
Affiliations
[1] Kongju Natl Univ, Dept Artificial Intelligence, Cheonan, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Padding; Segmentation; Spoofing detection; Truncation; Utterance-level modeling; Variable-length; SPEAKER;
DOI
10.1016/j.apacoust.2024.110221
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
In recent years, most state-of-the-art approaches to spoofed speech detection have been based on convolutional neural networks (CNNs). Most neural networks, including CNNs, are trained in minibatches, where all inputs in a minibatch must have the same shape. Because utterances are variable-length sequences, they cannot be fed directly into a network in minibatches, so each utterance is first either padded or truncated. However, modeling a padded or truncated utterance rather than the original one makes it infeasible to capture the entire context as is: padding can propagate unwanted information in the original utterance, such as artifacts, and truncation inevitably loses some information. With these information distortions, the model can get stuck in a suboptimal solution. To fill this gap, we propose PULMO, a method for precise utterance-level modeling that enables minibatch-wise utterance-level modeling of variable-length utterances while minimizing information distortion. The proposed method comprises sequence segmentation followed by segment aggregation. Sequence segmentation decomposes each variable-length utterance in a minibatch into fixed-length segments, which enables parallel processing of variable-length utterances without uncertainty in input length. Segment aggregation then combines the segment embeddings of each utterance to encode the information of the entire utterance. Experimental results on the ASVspoof 2019 and 2021 evaluation trials indicate that the proposed method achieves relative equal error rate reductions of up to 84.9% and 97.6% in the logical and physical access scenarios, respectively. Furthermore, the proposed method reduced the FLOPs per epoch by 6%.
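The two steps named in the abstract, sequence segmentation followed by segment aggregation, can be illustrated with a minimal NumPy sketch. This is not the author's implementation. The function names, the zero-padding of the last partial segment, the time-averaging stand-in for the CNN encoder, and the mean pooling used for aggregation are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def segment_utterance(features, seg_len):
    """Split a (T, D) feature sequence into (N, seg_len, D) fixed-length segments.

    Assumption: the final partial segment is zero-padded. The abstract does
    not say how PULMO handles the remainder.
    """
    num_frames, dim = features.shape
    num_segs = max(1, -(-num_frames // seg_len))  # ceiling division
    padded = np.zeros((num_segs * seg_len, dim), dtype=features.dtype)
    padded[:num_frames] = features
    return padded.reshape(num_segs, seg_len, dim)


def segment_minibatch(utterances, seg_len):
    """Stack the segments of all utterances and record each segment's owner."""
    per_utt = [segment_utterance(u, seg_len) for u in utterances]
    utt_ids = np.concatenate(
        [np.full(len(segs), i) for i, segs in enumerate(per_utt)]
    )
    return np.concatenate(per_utt, axis=0), utt_ids


def aggregate_segments(seg_embeddings, utt_ids, num_utts):
    """Average segment embeddings per utterance (mean pooling is an assumption)."""
    sums = np.zeros((num_utts, seg_embeddings.shape[1]))
    np.add.at(sums, utt_ids, seg_embeddings)  # unbuffered scatter-add by owner
    counts = np.bincount(utt_ids, minlength=num_utts)[:, None]
    return sums / counts


# Hypothetical usage: three utterances of 5, 8, and 3 frames, 2 features each.
utterances = [np.ones((5, 2)), np.ones((8, 2)), np.ones((3, 2))]
segments, utt_ids = segment_minibatch(utterances, seg_len=4)
# Stand-in for a segment encoder: average each segment over time.
seg_embeddings = segments.mean(axis=1)
utt_embeddings = aggregate_segments(seg_embeddings, utt_ids, len(utterances))
```

Because every segment has the same shape, the stacked array can be processed as one batch regardless of the original utterance lengths, and the owner index is enough to recover one embedding per utterance afterwards.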
Pages: 12
Related papers
50 in total
  • [1] Utterance-level boosting of HMM speech recognizers
    Meyer, G
    2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 109 - 112
  • [2] The Impact of Silence on Speech Anti-Spoofing
    Zhang, Yuxiang
    Li, Zhuo
    Lu, Jingze
    Hua, Hua
    Wang, Wenchao
    Zhang, Pengyuan
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 3374 - 3389
  • [3] CausalDialogue: Modeling Utterance-level Causality in Conversations
    Tuan, Yi-Lin
    Albalak, Alon
    Xu, Wenda
    Saxon, Michael
    Pryor, Connor
    Getoor, Lise
    Wang, William Yang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 12506 - 12522
  • [4] Modified Cepstral Feature for Speech Anti-spoofing
    何明瑞
    ZAIDI Syed Faham Ali
    田娩鑫
    单志勇
    江政儒
    徐珑婷
    Journal of Donghua University (English Edition), 2023, 40 (02) : 193 - 201
  • [5] Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts
    Sebastian, Jilt
    Pierucci, Piero
    INTERSPEECH 2019, 2019, : 51 - 55
  • [6] Comparison of Acoustic and Kinematic Approaches to Measuring Utterance-Level Speech Variability
    Howell, Peter
    Anderson, Andrew J.
    Bartrip, Jon
    Bailey, Eleanor
    JOURNAL OF SPEECH LANGUAGE AND HEARING RESEARCH, 2009, 52 (04) : 1088 - 1096
  • [7] RawSpectrogram: On the Way to Effective Streaming Speech Anti-Spoofing
    Grinberg, Petr
    Shikhov, Vladislav
    IEEE ACCESS, 2023, 11 : 109928 - 109938
  • [8] Learning Utterance-level Representations with Label Smoothing for Speech Emotion Recognition
    Huang, Jian
    Tao, Jianhua
    Liu, Bin
    Lian, Zheng
    INTERSPEECH 2020, 2020, : 4079 - 4083
  • [9] Transferable Waveform-level Adversarial Attack against Speech Anti-spoofing Models
    Huang, Bingyuan
    Cui, Sanshuai
    Kang, Xiangui
    Li, Enping
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 2315 - 2320
  • [10] Face De-spoofing: Anti-spoofing via Noise Modeling
    Jourabloo, Amin
    Liu, Yaojie
    Liu, Xiaoming
    COMPUTER VISION - ECCV 2018, PT XIII, 2018, 11217 : 297 - 315