PULMO: Precise utterance-level modeling for speech anti-spoofing

Cited by: 0

Authors:

Yoon, Sunghyun [1]

Affiliations:

[1] Kongju Natl Univ, Dept Artificial Intelligence, Cheonan, South Korea

Source:

APPLIED ACOUSTICS | 2025 / Vol. 227

Funding:

National Research Foundation, Singapore;

Keywords:

Padding; Segmentation; Spoofing detection; Truncation; Utterance-level modeling; Variable-length; SPEAKER;

DOI:

10.1016/j.apacoust.2024.110221

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

In recent years, most state-of-the-art approaches for spoofed speech detection have been based on convolutional neural networks (CNNs). Most neural networks, including CNNs, are trained in minibatch units, where all input data in each minibatch must have the same shape. Because utterances are variable-length sequences, they cannot be fed directly into networks in minibatch units, so each utterance is first either padded or truncated. However, modeling a padded or truncated utterance, rather than the original one, makes it infeasible to capture the entire context as is: padding can propagate unwanted information, such as artifacts, in the original utterance, and truncation inevitably loses some information. With these information distortions, the model can get stuck in a suboptimal solution. To fill this gap, we propose PULMO, a method for precise utterance-level modeling that enables minibatch-wise utterance-level modeling of variable-length utterances while minimizing the information distortions. The proposed method comprises sequence segmentation followed by segment aggregation. Sequence segmentation feeds variable-length utterances in minibatch units by decomposing each of them into fixed-length segments, which enables parallel processing of variable-length utterances without uncertainty in input length. Segment aggregation aggregates the segment embeddings by utterance to encode the entire information of each utterance. Experimental results on the evaluation trials of ASVspoof 2019 and 2021 indicate that the proposed method achieves relative equal error rate reductions of up to 84.9% and 97.6% in the logical and physical access scenarios, respectively. Furthermore, the proposed method reduced the FLOPs for an epoch by 6%.
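
The two steps named in the abstract, sequence segmentation followed by segment aggregation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, zero-padding of the final partial segment, the mean-over-time stand-in for a CNN encoder, and mean pooling as the aggregation rule are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def segment_utterance(features, seg_len):
    """Split a (T, D) feature sequence into (num_segs, seg_len, D) segments.

    Assumption: the last partial segment is zero-padded; the paper may
    handle the remainder differently.
    """
    num_frames, dim = features.shape
    num_segs = max(1, -(-num_frames // seg_len))  # ceiling division
    padded = np.zeros((num_segs * seg_len, dim), dtype=features.dtype)
    padded[:num_frames] = features
    return padded.reshape(num_segs, seg_len, dim)

def build_minibatch(utterances, seg_len):
    """Stack the segments of all utterances into one fixed-shape batch.

    Returns the batch and, for each segment, the index of its utterance.
    """
    segments, owner = [], []
    for utt_idx, feats in enumerate(utterances):
        segs = segment_utterance(feats, seg_len)
        segments.append(segs)
        owner.extend([utt_idx] * len(segs))
    return np.concatenate(segments, axis=0), np.asarray(owner)

def toy_encoder(batch):
    """Hypothetical stand-in for a CNN: mean over time, (N, L, D) -> (N, D)."""
    return batch.mean(axis=1)

def aggregate_by_utterance(seg_emb, owner, num_utts):
    """Mean-pool segment embeddings per utterance (assumed aggregation rule)."""
    sums = np.zeros((num_utts, seg_emb.shape[1]), dtype=seg_emb.dtype)
    np.add.at(sums, owner, seg_emb)  # unbuffered scatter-add by utterance index
    counts = np.bincount(owner, minlength=num_utts)[:, None]
    return sums / counts

# Three utterances of different lengths (frames), 4-dimensional features.
rng = np.random.default_rng(0)
utterances = [rng.standard_normal((t, 4)) for t in (250, 100, 430)]
batch, owner = build_minibatch(utterances, seg_len=100)
utt_emb = aggregate_by_utterance(toy_encoder(batch), owner, len(utterances))
print(batch.shape, utt_emb.shape)  # (9, 100, 4) (3, 4)
```

In this sketch every segment has the same shape, so the whole batch can be processed in parallel, and the `owner` index is what lets the segment embeddings be regrouped into one embedding per utterance afterwards.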

Pages: 12

50 items in total

[1] Utterance-level boosting of HMM speech recognizers
Meyer, G
2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002: 109 - 112
[2] The Impact of Silence on Speech Anti-Spoofing
Zhang, Yuxiang
Li, Zhuo
Lu, Jingze
Hua, Hua
Wang, Wenchao
Zhang, Pengyuan
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 3374 - 3389
[3] CausalDialogue: Modeling Utterance-level Causality in Conversations
Tuan, Yi-Lin
Albalak, Alon
Xu, Wenda
Saxon, Michael
Pryor, Connor
Getoor, Lise
Wang, William Yang
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023: 12506 - 12522
[4] Modified Cepstral Feature for Speech Anti-spoofing
何明瑞
ZAIDI Syed Faham Ali
田娩鑫
单志勇
江政儒
徐珑婷
Journal of Donghua University(English Edition), 2023, 40 (02) : 193 - 201
[5] Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts
Sebastian, Jilt
Pierucci, Piero
INTERSPEECH 2019, 2019: 51 - 55
[6] Comparison of Acoustic and Kinematic Approaches to Measuring Utterance-Level Speech Variability
Howell, Peter
Anderson, Andrew J.
Bartrip, Jon
Bailey, Eleanor
JOURNAL OF SPEECH LANGUAGE AND HEARING RESEARCH, 2009, 52 (04) : 1088 - 1096
[7] RawSpectrogram: On the Way to Effective Streaming Speech Anti-Spoofing
Grinberg, Petr
Shikhov, Vladislav
IEEE ACCESS, 2023, 11 : 109928 - 109938
[8] Learning Utterance-level Representations with Label Smoothing for Speech Emotion Recognition
Huang, Jian
Tao, Jianhua
Liu, Bin
Lian, Zheng
INTERSPEECH 2020, 2020: 4079 - 4083
[9] Transferable Waveform-level Adversarial Attack against Speech Anti-spoofing Models
Huang, Bingyuan
Cui, Sanshuai
Kang, Xiangui
Li, Enping
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023: 2315 - 2320
[10] Face De-spoofing: Anti-spoofing via Noise Modeling
Jourabloo, Amin
Liu, Yaojie
Liu, Xiaoming
COMPUTER VISION - ECCV 2018, PT XIII, 2018, 11217 : 297 - 315
