Temporal Feature Prediction in Audio-Visual Deepfake Detection

Cited by: 0
Authors
Gao, Yuan [1 ,2 ]
Wang, Xuelong [1 ]
Zhang, Yu [1 ,3 ]
Zeng, Ping [1 ,3 ]
Ma, Yingjie [1 ]
Affiliations
[1] Beijing Electronic Science & Technology Institute, Department of Electronics & Communication Engineering, Beijing 100070, People's Republic of China
[2] State Information Center, Beijing 100045, People's Republic of China
[3] Xidian University, School of Telecommunications Engineering, Xi'an 710071, People's Republic of China
Funding
China Postdoctoral Science Foundation;
Keywords
deepfake detection; deep learning; temporal feature prediction; contrastive learning; bimodal detection;
DOI
10.3390/electronics13173433
CLC Classification Number
TP [automation technology, computer technology];
Subject Classification Code
0812;
Abstract
The rapid growth of deepfake technology, which generates realistic manipulated media, poses a significant threat because of its potential for misuse. Effective detection methods are therefore urgently needed, yet current approaches often rely on a single modality or on simple fusion of audio-visual signals, which limits their accuracy. To address this problem, we propose a deepfake detection scheme based on bimodal temporal feature prediction, which introduces the idea of temporal feature prediction into the audio-video bimodal deepfake detection task in order to fully exploit the temporal regularities of the audio and visual modalities. First, pairs of adjacent audio-video clips are used to construct input quadruples, and a dual-stream network extracts temporal feature representations from the video and audio streams, respectively. A video prediction module and an audio prediction module capture temporal inconsistencies within each modality by predicting future temporal features and comparing them with the corresponding reference features. A projection-layer network then aligns the audio and visual features, and contrastive loss functions are used to perform contrastive learning and maximize the separation between real and fake videos. Experiments on the FakeAVCeleb dataset demonstrate superior performance, with an accuracy of 84.33% and an AUC of 89.91%, outperforming existing methods and confirming the effectiveness of our approach for deepfake detection.
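The abstract describes the pipeline only at a high level; the following PyTorch sketch illustrates the general idea under stated assumptions: per-modality predictors that forecast the next clip's features and are compared against reference features, plus projection layers trained with a contrastive loss. All module choices (linear encoders, GRU predictors, cosine distances, the margin-based loss) and names such as `BimodalPredictionDetector` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of bimodal temporal feature prediction for deepfake detection.
# Encoders, predictors, dimensions, and the loss are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalPredictor(nn.Module):
    """Predict the next clip's feature vector from the current clip's feature sequence."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim) -> predicted future feature of shape (batch, dim)
        _, h = self.gru(feats)
        return self.head(h[-1])


class BimodalPredictionDetector(nn.Module):
    """Dual-stream model: per-modality temporal prediction plus cross-modal projection."""

    def __init__(self, video_dim: int = 512, audio_dim: int = 128, dim: int = 256):
        super().__init__()
        # Placeholder encoders; in practice these would be video/audio backbones.
        self.video_enc = nn.Linear(video_dim, dim)
        self.audio_enc = nn.Linear(audio_dim, dim)
        self.video_pred = TemporalPredictor(dim)
        self.audio_pred = TemporalPredictor(dim)
        # Projection layers that align the two modalities for contrastive learning.
        self.video_proj = nn.Linear(dim, dim)
        self.audio_proj = nn.Linear(dim, dim)

    def forward(self, v_cur, a_cur, v_next, a_next):
        # Input quadruple: features of the current and adjacent (next) video/audio clips.
        # v_*: (batch, time, video_dim); a_*: (batch, time, audio_dim)
        v_hat = self.video_pred(self.video_enc(v_cur))   # predicted future video feature
        a_hat = self.audio_pred(self.audio_enc(a_cur))   # predicted future audio feature
        v_ref = self.video_enc(v_next).mean(dim=1)       # reference feature of next video clip
        a_ref = self.audio_enc(a_next).mean(dim=1)       # reference feature of next audio clip

        # Intra-modal temporal inconsistency: distance between prediction and reference.
        v_incons = 1.0 - F.cosine_similarity(v_hat, v_ref, dim=-1)
        a_incons = 1.0 - F.cosine_similarity(a_hat, a_ref, dim=-1)

        # Normalized projections for audio-visual contrastive learning.
        z_v = F.normalize(self.video_proj(v_hat), dim=-1)
        z_a = F.normalize(self.audio_proj(a_hat), dim=-1)
        return v_incons, a_incons, z_v, z_a


def contrastive_loss(z_v, z_a, labels, margin: float = 0.5):
    """Pull audio/video projections together for real samples, push them apart for fakes."""
    d = 1.0 - F.cosine_similarity(z_v, z_a, dim=-1)   # cross-modal distance per sample
    real = labels.float()                             # assumed convention: 1 = real, 0 = fake
    return (real * d + (1.0 - real) * F.relu(margin - d)).mean()


if __name__ == "__main__":
    model = BimodalPredictionDetector()
    v_cur, v_next = torch.randn(4, 8, 512), torch.randn(4, 8, 512)
    a_cur, a_next = torch.randn(4, 8, 128), torch.randn(4, 8, 128)
    v_inc, a_inc, z_v, z_a = model(v_cur, a_cur, v_next, a_next)
    loss = v_inc.mean() + a_inc.mean() + contrastive_loss(z_v, z_a, torch.ones(4))
    print(loss.item())
```

In this sketch the per-modality inconsistency scores and the cross-modal contrastive term would be combined into a training objective; how the paper actually weights and classifies these signals is not specified in the abstract.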
Pages: 15