Not made for each other - Audio-Visual Dissonance-based Deepfake Detection and Localization

被引:70
|
作者
Chugh, Komal [1 ]
Gupta, Parul [1 ]
Dhall, Abhinav [2 ]
Subramanian, Ramanathan [1 ]
机构
[1] Indian Inst Technol Ropar, Hussainpur, India
[2] Monash Univ, Indian Inst Technol Ropar, Hussainpur, India
关键词
Deepfake detection and localization; Neural networks; Modality dissonance; Contrastive loss;
D O I
10.1145/3394171.3413700
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed as the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to dis-harmony between the two modalities, e.g., loss of lip-sync, unnatural facial and lip movements, etc. MDS is computed as the mean aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT Datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
引用
收藏
页码:439 / 447
页数:9
相关论文
共 50 条
  • [1] Joint Audio-Visual Deepfake Detection
    Zhou, Yipin
    Lim, Ser-Nam
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 14780 - 14789
  • [2] Temporal Feature Prediction in Audio-Visual Deepfake Detection
    Gao, Yuan
    Wang, Xuelong
    Zhang, Yu
    Zeng, Ping
    Ma, Yingjie
    [J]. ELECTRONICS, 2024, 13 (17)
  • [3] Audio-visual deepfake detection using articulatory representation learning
    Wang, Yujia
    Huang, Hua
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 248
  • [4] Audio-visual event detection based on mining of semantic audio-visual labels
    Goh, KS
    Miyahara, K
    Radhakrishan, R
    Xiong, ZY
    Divakaran, A
    [J]. STORAGE AND RETRIEVAL METHODS AND APPLICATIONS FOR MULTIMEDIA 2004, 2004, 5307 : 292 - 299
  • [5] Joint Audio-Visual Attention with Contrastive Learning for More General Deepfake Detection
    Zhang, Yibo
    Lin, Weiguo
    Xu, Junfeng
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (05)
  • [6] Span-based Audio-Visual Localization
    Wu, Yiling
    Zhang, Xinfeng
    Wang, Yaowei
    Huang, Qingming
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 1252 - 1260
  • [7] Binaural Audio-Visual Localization
    Wu, Xinyi
    Wu, Zhenyao
    Ju, Lili
    Wang, Song
    [J]. THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2961 - 2968
  • [8] A JOINT AUDIO-VISUAL APPROACH TO AUDIO LOCALIZATION
    Jensen, Jesper Rindom
    Christensen, Mads Graesboll
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 454 - 458
  • [9] Egocentric Audio-Visual Object Localization
    Huang, Chao
    Flan, Yapeng
    Kurnar, Anurag
    Xu, Chenliang
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 22910 - 22921
  • [10] Emotions Don't Lie: An Audio-Visual Deepfake Detection Method using Affective Cues
    Mittal, Trisha
    Bhattacharya, Uttaran
    Chandra, Rohan
    Bera, Aniket
    Manocha, Dinesh
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 2823 - 2832