Not made for each other - Audio-Visual Dissonance-based Deepfake Detection and Localization

Cited by: 70

Authors:

Chugh, Komal ^{[1]}

Gupta, Parul ^{[1]}

Dhall, Abhinav ^{[2]}

Subramanian, Ramanathan ^{[1]}

Affiliations:

[1] Indian Inst Technol Ropar, Hussainpur, India

[2] Monash Univ, Indian Inst Technol Ropar, Hussainpur, India

Source:

MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020

Keywords:

Deepfake detection and localization; Neural networks; Modality dissonance; Contrastive loss;

DOI:

10.1145/3394171.3413700

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync or unnatural facial and lip movements. MDS is computed as the mean aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
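The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the distance metric, the exact loss form, or any threshold, so the Euclidean distance, the standard margin-based contrastive loss, the `margin` and `threshold` parameters, and all function names below are assumptions made for illustration. The chunk embeddings are taken as given (in the paper they come from learnt audio and visual networks).

```python
import math
from typing import List, Sequence


def chunk_distance(visual: Sequence[float], audio: Sequence[float]) -> float:
    """Dissimilarity of one audio-visual chunk pair (Euclidean distance is an assumption)."""
    if len(visual) != len(audio):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((v - a) ** 2 for v, a in zip(visual, audio)))


def chunk_distances(visual_chunks, audio_chunks) -> List[float]:
    """Per-chunk dissimilarity scores for a whole video."""
    if len(visual_chunks) != len(audio_chunks) or not visual_chunks:
        raise ValueError("need the same, non-zero number of chunks per modality")
    return [chunk_distance(v, a) for v, a in zip(visual_chunks, audio_chunks)]


def modality_dissonance_score(visual_chunks, audio_chunks) -> float:
    """MDS as the mean of the per-chunk dissimilarity scores."""
    dists = chunk_distances(visual_chunks, audio_chunks)
    return sum(dists) / len(dists)


def contrastive_loss(distance: float, is_real: bool, margin: float = 1.0) -> float:
    """Standard margin-based contrastive loss for one chunk (assumed form).

    Real pairs are pulled together; fake pairs are pushed at least `margin` apart.
    """
    if is_real:
        return distance ** 2
    return max(margin - distance, 0.0) ** 2


def localize(visual_chunks, audio_chunks, threshold: float) -> List[int]:
    """Indices of chunks whose dissimilarity exceeds a hypothetical threshold."""
    dists = chunk_distances(visual_chunks, audio_chunks)
    return [i for i, d in enumerate(dists) if d > threshold]


# Toy embeddings: chunk 0 is out of sync, chunk 1 matches exactly.
visual = [[0.0, 0.0], [1.0, 1.0]]
audio = [[3.0, 4.0], [1.0, 1.0]]
mds = modality_dissonance_score(visual, audio)
suspect = localize(visual, audio, threshold=1.0)
```

In this toy input the per-chunk distances are 5.0 and 0.0, so the video-level score is their mean and only the first chunk is flagged. A video-level decision would compare the score against a threshold chosen on held-out data, which is again an assumption here rather than a detail given in the abstract.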

Citation

Pages: 439-447

Page count: 9

