AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake

Cited by: 40
Authors
Yang, Wenyuan [1]
Zhou, Xiaoyu [2]
Chen, Zhikai [3]
Guo, Bofei [4]
Ba, Zhongjie [5, 6]
Xia, Zhihua [7]
Cao, Xiaochun [1]
Ren, Kui [5, 6]
Affiliations
[1] Sun Yat Sen Univ, Sch Cyber Sci & Technol, Shenzhen Campus, Shenzhen 518107, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Software Engn, Chengdu 610054, Peoples R China
[3] Tencent Secur, Zhuque Lab, Shenzhen 518054, Peoples R China
[4] Peking Univ, Sch Elect & Comp Engn, Shenzhen 518055, Peoples R China
[5] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310058, Peoples R China
[6] ZJU Hangzhou Global Sci & Technol Innovat Ctr, Hangzhou 311200, Peoples R China
[7] Jinan Univ, Coll Cyber Secur, Guangzhou 510632, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deepfakes; Visualization; Forgery; Detectors; Feature extraction; Faces; Electronic mail; Deepfake detection; multi-modal; audio-visual; joint learning; DISTILLATION;
DOI
10.1109/TIFS.2023.3262148
Chinese Library Classification (CLC) number
TP301 [Theory and methods];
Discipline classification code
081202;
Abstract
Recently, deepfakes have raised severe concerns about the authenticity of online media. Prior work on deepfake detection has focused largely on capturing intra-modal artifacts. However, deepfake videos in real-world scenarios often combine audio and visual content. In this paper, we propose Audio-Visual Joint Learning for Detecting Deepfake (AVoiD-DF), which exploits audio-visual inconsistency for multi-modal forgery detection. Specifically, AVoiD-DF begins by embedding temporal-spatial information in a Temporal-Spatial Encoder. A Multi-Modal Joint-Decoder is then designed to fuse multi-modal features and jointly learn their inherent relationships. Afterward, a Cross-Modal Classifier is devised to detect manipulation from inter-modal and intra-modal disharmony. Since existing datasets for deepfake detection mainly focus on one modality and cover only a few forgery methods, we build a novel benchmark, DefakeAVMiT, for multi-modal deepfake detection. DefakeAVMiT contains sufficient visuals with corresponding audio, where either modality may be maliciously modified by multiple deepfake methods. The experimental results on DefakeAVMiT, FakeAVCeleb, and DFDC demonstrate that AVoiD-DF outperforms many state-of-the-art methods in deepfake detection. Our proposed method also generalizes better across various forgery techniques.
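The abstract's central idea, that a mismatch between the audio and visual streams can signal forgery, can be illustrated with a toy score. The sketch below is not the AVoiD-DF architecture: it assumes that per-frame audio and visual embeddings already exist in a shared space, and it substitutes plain cosine similarity for the learned encoder, joint decoder, and classifier. The function names `cosine_similarity`, `inconsistency_score`, and `is_suspicious`, as well as the 0.5 threshold, are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def inconsistency_score(audio_frames: Sequence[Vector],
                        visual_frames: Sequence[Vector]) -> float:
    """Toy inter-modal inconsistency in [0, 1].

    Assumption: both inputs are time-aligned embeddings in a shared space.
    0 means the two streams agree on every frame, 1 means they are opposed.
    """
    if not audio_frames or len(audio_frames) != len(visual_frames):
        raise ValueError("need two non-empty, equally long frame sequences")
    sims = [cosine_similarity(a, v) for a, v in zip(audio_frames, visual_frames)]
    mean_sim = sum(sims) / len(sims)
    # Map mean similarity from [-1, 1] to an inconsistency score in [0, 1].
    return (1.0 - mean_sim) / 2.0


def is_suspicious(score: float, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: flag clips whose score exceeds the threshold."""
    return score > threshold
```

In a learned system such as the one the abstract describes, the embeddings and the decision boundary would come from training rather than from a fixed similarity measure and a hand-picked threshold.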
Pages: 2015-2029
Page count: 15