AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake

Cited by: 40
Authors
Yang, Wenyuan [1 ]
Zhou, Xiaoyu [2 ]
Chen, Zhikai [3 ]
Guo, Bofei [4 ]
Ba, Zhongjie [5 ,6 ]
Xia, Zhihua [7 ]
Cao, Xiaochun [1 ]
Ren, Kui [5 ,6 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Cyber Sci & Technol, Shenzhen Campus, Shenzhen 518107, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Software Engn, Chengdu 610054, Peoples R China
[3] Tencent Secur, Zhuque Lab, Shenzhen 518054, Peoples R China
[4] Peking Univ, Sch Elect & Comp Engn, Shenzhen 518055, Peoples R China
[5] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310058, Peoples R China
[6] ZJU Hangzhou Global Sci & Technol Innovat Ctr, Hangzhou 311200, Peoples R China
[7] Jinan Univ, Coll Cyber Secur, Guangzhou 510632, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deepfakes; Visualization; Forgery; Detectors; Feature extraction; Faces; Electronic mail; Deepfake detection; multi-modal; audio-visual; joint learning; DISTILLATION;
DOI
10.1109/TIFS.2023.3262148
Chinese Library Classification (CLC)
TP301 [Theory and Methods];
Discipline Classification Code
081202;
Abstract
Recently, deepfakes have raised severe concerns about the authenticity of online media. Prior works on deepfake detection have made many efforts to capture intra-modal artifacts. However, deepfake videos in real-world scenarios often consist of a combination of audio and visual streams. In this paper, we propose an Audio-Visual Joint Learning framework for Detecting Deepfakes (AVoiD-DF), which exploits audio-visual inconsistency for multi-modal forgery detection. Specifically, AVoiD-DF begins by embedding temporal-spatial information in a Temporal-Spatial Encoder. A Multi-Modal Joint-Decoder is then designed to fuse multi-modal features and jointly learn their inherent relationships. Afterward, a Cross-Modal Classifier is devised to detect manipulation from inter-modal and intra-modal disharmony. Since existing datasets for deepfake detection mainly focus on one modality and cover only a few forgery methods, we build a novel benchmark, DefakeAVMiT, for multi-modal deepfake detection. DefakeAVMiT contains sufficient visual clips with corresponding audio, where either modality may be maliciously modified by multiple deepfake methods. Experimental results on DefakeAVMiT, FakeAVCeleb, and DFDC demonstrate that AVoiD-DF outperforms many state-of-the-art methods in deepfake detection. Our proposed method also yields superior generalization across various forgery techniques.
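Based only on the abstract above, the described pipeline (per-modality encoding, joint decoding with cross-modal fusion, then a classifier over the fused representation) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the attention-based fusion, pooling choices, function names, and feature dimensions are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query, keyval):
    # toy scaled dot-product attention: each frame of one modality
    # attends over all frames of the other modality
    d = query.shape[-1]
    weights = softmax(query @ keyval.T / np.sqrt(d), axis=-1)
    return weights @ keyval

def avoid_df_sketch(audio_feats, visual_feats, w_cls):
    """audio_feats: (Ta, d), visual_feats: (Tv, d) -- assumed already
    produced by per-modality temporal-spatial encoders.
    Returns an illustrative fake-probability score in (0, 1)."""
    # joint decoding: let each modality attend to the other, then fuse
    a2v = cross_modal_attention(audio_feats, visual_feats)
    v2a = cross_modal_attention(visual_feats, audio_feats)
    fused = np.concatenate([a2v.mean(axis=0), v2a.mean(axis=0)])  # (2d,)
    # "cross-modal classifier" stand-in: one linear layer + sigmoid
    logit = float(fused @ w_cls)
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
score = avoid_df_sketch(rng.normal(size=(8, 16)),   # 8 audio frames
                        rng.normal(size=(10, 16)),  # 10 visual frames
                        rng.normal(size=32))        # classifier weights
```

A real system would replace the random inputs with encoder outputs and train all components jointly; the sketch only shows how inter-modal disharmony can enter the decision through cross-attention between the two feature streams.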
Pages: 2015-2029 (15 pages)