AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake

Cited by: 40
Authors
Yang, Wenyuan [1 ]
Zhou, Xiaoyu [2 ]
Chen, Zhikai [3 ]
Guo, Bofei [4 ]
Ba, Zhongjie [5 ,6 ]
Xia, Zhihua [7 ]
Cao, Xiaochun [1 ]
Ren, Kui [5 ,6 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Cyber Sci & Technol, Shenzhen Campus, Shenzhen 518107, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Software Engn, Chengdu 610054, Peoples R China
[3] Tencent Secur, Zhuque Lab, Shenzhen 518054, Peoples R China
[4] Peking Univ, Sch Elect & Comp Engn, Shenzhen 518055, Peoples R China
[5] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310058, Peoples R China
[6] ZJU Hangzhou Global Sci & Technol Innovat Ctr, Hangzhou 311200, Peoples R China
[7] Jinan Univ, Coll Cyber Secur, Guangzhou 510632, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Deepfakes; Visualization; Forgery; Detectors; Feature extraction; Faces; Electronic mail; Deepfake detection; multi-modal; audio-visual; joint learning; DISTILLATION
DOI
10.1109/TIFS.2023.3262148
CLC number
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Recently, deepfakes have raised severe concerns about the authenticity of online media. Prior works on deepfake detection have made many efforts to capture intra-modal artifacts. However, deepfake videos in real-world scenarios often combine audio and visual streams. In this paper, we propose Audio-Visual Joint Learning for Detecting Deepfake (AVoiD-DF), which exploits audio-visual inconsistency for multi-modal forgery detection. Specifically, AVoiD-DF begins by embedding temporal-spatial information in a Temporal-Spatial Encoder. A Multi-Modal Joint-Decoder is then designed to fuse multi-modal features and jointly learn their inherent relationships. Afterward, a Cross-Modal Classifier is devised to detect manipulation from inter-modal and intra-modal disharmony. Since existing datasets for deepfake detection mainly focus on one modality and cover only a few forgery methods, we build DefakeAVMiT, a novel benchmark for multi-modal deepfake detection. DefakeAVMiT contains ample visual samples with corresponding audio, where either modality may be maliciously modified by multiple deepfake methods. Experimental results on DefakeAVMiT, FakeAVCeleb, and DFDC demonstrate that AVoiD-DF outperforms many state-of-the-art methods in deepfake detection. Our proposed method also generalizes better across various forgery techniques.
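The abstract outlines a three-stage architecture: a Temporal-Spatial Encoder per modality, a Multi-Modal Joint-Decoder for fusion, and a Cross-Modal Classifier. Below is a minimal PyTorch sketch of that pipeline. Every module body, dimension, and the bidirectional cross-attention fusion scheme are assumptions made purely for illustration; the record does not specify the authors' implementation, and the paper itself should be consulted for the actual design.

```python
# Illustrative sketch of an AVoiD-DF-style pipeline (NOT the authors' code).
# Module internals, feature dimensions, and the fusion mechanism are assumed.
import torch
import torch.nn as nn


class TemporalSpatialEncoder(nn.Module):
    """Embeds a sequence of per-frame (visual) or per-segment (audio)
    feature tokens with self-attention (assumed transformer encoder)."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):                 # x: (batch, time, dim)
        return self.encoder(x)


class MultiModalJointDecoder(nn.Module):
    """Fuses audio and visual tokens with bidirectional cross-attention,
    one plausible reading of 'jointly learn inherent relationships'."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, v):
        v_fused, _ = self.a2v(v, a, a)    # visual queries attend to audio
        a_fused, _ = self.v2a(a, v, v)    # audio queries attend to visual
        return a_fused, v_fused


class CrossModalClassifier(nn.Module):
    """Scores real-vs-fake from pooled fused features, so that
    inter-modal disharmony can drive the decision."""
    def __init__(self, dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2))

    def forward(self, a, v):
        pooled = torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.head(pooled)          # logits: (batch, 2)


class AVoiDDF(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.audio_enc = TemporalSpatialEncoder(dim)
        self.visual_enc = TemporalSpatialEncoder(dim)
        self.decoder = MultiModalJointDecoder(dim)
        self.classifier = CrossModalClassifier(dim)

    def forward(self, audio_feats, visual_feats):
        a = self.audio_enc(audio_feats)
        v = self.visual_enc(visual_feats)
        a, v = self.decoder(a, v)
        return self.classifier(a, v)


# Usage: an 8-step clip with 256-d audio and visual feature tokens.
model = AVoiDDF()
logits = model(torch.randn(1, 8, 256), torch.randn(1, 8, 256))
```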
Pages: 2015-2029
Page count: 15