Joint Audio-Visual Deepfake Detection

Cited by: 39
Authors
Zhou, Yipin [1 ]
Lim, Ser-Nam [1 ]
Affiliations
[1] Facebook AI, Baltimore, MD 21201 USA
Keywords
DOI
10.1109/ICCV48922.2021.01453
CLC classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Deepfakes ("deep learning" + "fake") are videos synthetically generated with AI algorithms. While they could be entertaining, they could also be misused for falsifying speeches and spreading misinformation. The process to create deepfakes involves both visual and auditory manipulations. Exploration on detecting visual deepfakes has produced a number of detection methods as well as datasets, while audio deepfakes (e.g. synthetic speech from text-tospeech or voice conversion systems) and the relationship between the video and audio modalities have been relatively neglected. In this work, we propose a novel visual / auditory deepfake joint detection task and show that exploiting the intrinsic synchronization between the visual and auditory modalities could benefit deepfake detection. Experiments demonstrate that the proposed joint detection framework outperforms independently trained models, and at the same time, yields superior generalization capability on unseen types of deepfakes.
Pages: 14780-14789
Number of pages: 10
Related papers
50 records in total
  • [1] Joint Audio-Visual Attention with Contrastive Learning for More General Deepfake Detection
    Zhang, Yibo
    Lin, Weiguo
    Xu, Junfeng
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (05)
  • [2] Temporal Feature Prediction in Audio-Visual Deepfake Detection
    Gao, Yuan
    Wang, Xuelong
    Zhang, Yu
    Zeng, Ping
    Ma, Yingjie
    [J]. ELECTRONICS, 2024, 13 (17)
  • [3] Audio-visual deepfake detection using articulatory representation learning
    Wang, Yujia
    Huang, Hua
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 248
  • [4] AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake
    Yang, Wenyuan
    Zhou, Xiaoyu
    Chen, Zhikai
    Guo, Bofei
    Ba, Zhongjie
    Xia, Zhihua
    Cao, Xiaochun
    Ren, Kui
    [J]. IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2023, 18 : 2015 - 2029
  • [5] Not made for each other - Audio-Visual Dissonance-based Deepfake Detection and Localization
    Chugh, Komal
    Gupta, Parul
    Dhall, Abhinav
    Subramanian, Ramanathan
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 439 - 447
  • [6] A JOINT AUDIO-VISUAL APPROACH TO AUDIO LOCALIZATION
    Jensen, Jesper Rindom
    Christensen, Mads Graesboll
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 454 - 458
  • [7] Discovering joint audio-visual codewords for video event detection
    Jhuo, I-Hong
    Ye, Guangnan
    Gao, Shenghua
    Liu, Dong
    Jiang, Yu-Gang
    Lee, D. T.
    Chang, Shih-Fu
    [J]. MACHINE VISION AND APPLICATIONS, 2014, 25 (01) : 33 - 47
  • [8] Emotions Don't Lie: An Audio-Visual Deepfake Detection Method using Affective Cues
    Mittal, Trisha
    Bhattacharya, Uttaran
    Chandra, Rohan
    Bera, Aniket
    Manocha, Dinesh
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 2823 - 2832
  • [9] Joint watermarking of audio-visual data
    Dittmann, J
    Steinebach, M
    [J]. 2001 IEEE FOURTH WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 2001, : 601 - 606
  • [10] Audio-visual event detection based on mining of semantic audio-visual labels
    Goh, KS
    Miyahara, K
    Radhakrishnan, R
    Xiong, ZY
    Divakaran, A
    [J]. STORAGE AND RETRIEVAL METHODS AND APPLICATIONS FOR MULTIMEDIA 2004, 2004, 5307 : 292 - 299