Spatio-Temporal Catcher: a Self-Supervised Transformer for Deepfake Video Detection

Cited by: 3
Authors
Li, Maosen [1 ,2 ]
Li, Xurong [2 ]
Yu, Kun [2 ]
Deng, Cheng [1 ]
Huang, Heng [3 ]
Mao, Feng [2 ]
Xue, Hui [2 ]
Li, Minghao [2 ]
Affiliations
[1] Xidian Univ, Xian, Shaanxi, Peoples R China
[2] Alibaba Grp, Hangzhou, Zhejiang, Peoples R China
[3] Univ Maryland, College Pk, MD USA
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
deepfake video detection; self-supervised learning; video analysis;
DOI
10.1145/3581783.3613842
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Deepfake technology has become increasingly sophisticated and accessible, making it easier for individuals with malicious intent to create convincing fake content, which has raised considerable concern in the multimedia and computer vision community. Despite significant advances in deepfake video detection, most existing methods focus mainly on model architecture and training processes, with little attention to the data perspective. In this paper, we argue that data quality has become the main bottleneck of current research. Specifically, in the pre-training phase, the domain shift between pre-training and target datasets may lead to poor generalization ability. Meanwhile, in the training phase, the low fidelity of existing datasets leads to detectors relying on specific low-level visual artifacts or inconsistencies. To overcome these shortcomings: (1) In the pre-training phase, we pre-train our model on high-quality facial videos using data-efficient, reconstruction-based self-supervised learning to mitigate domain shift. (2) In the training phase, we develop a novel spatio-temporal generator that can synthesize diverse high-quality "fake" videos in large quantities at low cost, which enables our model to learn more general spatio-temporal representations in a self-supervised manner. (3) Additionally, to take full advantage of the synthetic "fake" videos, we adopt diversity losses at both the frame and video levels to explore the diversity of clues in "fake" videos. Our proposed framework is data-efficient and does not require any real-world deepfake videos. Extensive experiments demonstrate that our method significantly improves generalization capability. In particular, on the most challenging CDF and DFDC datasets, our method outperforms the baselines by 8.88 and 7.73 percentage points, respectively. Our code and Appendix can be found at github.com/llosta/STC.
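The abstract does not specify the form of the frame- and video-level diversity losses. As an illustration only, here is a minimal sketch of one plausible formulation: the mean pairwise cosine similarity among embeddings of different synthetic "fake" clips, which a detector would minimize to push those representations apart. The function name and the formulation are assumptions for illustration, not the paper's actual losses, and NumPy stands in for a deep-learning framework.

```python
import numpy as np

def diversity_loss(features: np.ndarray) -> float:
    """Mean pairwise cosine similarity among embeddings (hypothetical form).

    Minimizing this value encourages embeddings of different synthetic
    "fake" samples to spread apart. `features` is an (N, D) array:
    per-frame embeddings for a frame-level loss, or temporally pooled
    per-video embeddings for a video-level loss.
    """
    # Unit-normalize each embedding so dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T                           # (N, N) cosine-similarity matrix
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]  # exclude trivial self-similarity
    return float(off_diag.mean())
```

Under this sketch, the frame-level term would be computed over frame embeddings within a batch and the video-level term over pooled clip embeddings, with the two terms weighted and added to the main detection objective.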
Pages: 8707 - 8718
Page count: 12
Related Papers
50 in total
  • [21] Joint spatio-temporal features constrained self-supervised electrocardiogram representation learning
    Ao Ran
    Huafeng Liu
    Biomedical Engineering Letters, 2024, 14 : 209 - 220
  • [22] Self-Supervised Regrasping using Spatio-Temporal Tactile Features and Reinforcement Learning
    Chebotar, Yevgen
    Hausman, Karol
    Su, Zhe
    Sukhatme, Gaurav S.
    Schaal, Stefan
    2016 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2016), 2016, : 1960 - 1966
  • [23] Anomaly detection for key performance indicators by fusing self-supervised spatio-temporal graph attention networks
    Chen, Ningjiang
    Tu, Huan
    Zeng, Haoyang
    Ou, Yangjie
    KNOWLEDGE-BASED SYSTEMS, 2024, 300
  • [24] Self-Supervised Spatio-Temporal Representation Learning of Satellite Image Time Series
    Dumeur, Iris
    Valero, Silvia
    Inglada, Jordi
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 4350 - 4367
  • [25] Dynamic Spatio-Temporal Graph Reasoning for VideoQA With Self-Supervised Event Recognition
    Nie, Jie
    Wang, Xin
    Hou, Runze
    Li, Guohao
    Chen, Hong
    Zhu, Wenwu
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 4145 - 4158
  • [26] Spatio-Temporal Transformer Network for Video Restoration
    Kim, Tae Hyun
    Sajjadi, Mehdi S. M.
    Hirsch, Michael
    Schoelkopf, Bernhard
    COMPUTER VISION - ECCV 2018, PT III, 2018, 11207 : 111 - 127
  • [27] Implicitly using Human Skeleton in Self-supervised Learning: Influence on Spatio-temporal Puzzle Solving and on Video Action Recognition
    Riand, Mathieu
    Dolle, Laurent
    Le Callet, Patrick
    PROCEEDINGS OF THE 2ND INTERNATIONAL CONFERENCE ON ROBOTICS, COMPUTER VISION AND INTELLIGENT SYSTEMS (ROBOVIS), 2021, : 128 - 135
  • [28] Towards Spatio-temporal Collaborative Learning: An End-to-End Deepfake Video Detection Framework
    Guo, Wenxuan
    Du, Shuo
    Deng, Huiyuan
    Yu, Zikang
    Feng, Lin
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023
  • [29] A self-supervised spatio-temporal attention network for video-based 3D infant pose estimation
    Yin, Wang
    Chen, Linxi
    Huang, Xinrui
    Huang, Chunling
    Wang, Zhaohong
    Bian, Yang
    Wan, You
    Zhou, Yuan
    Han, Tongyan
    Yi, Ming
    MEDICAL IMAGE ANALYSIS, 2024, 96
  • [30] Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
    Wang, Jiangliu
    Jiao, Jianbo
    Bao, Linchao
    He, Shengfeng
    Liu, Yunhui
    Liu, Wei
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 4001 - 4010