Spatio-Temporal Catcher: a Self-Supervised Transformer for Deepfake Video Detection

Cited by: 3
|
Authors
Li, Maosen [1 ,2 ]
Li, Xurong [2 ]
Yu, Kun [2 ]
Deng, Cheng [1 ]
Huang, Heng [3 ]
Mao, Feng [2 ]
Xue, Hui [2 ]
Li, Minghao [2 ]
Affiliations
[1] Xidian Univ, Xian, Shaanxi, Peoples R China
[2] Alibaba Grp, Hangzhou, Zhejiang, Peoples R China
[3] Univ Maryland, College Pk, MD USA
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
deepfake video detection; self-supervised learning; video analysis;
D O I
10.1145/3581783.3613842
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Deepfake technology has become increasingly sophisticated and accessible, making it easier for individuals with malicious intent to create convincing fake content and raising considerable concern in the multimedia and computer vision communities. Despite significant advances in deepfake video detection, most existing methods focus on model architecture and training processes, paying little attention to the data perspective. In this paper, we argue that data quality has become the main bottleneck of current research. Specifically, in the pre-training phase, the domain shift between pre-training and target datasets may lead to poor generalization ability; meanwhile, in the training phase, the low fidelity of existing datasets causes detectors to rely on specific low-level visual artifacts or inconsistencies. To overcome these shortcomings: (1) in the pre-training phase, we pre-train our model on high-quality facial videos using data-efficient reconstruction-based self-supervised learning to mitigate domain shift; (2) in the training phase, we develop a novel spatio-temporal generator that synthesizes diverse, high-quality "fake" videos in large quantities at low cost, enabling our model to learn more general spatio-temporal representations in a self-supervised manner; (3) additionally, to take full advantage of the synthetic "fake" videos, we adopt diversity losses at both the frame and video levels to explore the diversity of clues in "fake" videos. Our proposed framework is data-efficient and does not require any real-world deepfake videos. Extensive experiments demonstrate that our method significantly improves generalization capability. In particular, on the most challenging CDF and DFDC datasets, our method outperforms the baselines by 8.88 and 7.73 percentage points, respectively. Our code and Appendix can be found at github.com/llosta/STC.
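The abstract describes the frame- and video-level diversity losses only at a high level; the exact formulation is in the paper itself. As an illustrative sketch (not the authors' actual implementation), one common way to realize such a loss is to penalize the mean pairwise cosine similarity among embeddings, applied once to per-frame features within a clip and once to clip-level features across a batch. All names and tensor shapes below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pairwise_diversity_loss(embeddings: torch.Tensor) -> torch.Tensor:
    """Penalize mean pairwise cosine similarity so embeddings spread out.

    embeddings: (N, D) feature matrix; returns a scalar in [-1, 1],
    lower when the N rows point in more diverse directions.
    """
    z = F.normalize(embeddings, dim=-1)          # unit-norm rows
    sim = z @ z.t()                              # (N, N) cosine similarities
    n = z.size(0)
    # average over the N*(N-1) off-diagonal pairs
    return (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))

# Frame level: diversify per-frame features within one synthetic "fake" video.
frame_feats = torch.randn(16, 128)               # 16 frames, 128-dim features
frame_loss = pairwise_diversity_loss(frame_feats)

# Video level: diversify clip-level features across a batch of synthetic videos.
video_feats = torch.randn(8, 128)                # 8 clips, 128-dim features
video_loss = pairwise_diversity_loss(video_feats)

total_diversity_loss = frame_loss + video_loss   # added to the training objective
```

Minimizing this term pushes the synthetic "fake" clues apart in feature space, which is the stated goal of the paper's diversity losses, though the paper's precise loss terms may differ.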
Pages: 8707-8718 (12 pages)
Related Papers
50 records
  • [31] Self-supervised dynamic stochastic graph network for spatio-temporal wind speed forecasting
    Wu, Tangjie
    Ling, Qiang
    ENERGY, 2024, 304
  • [32] Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos
    Shen, Zhiqiang
    Sheng, Xiaoxiao
    Fan, Hehe
    Wang, Longguang
    Guo, Yulan
    Liu, Qiong
    Wen, Hao
    Zhou, Xi
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 16534 - 16543
  • [33] Self-Supervised Depth Completion Based on Multi-Modal Spatio-Temporal Consistency
    Zhang, Quan
    Chen, Xiaoyu
    Wang, Xingguo
    Han, Jing
    Zhang, Yi
    Yue, Jiang
    REMOTE SENSING, 2023, 15 (01)
  • [34] Attention Guided Spatio-Temporal Artifacts Extraction for Deepfake Detection
    Wang, Zhibing
    Li, Xin
    Ni, Rongrong
    Zhao, Yao
    PATTERN RECOGNITION AND COMPUTER VISION, PT IV, 2021, 13022 : 374 - 386
  • [35] Self-Supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences
    Zhou, Yujie
    Duan, Haodong
    Rao, Anyi
    Su, Bing
    Wang, Jiaqi
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3825 - 3833
  • [36] Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds
    Huang, Siyuan
    Xie, Yichen
    Zhu, Song-Chun
    Zhu, Yixin
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 6515 - 6525
  • [37] Hybrid self-supervised monocular visual odometry system based on spatio-temporal features
    Yuan, Shuangjie
    Zhang, Jun
    Lin, Yujia
    Yang, Lu
    ELECTRONIC RESEARCH ARCHIVE, 2024, 32 (05): : 3543 - 3568
  • [38] Temporal Scene Montage for Self-Supervised Video Scene Boundary Detection
    Tan, Jiawei
    Yang, Pingan
    Chen, Lu
    Wang, Hongxing
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (07)
  • [39] Spatio-Temporal Inference Transformer Network for Video Inpainting
    Tudavekar, Gajanan
    Saraf, Santosh S.
    Patil, Sanjay R.
    INTERNATIONAL JOURNAL OF IMAGE AND GRAPHICS, 2023, 23 (01)
  • [40] Self-Supervised Video-Centralised Transformer for Video Face Clustering
    Wang, Yujiang
    Dong, Mingzhi
    Shen, Jie
    Luo, Yiming
    Lin, Yiming
    Ma, Pingchuan
    Petridis, Stavros
    Pantic, Maja
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 12944 - 12959