Spatio-Temporal Catcher: a Self-Supervised Transformer for Deepfake Video Detection

Cited by: 3
Authors
Li, Maosen [1 ,2 ]
Li, Xurong [2 ]
Yu, Kun [2 ]
Deng, Cheng [1 ]
Huang, Heng [3 ]
Mao, Feng [2 ]
Xue, Hui [2 ]
Li, Minghao [2 ]
Affiliations
[1] Xidian Univ, Xian, Shaanxi, Peoples R China
[2] Alibaba Grp, Hangzhou, Zhejiang, Peoples R China
[3] Univ Maryland, College Pk, MD USA
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
deepfake video detection; self-supervised learning; video analysis;
DOI
10.1145/3581783.3613842
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As deepfake technology becomes increasingly sophisticated and accessible, it is easier for individuals with malicious intent to create convincing fake content, which has raised considerable concern in the multimedia and computer vision community. Despite significant advances in deepfake video detection, most existing methods focus mainly on model architecture and training processes, with little attention to the data perspective. In this paper, we argue that data quality has become the main bottleneck of current research. Specifically, in the pre-training phase, the domain shift between pre-training and target datasets may lead to poor generalization ability, while in the training phase, the low fidelity of existing datasets leads detectors to rely on specific low-level visual artifacts or inconsistencies. To overcome these shortcomings: (1) in the pre-training phase, we pre-train our model on high-quality facial videos using data-efficient reconstruction-based self-supervised learning to address domain shift; (2) in the training phase, we develop a novel spatio-temporal generator that can synthesize varied, high-quality "fake" videos in large quantities at low cost, enabling our model to learn more general spatio-temporal representations in a self-supervised manner; (3) additionally, to take full advantage of the synthetic "fake" videos, we adopt diversity losses at both the frame and video levels to explore the diversity of clues in "fake" videos. Our proposed framework is data-efficient and does not require any real-world deepfake videos. Extensive experiments demonstrate that our method significantly improves generalization capability. In particular, on the most challenging CDF and DFDC datasets, our method outperforms the baselines by 8.88 and 7.73 percentage points, respectively. Our code and Appendix can be found at github.com/llosta/STC.
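The frame- and video-level diversity losses mentioned in the abstract can be sketched, under assumptions, as a pairwise feature-decorrelation penalty: redundant clues produce highly similar feature vectors, so penalizing mean pairwise cosine similarity pushes the detector toward diverse clues. The paper's exact formulation may differ; the function names and weighting below are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def pairwise_cosine_diversity(feats, eps=1e-8):
    """Mean pairwise cosine similarity of a set of feature vectors.

    feats: (N, D) array. Lower values mean more diverse (less redundant)
    features; identical vectors give 1.0, orthogonal vectors give 0.0.
    """
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    sim = f @ f.T
    n = feats.shape[0]
    # Exclude self-similarity on the diagonal before averaging.
    off_diag = sim[~np.eye(n, dtype=bool)]
    return off_diag.mean()

def diversity_loss(frame_feats, video_feats, w_frame=1.0, w_video=1.0):
    """Hypothetical combined loss penalizing redundant clues at both the
    frame level and the video (clip) level, as the abstract describes."""
    return (w_frame * pairwise_cosine_diversity(frame_feats)
            + w_video * pairwise_cosine_diversity(video_feats))
```

In a training loop, `frame_feats` would be per-frame embeddings of a synthesized "fake" clip and `video_feats` clip-level embeddings across a batch; minimizing this term alongside the detection loss discourages the model from latching onto a single low-level artifact.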
Pages: 8707-8718
Page count: 12
Related Papers
50 records in total
  • [41] Adherent Raindrop Removal with Self-Supervised Attention Maps and Spatio-Temporal Generative Adversarial Networks
    Alletto, Stefano
    Carlin, Casey
    Rigazio, Luca
    Ishii, Yasunori
    Tsukizawa, Sotaro
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 2329 - 2338
  • [42] Self-Supervised Global Spatio-Temporal Interaction Pre-Training for Group Activity Recognition
    Du, Zexing
    Wang, Xue
    Wang, Qing
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 5076 - 5088
  • [43] Spatio-Temporal AutoEncoder for Video Anomaly Detection
    Zhao, Yiru
    Deng, Bing
    Shen, Chen
    Liu, Yao
    Lu, Hongtao
    Hua, Xian-Sheng
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1933 - 1941
  • [44] Video anomaly detection with spatio-temporal dissociation
    Chang, Yunpeng
    Tu, Zhigang
    Xie, Wei
    Luo, Bin
    Zhang, Shifu
    Sui, Haigang
    Yuan, Junsong
    PATTERN RECOGNITION, 2022, 122
  • [45] Video Relation Detection with Spatio-Temporal Graph
    Qian, Xufeng
    Zhuang, Yueting
    Li, Yimeng
    Xiao, Shaoning
    Pu, Shiliang
    Xiao, Jun
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 84 - 93
  • [46] Spatio-temporal Matching for Human Detection in Video
    Zhou, Feng
    De la Torre, Fernando
    COMPUTER VISION - ECCV 2014, PT VI, 2014, 8694 : 62 - 77
  • [47] Spatio-temporal detection of video moving object
    Ren, Ming-Yi
    Li, Xiao-Feng
    Li, Zai-Ming
    Guangdianzi Jiguang/Journal of Optoelectronics Laser, 2009, 20 (07): : 911 - 915
  • [48] Self-Supervised Temporal Sensitive Hashing for Video Retrieval
    Li, Qihua
    Tian, Xing
    Ng, Wing W. Y.
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 9021 - 9035
  • [49] Parallel Spatio-Temporal Attention Transformer for Video Frame Interpolation
    Ning, Xin
    Cai, Feifan
    Li, Yuhang
    Ding, Youdong
    ELECTRONICS, 2024, 13 (10)
  • [50] Spatio-Temporal Graph Convolution Transformer for Video Question Answering
    Tang, Jiahao
    Hu, Jianguo
    Huang, Wenjun
    Shen, Shengzhi
    Pan, Jiakai
    Wang, Deming
    Ding, Yanyu
    IEEE ACCESS, 2024, 12 : 131664 - 131680