MSVT: Multiple Spatiotemporal Views Transformer for DeepFake Video Detection

Cited by: 6
Authors
Yu, Yang [1, 2]
Ni, Rongrong [1, 2]
Zhao, Yao [1, 2]
Yang, Siyuan [3]
Xia, Fen [4]
Jiang, Ning [4]
Zhao, Guoqing [4]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Technol, Beijing 100044, Peoples R China
[3] Nanyang Technol Univ, Interdisciplinary Grad Program, Rapid Rich Object Search Lab, Singapore 639798, Singapore
[4] Mashang Consumer Finance Co Ltd, Chongqing 401331, Peoples R China
Keywords
Generalized DeepFake detection; multiple spatiotemporal views; global-local transformer;
DOI
10.1109/TCSVT.2023.3281448
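Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification code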
0808 ; 0809 ;
Abstract
Recently, DeepFake videos have proliferated rapidly, raising new security concerns for society. Because they rely on a coarse spatiotemporal view, existing video-based detection methods struggle to capture fine-grained spatiotemporal information, which limits their generalization ability. In addition, although the transformer has achieved great success in the past few years, its application to DeepFake video detection still needs to be studied. To address these problems, we propose a novel Multiple Spatiotemporal Views Transformer (MSVT) with a Local Spatiotemporal View (LSV) and a Global Spatiotemporal View (GSV) to mine more detailed spatiotemporal information. First, to establish the LSV, unlike existing works that sparsely sample single frames to build the input sequence, we employ a local-consecutive temporal view to capture vital dynamic inconsistency. The frame features extracted within each group are then fed to the temporal transformer followed by the feature fusion module to generate group-level spatiotemporal features. Next, we establish the GSV by feeding all the frame features of the whole video to the temporal transformer followed by the feature fusion module. Finally, we propose a novel global-local transformer (GLT) to effectively integrate these multi-level features and mine more subtle and comprehensive features. Extensive experiments on six large datasets demonstrate that MSVT outperforms state-of-the-art detection methods.
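The contrast the abstract draws between sparse single-frame sampling and the local-consecutive temporal view can be illustrated with frame indices alone. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names, the even spacing of group starts, and the example sizes (32 frames, 4 groups of 4) are all hypothetical and do not come from the paper.

```python
def sparse_sample(num_frames, num_samples):
    """Pick evenly spaced single frames (the sparse baseline described in the abstract)."""
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def local_consecutive_groups(num_frames, num_groups, group_len):
    """Pick num_groups groups of group_len consecutive frames.

    Assumption: group start positions are evenly spaced across the video.
    The abstract does not specify how groups are placed.
    """
    if num_groups * group_len > num_frames:
        raise ValueError("not enough frames for the requested groups")
    stride = num_frames // num_groups
    return [list(range(g * stride, g * stride + group_len))
            for g in range(num_groups)]


def global_view(groups):
    """Flatten all group frames into one sequence (all sampled frames of the video)."""
    return [idx for group in groups for idx in group]


# Hypothetical sizes chosen only for illustration.
sparse = sparse_sample(32, 4)
groups = local_consecutive_groups(32, 4, 4)
flat = global_view(groups)
```

Here each entry of `groups` would feed a group-level (local) branch, while `flat` would feed a video-level (global) branch. The temporal transformer, feature fusion module, and GLT themselves are not sketched, since the abstract gives no detail on their design.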
Pages: 4462-4471
Number of pages: 10