Deepfake is a deep learning-based technique that generates fake face images by mimicking the distribution of real images. Deepfake images can be used for malicious purposes such as creating fake news; hence, it is important to detect them at an early stage. Existing work on deepfake detection focuses mainly on appearance-based features and requires substantial computing resources, memory, and training data to optimize the model. Since these resources may not be available in many situations, it is important to develop a lightweight model that can work under constrained resources. In this work, we propose a shallow vision transformer for deepfake detection. Our proposed model uses an attention mechanism with a multi-head attention module. The attention mechanism highlights the important regions of deepfake images, whereas the multi-head attention module determines how much attention is given to each local-level feature of an image. Finally, a softmax layer classifies an image as real or fake. The proposed model is shallow, having 16.48 times fewer parameters and approximately 2.97 times fewer FLOPs than the baseline vision transformer. Experiments on the Real Fake Face (RFF) and Real and Fake Face Detection (RFFD) datasets show that the model achieves accuracies of 92.15% and 88.52%, respectively, outperforming many existing state-of-the-art models for deepfake detection such as GoogleNet, XceptionNet, ResNet50, MesoNet, CNN, and the baseline vision transformer. Importantly, the shallow ViT achieves an accuracy of 90.94% when only half of the RFF dataset is used for training, demonstrating its applicability in constrained scenarios.
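To make the described architecture concrete, the following is a minimal, illustrative sketch of a shallow vision transformer classifier in PyTorch. It is not the authors' implementation: the patch size, embedding dimension, depth, and number of heads chosen here are assumptions for illustration only; only the overall structure (patch tokens processed by multi-head self-attention, followed by a softmax layer for real/fake classification) reflects the description above.

# Illustrative sketch only: a shallow Vision Transformer for binary real/fake
# classification. All hyperparameters below are assumptions, not values from the paper.
import torch
import torch.nn as nn

class ShallowViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=128,
                 depth=2, heads=4, num_classes=2):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Split the image into patches and project each patch to a token embedding.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # A small stack of encoder layers; multi-head self-attention weighs how much
        # attention each patch-level (local) feature receives from the others.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 2,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)  # logits for the softmax layer

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        logits = self.head(tokens[:, 0])       # classify from the [CLS] token
        return torch.softmax(logits, dim=-1)   # probabilities: real vs. fake

if __name__ == "__main__":
    model = ShallowViT()
    probs = model(torch.randn(1, 3, 224, 224))
    print(probs)  # untrained model, so probabilities are near uniform

Keeping the depth and embedding dimension small, as in this sketch, is what reduces the parameter and FLOP counts relative to a baseline vision transformer; the actual reductions reported above (16.48x parameters, about 2.97x FLOPs) correspond to the authors' specific configuration.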