Precise fault detection is essential to enhance the safety and reliability of unmanned aerial vehicle (UAV) systems. Owing to their poor generalization capability, model-based approaches are severely restricted by their heavy dependence on aircraft dynamics. In this study, a novel data-driven method, the spatial-temporal graph attention Transformer network (ST-GATrans), is proposed to perform intelligent fault detection for UAVs through joint spatial-temporal learning on multivariate flight data. In the designed architecture, an enhanced graph attention network (EGAT) with graph multiheaded self-attention (GMSA) is first constructed to extract the deep spatial connections inherent in high-dimensional multivariate flight data. After establishing the spatial associations among different status variables, a Transformer encoder combining multiheaded self-attention (MSA) and convolutional token fusion (CTF) is developed to mine more comprehensive temporal features: MSA models long-range temporal dependencies, while the CTF operation captures indispensable local details. By modeling these spatial-temporal connections, future values of the multivariate flight data can be predicted precisely, and a fault is detected when the residual between the predicted value and the ground truth exceeds a predefined threshold. To eliminate the negative impact of noise and large fluctuations in flight data, a bidirectional adaptive exponentially weighted moving average (Bi-AEWMA) method is proposed to smooth the residual sequence and determine the fault threshold. The effectiveness of the proposed method is verified on real flight data collected from actual missions. In different fault scenarios, the accuracy (ACC), true positive rate (TPR), and area under the curve (AUC) of our approach exceed 96%, 97%, and 0.98, respectively. Comparative experimental results demonstrate that our approach achieves better fault detection performance than competing methods.
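To make the detection stage concrete, the following minimal Python sketch illustrates one plausible form of the residual-based decision rule: a prediction residual sequence is smoothed with a forward and a backward adaptive EWMA pass and then compared against a threshold. The specific adaptation rule for the smoothing weight, the threshold formula (mean plus k standard deviations), and the helper names aewma, bi_aewma, and detect_faults are illustrative assumptions; the abstract does not specify the exact Bi-AEWMA update or threshold derivation.

```python
import numpy as np

def aewma(residuals, alpha_min=0.05, alpha_max=0.5):
    """One-directional adaptive EWMA (assumed rule): the smoothing
    weight alpha grows with the normalized deviation of the current
    residual from the running average."""
    residuals = np.asarray(residuals, dtype=float)
    smoothed = np.empty_like(residuals)
    s = residuals[0]
    scale = residuals.std() + 1e-8  # guard against zero variance
    for i, r in enumerate(residuals):
        z = min(abs(r - s) / (3.0 * scale), 1.0)      # normalized deviation in [0, 1]
        alpha = alpha_min + (alpha_max - alpha_min) * z
        s = alpha * r + (1.0 - alpha) * s             # standard EWMA update
        smoothed[i] = s
    return smoothed

def bi_aewma(residuals):
    """Bidirectional smoothing: average a forward pass and a
    time-reversed backward pass over the residual sequence."""
    residuals = np.asarray(residuals, dtype=float)
    fwd = aewma(residuals)
    bwd = aewma(residuals[::-1])[::-1]
    return 0.5 * (fwd + bwd)

def detect_faults(y_true, y_pred, k=3.0):
    """Flag time steps whose smoothed residual exceeds a threshold.
    The mean + k*std rule is an assumption standing in for the
    paper's Bi-AEWMA-based threshold."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    residuals = np.linalg.norm(err, axis=-1)  # per-step residual magnitude
    smoothed = bi_aewma(residuals)
    threshold = smoothed.mean() + k * smoothed.std()
    return smoothed > threshold, threshold

# Toy usage: inject a fault into the last segment of a synthetic stream.
rng = np.random.default_rng(0)
y_true = rng.normal(size=(500, 8))
y_pred = y_true + rng.normal(scale=0.1, size=(500, 8))  # stand-in predictions
y_true[400:] += 2.0                                      # simulated fault
flags, thr = detect_faults(y_true, y_pred)
```

In this sketch the predictor output y_pred would come from the ST-GATrans forecasting network; the bidirectional pass suppresses isolated noise spikes that a single forward EWMA would momentarily track, which is the stated motivation for smoothing the residual sequence before thresholding.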