Convolutional neural networks (CNNs) have been widely used in image scene classification and have achieved remarkable progress. However, because the extracted deep features can neither focus on the local semantics of an image nor capture its spatial morphological variations, directly using a CNN to generate discriminative feature representations is inadequate. To address this limitation, a global-local feature adaptive fusion (GLFAF) network is proposed. The GLFAF framework first extracts multi-scale and multi-level features with a designed CNN. Then, to exploit the complementary advantages of the multi-scale and multi-level features, a global feature aggregation module is designed to discover global attention features and to further learn the multiple deep dependencies of spatial scale variations among these global features. Meanwhile, a local feature aggregation module is designed to aggregate the multi-scale and multi-level features. Specifically, multi-level features at the same scale are fused based on channel attention, and the spatially fused features at different scales are then aggregated based on channel dependence. Moreover, spatial contextual attention is designed to refine spatial features across scales, and different Fisher vector layers are designed to learn semantic aggregation among spatial features. Subsequently, two different feature adaptive fusion modules are introduced to explore the complementary associations between the global and local aggregated features, yielding a comprehensive and discriminative image scene representation. Finally, extensive experiments on real scene datasets from three different domains show that the proposed GLFAF approach achieves more accurate scene classification than other state-of-the-art models.
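The channel-attention fusion step mentioned above (fusing multi-level features at the same scale) can be illustrated with a minimal sketch. The abstract does not give the exact GLFAF formulation, so the squeeze-style weighting below, where global average pooling produces per-channel descriptors and a softmax over levels produces fusion weights, is an illustrative assumption rather than the authors' method:

```python
import numpy as np

def channel_attention_fuse(feature_maps):
    """Fuse same-scale, multi-level feature maps (each C x H x W).

    Global average pooling yields a per-channel descriptor for each
    level, a softmax across levels turns these descriptors into
    per-channel fusion weights, and the weighted sum gives the fused
    map. NOTE: an illustrative sketch, not the exact GLFAF module.
    """
    stacked = np.stack(feature_maps)             # (L, C, H, W)
    desc = stacked.mean(axis=(2, 3))             # (L, C) channel descriptors
    # numerically stable softmax over the level axis
    w = np.exp(desc - desc.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)            # (L, C) fusion weights
    # broadcast weights over the spatial dimensions and sum the levels
    return (w[:, :, None, None] * stacked).sum(axis=0)  # (C, H, W)

# toy usage: two levels of 4-channel, 8x8 feature maps
rng = np.random.default_rng(0)
levels = [rng.standard_normal((4, 8, 8)) for _ in range(2)]
fused = channel_attention_fuse(levels)
print(fused.shape)  # (4, 8, 8)
```

Because the weights form a convex combination per channel, each fused value lies between the corresponding values of the input levels, which keeps the fusion stable regardless of how many levels are combined.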