With the development of very-high-resolution satellite image acquisition technology, remote sensing scene classification has become an important and challenging task. To tackle this task, we propose a hybrid architecture, the aggregated deep Fisher feature (ADFF), which exploits both the rich semantic information of deep convolutional features and the high robustness of unsupervised encoding. Unlike previous methods, we first explore the optimal encoding layer in the pretrained CNN model, which naturally fuses local and global image information in a novel way and further enhances the ability to acquire semantic information. ADFF can learn more suitable internal features from remote sensing data, boosting the final performance. We evaluate our algorithm on several public datasets, and the results show that our approach achieves superior performance compared with state-of-the-art methods. The proposed ADFF obtains average classification accuracies of 98.81%, 95.21%, 86.01%, and 88.79% on the UC Merced Land-Use, RSSCN7, NWPU-RESISC45 (10% for training), and NWPU-RESISC45 (20% for training) datasets, respectively.
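To make the idea of encoding convolutional features with a Fisher representation concrete, the following is a minimal sketch, not the ADFF implementation. It assumes the standard improved Fisher vector (gradients with respect to the means and variances of a diagonal-covariance GMM, followed by power and L2 normalization), and it treats each spatial position of a convolutional feature map as one local descriptor. The function name `fisher_vector`, the feature-map shape, and the randomly drawn GMM parameters are hypothetical placeholders; in practice the GMM would be fitted on descriptors from training images.

```python
import numpy as np


def fisher_vector(X, weights, means, variances):
    """Improved Fisher vector of descriptors X (N, D) under a diagonal GMM.

    weights: (K,), means: (K, D), variances: (K, D).
    Returns a vector of length 2 * K * D.
    """
    X = np.asarray(X, dtype=np.float64)
    N = X.shape[0]
    diff = X[:, None, :] - means[None, :, :]            # (N, K, D)

    # Log-density of each descriptor under each Gaussian component.
    log_prob = -0.5 * (
        np.sum(diff ** 2 / variances[None], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    )
    # Posterior responsibilities, computed stably in the log domain.
    log_post = np.log(weights)[None, :] + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)           # (N, K)

    # Gradients with respect to the means and the standard deviations.
    z = diff / np.sqrt(variances)[None]                 # whitened residuals
    g_mu = np.einsum("nk,nkd->kd", gamma, z)
    g_mu /= N * np.sqrt(weights)[:, None]
    g_sigma = np.einsum("nk,nkd->kd", gamma, z ** 2 - 1.0)
    g_sigma /= N * np.sqrt(2.0 * weights)[:, None]

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))              # power normalization
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv                # L2 normalization


# Hypothetical example: a feature map with C=8 channels on a 7x7 grid,
# standing in for the output of one CNN layer (assumed shapes).
rng = np.random.default_rng(0)
C, H, W, K = 8, 7, 7, 4
feature_map = rng.normal(size=(C, H, W))
descriptors = feature_map.reshape(C, H * W).T           # (49, 8) local descriptors

# Placeholder GMM parameters; a real pipeline would fit these on training data.
gmm_weights = np.full(K, 1.0 / K)
gmm_means = rng.normal(size=(K, C))
gmm_variances = np.ones((K, C))

fv = fisher_vector(descriptors, gmm_weights, gmm_means, gmm_variances)
print(fv.shape)
```

The resulting fixed-length vector could then be passed to any standard classifier; which CNN layer supplies the descriptors is the design choice the abstract refers to as the encoding layer.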