Scene classification of remote sensing images aims to assign a meaningful label to a given image. In recent years, Convolutional Neural Network (CNN)-based methods have made a breakthrough and substantially outperformed traditional methods in scene classification of remote sensing images. However, obtaining features at different scales in remote sensing images is difficult because of the fixed receptive field of CNNs. This limitation seriously affects the performance of CNNs in scene classification of remote sensing images. This study proposes a method that learns the optimal scale for each scene image instance in a weakly supervised manner. A Weakly Supervised Scale Adaptive Data Augmentation Network (WSADAN) is proposed to capture feature information at different scales of remote sensing scenes, and a scale generation module (SGM) and a scale fusion module (SFM) are designed to improve robustness. The scale generation module learns the optimal scale parameters from the CNN features of the original image. The scale fusion module filters the CNN features of the images at the original and optimal scales to remove noise and then deeply fuses them to exploit the correlation between features at different scales. The deeply fused multi-scale features are input into a fully connected layer to predict the categories of scene images. The effectiveness of the scale generation and scale fusion modules is verified by ablation experiments. Compared with the baseline, WSADAN-SGM improves accuracy by 0.94% and 0.89% at the 20% and 50% training ratios on the RSSCN7 dataset, by 1.27% and 0.87% at the 20% and 50% training ratios on the AID dataset, and by 1.09% and 0.71% at the 10% and 20% training ratios on the NWPU dataset, respectively.
Compared with WSADAN-SGM, WSADAN-SGM+SFM improves accuracy by 1.65% and 1.32% at the 20% and 50% training ratios on the RSSCN7 dataset, by 1.65% and 1.26% at the 20% and 50% training ratios on the AID dataset, and by 1.75% and 1.42% at the 10% and 20% training ratios on the NWPU dataset, respectively. In the scene scale change analysis experiment, the classification accuracy of our method is higher than that of the baseline at every image scale tested, which indicates that our method learns image scale information and has strong scale adaptation ability. Three remote sensing scene classification datasets, namely RSSCN7, AID, and NWPU, are used in the experiments. On the RSSCN7 dataset, the overall accuracies of WSADAN-VGG16 are 91.65% and 94.07% at training ratios of 20% and 50%; the corresponding accuracies of WSADAN-ResNet50 are 92.69% and 94.82%. On the AID dataset, the overall accuracies of WSADAN-VGG16 are 92.78% and 95.18% at training ratios of 20% and 50%; the corresponding accuracies of WSADAN-ResNet50 are 93.73% and 95.88%. On the NWPU dataset, the overall accuracies of WSADAN-VGG16 are 87.01% and 90.44% at training ratios of 10% and 20%; the corresponding accuracies of WSADAN-ResNet50 are 90.71% and 92.63%. The proposed method can learn CNN features over a wider range of scales without manual multi-scale selection for different datasets. It performs better than traditional CNNs, especially for scene categories containing objects with large scale variations. © 2023 Science Press. All rights reserved.
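The two ablation steps reported above can be combined into a single gain over the baseline. The short Python sketch below does only that arithmetic. It assumes that the quoted improvements are absolute percentage points of overall accuracy and that both steps were measured in the same setting (same backbone and data split), which the abstract does not state explicitly. The dictionary and function names are illustrative and not taken from the paper.

```python
# Ablation gains quoted in the abstract, keyed by (dataset, training ratio in %).
# Assumption: values are absolute percentage points of overall accuracy.

# Gain of the scale generation module (SGM) variant over the baseline.
SGM_OVER_BASELINE = {
    ("RSSCN7", 20): 0.94, ("RSSCN7", 50): 0.89,
    ("AID", 20): 1.27, ("AID", 50): 0.87,
    ("NWPU", 10): 1.09, ("NWPU", 20): 0.71,
}

# Further gain from adding the scale fusion module (SFM) on top of SGM.
SFM_OVER_SGM = {
    ("RSSCN7", 20): 1.65, ("RSSCN7", 50): 1.32,
    ("AID", 20): 1.65, ("AID", 50): 1.26,
    ("NWPU", 10): 1.75, ("NWPU", 20): 1.42,
}


def cumulative_gain(dataset, ratio):
    """Return the combined SGM + SFM gain over the baseline, in points.

    Because each step is a difference of accuracies, the two differences
    add up to the difference between the full model and the baseline.
    """
    key = (dataset, ratio)
    return round(SGM_OVER_BASELINE[key] + SFM_OVER_SGM[key], 2)


if __name__ == "__main__":
    for dataset, ratio in SGM_OVER_BASELINE:
        gain = cumulative_gain(dataset, ratio)
        print(f"{dataset} @ {ratio}% training: +{gain:.2f} points over baseline")
```

Under these assumptions, the combined gain on RSSCN7 at the 20% training ratio is 0.94 + 1.65 = 2.59 points, and the other settings follow in the same way.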