Semantic segmentation of very high-resolution (VHR) remote sensing images is a fundamental task for many applications. However, the large variation in object scales in VHR images poses a challenge to accurate semantic segmentation. Existing semantic segmentation networks can analyze an input image at up to four resizing scales, which may be insufficient given the diversity of object scales. Therefore, multiscale (MS) test-time data augmentation, which makes equal use of the segmentation results obtained at the different resizing scales, is often used in practice to obtain more accurate segmentation results. However, this study found that different classes of objects have their own preferred resizing scales for more accurate semantic segmentation. Based on this behavior, a stacking-based semantic segmentation (SBSS) framework is proposed that learns this behavior to improve the segmentation results; it contains a learnable error correction module (ECM) for fusing segmentation results and an error correction scheme (ECS) for controlling computational complexity. Two ECSs, i.e., ECS-MS and ECS-single-scale (ECS-SS), are proposed and investigated in this study. The floating-point operations (FLOPs) required by ECS-MS and ECS-SS are similar to those of the commonly used MS test and SS test, respectively. Extensive experiments on four datasets (i.e., Cityscapes, UAVid, LoveDA, and Potsdam) show that SBSS is an effective and flexible framework: it achieves higher accuracy than the MS test when using ECS-MS, and accuracy similar to the SS test with a quarter of the memory footprint when using ECS-SS.
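For context, the equal-weight MS test-time augmentation baseline mentioned above can be sketched as follows: the input is resized to several scales, each prediction is resized back to the original resolution, and the per-scale logits are averaged with equal weight. This is a minimal illustrative sketch only; the scale list, model interface, and function name are assumptions rather than details from this paper, and SBSS itself differs in that its ECM learns how to fuse the per-scale results instead of averaging them.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_inference(model, image, scales=(0.5, 0.75, 1.0, 1.25)):
    """Equal-weight multiscale (MS) test-time augmentation (illustrative).

    model:  callable mapping an (N, C, H, W) tensor to (N, K, H', W') logits
    image:  float tensor of shape (N, C, H, W)
    returns averaged class logits of shape (N, K, H, W)
    """
    n, c, h, w = image.shape
    fused = None
    for s in scales:
        # Resize the input to the current scale.
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        logits = model(scaled)
        # Resize the prediction back to the original resolution.
        logits = F.interpolate(logits, size=(h, w), mode="bilinear",
                               align_corners=False)
        fused = logits if fused is None else fused + logits
    # Equal use of all resizing scales, as in the standard MS test.
    return fused / len(scales)
```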