Self-supervised contrastive learning (SSCL) has achieved significant milestones in remote sensing image (RSI) understanding. Its essence lies in designing an unsupervised instance discrimination pretext task that extracts, from a large number of unlabeled images, image features beneficial for downstream tasks. However, existing instance discrimination-based SSCL suffers from two limitations when applied to RSI semantic segmentation: 1) the positive sample confounding issue (SCI): SSCL treats different augmentations of the same RSI as positive samples, but the richness, complexity, and imbalance of ground objects in RSIs mean that, while pulling positive samples closer, the model in fact also pulls a variety of different ground objects closer, which confuses the features of different ground objects; and 2) feature adaptation bias: SSCL treats RSI patches containing various ground objects as individual instances for discrimination and thus obtains instance-level features, which are not fully adapted to pixel-level or object-level semantic segmentation. To address these limitations, we construct samples containing single ground objects to alleviate the positive SCI and let the model obtain object-level features from the contrast between single ground objects. Meanwhile, we observe that discrimination information can be mapped to specific regions of an RSI through the gradient of the unsupervised contrastive loss, and these regions tend to contain single ground objects. Based on this, we propose contrastive learning with a gradient-guided sampling strategy (GraSS) for RSI semantic segmentation. GraSS consists of two stages: 1) an instance discrimination warm-up stage that provides initial discrimination information to the contrastive loss gradients, and 2) a gradient-guided sampling contrastive training stage that uses this discrimination information to adaptively construct samples containing more singular ground objects.
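The core mechanism above — mapping discrimination information back to image regions via the gradient of the contrastive loss, then sampling around the most salient region — can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `info_nce` loss is a generic InfoNCE formulation, and the helper names (`gradient_saliency`, `salient_crop`) and the fixed square crop are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """Generic InfoNCE loss between two batches of embeddings (positives on the diagonal)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))             # i-th row's positive is column i
    return F.cross_entropy(logits, labels)

def gradient_saliency(encoder, x1, x2):
    """Map discrimination information to pixels: channel-summed absolute
    gradient of the contrastive loss w.r.t. the input image, shape (B, H, W)."""
    x1 = x1.clone().requires_grad_(True)
    loss = info_nce(encoder(x1), encoder(x2))
    grad, = torch.autograd.grad(loss, x1)
    return grad.abs().sum(dim=1)

def salient_crop(saliency, crop=32):
    """Return one crop box per image, centered on the most salient pixel
    (clamped to image bounds) — a stand-in for gradient-guided sampling."""
    B, H, W = saliency.shape
    flat = saliency.view(B, -1).argmax(dim=1)
    boxes = []
    for idx in flat.tolist():
        y, x = idx // W, idx % W
        y0 = min(max(y - crop // 2, 0), H - crop)
        x0 = min(max(x - crop // 2, 0), W - crop)
        boxes.append((y0, x0, y0 + crop, x0 + crop))
    return boxes
```

In this sketch, the warm-up stage would train `encoder` with `info_nce` alone; the second stage would then feed crops from `salient_crop` back into contrastive training, so that samples increasingly cover single ground objects.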
Experimental results on three open datasets demonstrate that GraSS effectively enhances the performance of SSCL in high-resolution RSI semantic segmentation. Compared with eight baseline methods drawn from six different types of SSCL, GraSS achieves an average improvement of 1.57% and a maximum improvement of 3.58% in mean intersection over union (mIoU). In addition, we find that the gradients of the unsupervised contrastive loss contain rich feature information, which suggests that gradient information could be exploited more extensively during model training to unlock additional model capability. The source code is available at https://github.com/GeoX-Lab/GraSS.