RingMo: A Remote Sensing Foundation Model With Masked Image Modeling

Cited by: 81
Authors
Sun, Xian [1, 2, 3]
Wang, Peijin [1, 2]
Lu, Wanxuan [1, 2]
Zhu, Zicong [1, 2, 3]
Lu, Xiaonan [1, 2, 3]
He, Qibin [1, 2, 3]
Li, Junxi [1, 2, 3]
Rong, Xuee [1, 2, 3]
Yang, Zhujun [1, 2, 3]
Chang, Hao [1, 2, 3]
He, Qinglin [4]
Yang, Guang [4]
Wang, Ruiping [5, 6]
Lu, Jiwen
Fu, Kun [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing 100190, Peoples R China
[2] Chinese Acad Sci, Aerosp Informat Res Inst, Key Lab Network Informat Syst Technol NIST, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 100190, Peoples R China
[4] Huawei, Ascend Comp Ecosyst Enablement Dept, Hangzhou 310000, Peoples R China
[5] Chinese Acad Sci, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[6] Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Foundation model; masked image modeling (MIM); pretraining; remote sensing (RS); self-supervised; Vision Transformer (ViT); CONVOLUTIONAL NEURAL-NETWORK; SEMANTIC SEGMENTATION; OBJECT DETECTION; SCENE CLASSIFICATION; REPRESENTATIONS; INVARIANT; ATTENTION;
DOI
10.1109/TGRS.2022.3194732
Chinese Library Classification (CLC) number
P3 [Geophysics]; P59 [Geochemistry]
Subject classification code
0708; 070902
Abstract
Deep learning approaches have contributed to the rapid development of remote sensing (RS) image interpretation. The most widely used training paradigm is to apply ImageNet-pretrained models to RS data for specific tasks. However, this paradigm suffers from issues such as the domain gap between natural and RS scenes and the poor generalization capacity of RS models. It therefore makes sense to develop a foundation model with a general RS feature representation. Since a large amount of unlabeled data is available, self-supervised methods hold more promise for RS than fully supervised ones. However, most current self-supervised methods use contrastive learning, whose performance is sensitive to data augmentation, additional information, and the selection of positive and negative pairs. In this article, we leverage the benefits of generative self-supervised learning (SSL) for RS images and propose an RS foundation model framework called RingMo, which consists of two parts. First, a large-scale dataset is constructed by collecting two million RS images from satellite and aerial platforms, covering multiple scenes and objects around the world. Second, we propose an RS foundation model training method designed for dense and small objects in complicated RS scenes. We show that the foundation model trained on our dataset with the RingMo method achieves state-of-the-art (SOTA) results on eight datasets across four downstream tasks, demonstrating the effectiveness of the proposed framework. Through in-depth exploration, we believe it is time for RS researchers to embrace generative SSL and leverage its general representation capabilities to speed up the development of RS applications.
Pages: 22