A Multimodal Unified Representation Learning Framework with Masked Image Modeling for Remote Sensing Images

Cited by: 0
Authors
Du, Dakuan [1]
Liu, Tianzhu [1]
Gu, Yanfeng [1]
Affiliations
[1] Harbin Institute of Technology, School of Electronics and Information Engineering, Harbin, 150001, China
Keywords
Cross modality - Feature extractor - Image modeling - Learning frameworks - Masked image modeling - Multi-modal - Multi-modal remote sensing data - Pre-training - Remote sensing data - Remote sensing images
DOI
10.1109/TGRS.2024.3494244
Abstract
The coordinated utilization of diverse types of satellite sensors provides a more comprehensive view of the Earth's surface. However, due to the significant heterogeneity across modalities and the scarcity of high-quality labels, most existing methods face bottlenecks in the underutilization of massive unlabeled multimodal satellite data, making it challenging to understand the scene comprehensively. To this end, we propose a multimodal unified representation learning framework (MURLF) based on masked image modeling (MIM) for remote sensing (RS) images, aiming to make better use of massive unlabeled multimodal RS data. MURLF leverages the consistency and complementarity relationships among modalities to extract both common and distinctive features, mitigating the challenges faced by encoders due to significant heterogeneity across various data types. In addition, MURLF uses multilevel masking independently across different modalities, using visual tokens both within the same modality and across modalities to jointly recover masked pixels as the pretext task, facilitating comprehensive cross-modal information interaction. Furthermore, we design a preselected sensor-specific feature extractor (PSFE) to exploit the heterogeneous characteristics of various data sources, thereby extracting discriminative features. By integrating the multistage PSFE with the ViT backbone, MURLF can naturally extract multimodal hierarchical representations for downstream tasks, fully preserving valuable information from each modality. The proposed MURLF is not restricted to multimodal inputs but also supports single-modal inputs during the fine-tuning stage, significantly broadening the framework's application. Extensive experiments across multiple tasks demonstrate the superiority of the proposed MURLF compared with several advanced multimodal models. The code will be released soon. © 2024 IEEE.
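The abstract states that masking is applied independently across modalities. As a rough illustration of that one idea only, the sketch below draws a separate random patch mask for each modality. It is not the authors' implementation: the modality names, grid size, mask ratios, and function names are all hypothetical, and the multilevel masking, PSFE, and ViT backbone are not modeled.

```python
import random


def random_patch_mask(grid_h, grid_w, ratio, rng):
    """Return a grid_h x grid_w grid of booleans; True marks a masked patch."""
    n = grid_h * grid_w
    # Pick the masked patch indices uniformly at random, without replacement.
    masked = set(rng.sample(range(n), round(ratio * n)))
    return [[(r * grid_w + c) in masked for c in range(grid_w)]
            for r in range(grid_h)]


def mask_modalities(grid_h, grid_w, ratios, seed=0):
    """Draw one mask per modality, each sampled independently of the others."""
    rng = random.Random(seed)
    return {name: random_patch_mask(grid_h, grid_w, ratio, rng)
            for name, ratio in ratios.items()}


# Hypothetical modalities and mask ratios, chosen only for illustration.
masks = mask_modalities(4, 4, {"optical": 0.75, "sar": 0.5})
for name, mask in masks.items():
    print(name, sum(map(sum, mask)), "of 16 patches masked")
```

Because each modality gets its own mask, a patch hidden in one modality may remain visible in another, which is what would let a reconstruction objective draw on tokens from both the same modality and the other modalities.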