A Multimodal Unified Representation Learning Framework with Masked Image Modeling for Remote Sensing Images

Cited by: 0
Authors
Du, Dakuan [1]
Liu, Tianzhu [1]
Gu, Yanfeng [1]
Affiliations
[1] Harbin Institute of Technology, School of Electronics and Information Engineering, Harbin, 150001, China
Keywords
Cross modality; Feature extractor; Image modeling; Learning frameworks; Masked image modeling; Multi-modal; Multi-modal remote sensing data; Pre-training; Remote sensing data; Remote sensing images
DOI
10.1109/TGRS.2024.3494244
Abstract
The coordinated utilization of diverse types of satellite sensors provides a more comprehensive view of the Earth's surface. However, due to the significant heterogeneity across modalities and the scarcity of high-quality labels, most existing methods face bottlenecks in the underutilization of massive unlabeled multimodal satellite data, making it challenging to understand the scene comprehensively. To this end, we propose a multimodal unified representation learning framework (MURLF) based on masked image modeling (MIM) for remote sensing (RS) images, aiming to make better use of massive unlabeled multimodal RS data. MURLF leverages the consistency and complementarity relationships among modalities to extract both common and distinctive features, mitigating the challenges faced by encoders due to significant heterogeneity across various data types. In addition, MURLF uses multilevel masking independently across different modalities, using visual tokens both within the same modality and across modalities to jointly recover masked pixels as the pretext task, facilitating comprehensive cross-modal information interaction. Furthermore, we design a preselected sensor-specific feature extractor (PSFE) to exploit the heterogeneous characteristics of various data sources, thereby extracting discriminative features. By integrating the multistage PSFE with the ViT backbone, MURLF can naturally extract multimodal hierarchical representations for downstream tasks, fully preserving valuable information from each modality. The proposed MURLF is not restricted to multimodal inputs but also supports single-modal inputs during the fine-tuning stage, significantly broadening the framework's application. Extensive experiments across multiple tasks demonstrate the superiority of the proposed MURLF compared with several advanced multimodal models. The code will be released soon. © 2024 IEEE.
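The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the multilevel masking scheme, the patch size, or the mask ratio, so the sketch below assumes plain single-level random patch masking (in the style of common MIM pipelines) applied independently to each modality. The function names `patchify`, `random_mask`, and `mask_modalities`, the modality names `optical` and `sar`, and all parameter values are hypothetical.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) array into (N, patch * patch * C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean array in which True marks a masked token."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask


def mask_modalities(images, patch=4, mask_ratio=0.75, seed=0):
    """Draw a separate random mask for each modality (assumed scheme, not from the paper)."""
    rng = np.random.default_rng(seed)
    out = {}
    for name, img in images.items():
        tokens = patchify(img, patch)
        mask = random_mask(len(tokens), mask_ratio, rng)
        out[name] = {
            "visible": tokens[~mask],  # tokens an encoder would see
            "target": tokens[mask],    # pixels a decoder would be asked to recover
            "mask": mask,
        }
    return out


# Toy co-registered inputs: a 3-band optical tile and a 1-band SAR tile.
data_rng = np.random.default_rng(1)
demo_images = {
    "optical": data_rng.random((16, 16, 3)),
    "sar": data_rng.random((16, 16, 1)),
}
result = mask_modalities(demo_images)
for name, parts in result.items():
    print(name, parts["visible"].shape, parts["target"].shape)
```

In a full pipeline, the visible tokens of every modality would be projected to a shared width and passed to the encoder together, so that a masked patch in one modality can be reconstructed from visible tokens of both the same and the other modality. That part is omitted here because the abstract gives no architectural details beyond the ViT backbone and the PSFE.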