Multimodal Fusion Transformer for Remote Sensing Image Classification

Cited by: 59
Authors
Roy, Swalpa Kumar [1]
Deria, Ankur [2]
Hong, Danfeng [3]
Rasti, Behnood [4]
Plaza, Antonio [5]
Chanussot, Jocelyn [6]
Affiliations
[1] Jalpaiguri Govt Engn Coll, Dept Comp Sci & Engn, Jalpaiguri 735102, West Bengal, India
[2] Tech Univ Munich, Dept Informat, D-85748 Garching, Germany
[3] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing 100094, Peoples R China
[4] Helmholtz Inst Freiberg Resource Technol, Helmholtz Zentrum Dresden Rossendorf, D-09599 Freiberg, Germany
[5] Univ Extremadura, Hyperspectral Comp Lab, Dept Technol Comp & Commun, Escuela Politecn, Caceres 10003, Spain
[6] Univ Grenoble Alpes, CNRS, Grenoble Inst Technol Grenoble INP, GIPSA Lab, F-38000 Grenoble, France
Keywords
Convolutional neural networks (CNNs); multihead cross-patch attention (mCrossPA); remote sensing (RS); vision transformer (ViT); LAND-COVER CLASSIFICATION; CONVOLUTIONAL NEURAL-NETWORKS; MULTISOURCE; PROFILES
DOI
10.1109/TGRS.2023.3286826
Chinese Library Classification number
P3 [Geophysics]; P59 [Geochemistry]
Discipline classification code
0708; 070902
Abstract
Vision transformers (ViTs) have been trending in image classification tasks due to their promising performance when compared with convolutional neural networks (CNNs). As a result, many researchers have tried to incorporate ViTs in hyperspectral image (HSI) classification tasks. To achieve satisfactory performance, close to that of CNNs, transformers need fewer parameters. ViTs and other similar transformers use an external classification (CLS) token, which is randomly initialized and often fails to generalize well, whereas other sources of multimodal datasets, such as light detection and ranging (LiDAR), offer the potential to improve these models by means of a CLS. In this article, we introduce a new multimodal fusion transformer (MFT) network, which comprises a multihead cross-patch attention (mCrossPA) for HSI land-cover classification. Our mCrossPA utilizes other sources of complementary information in addition to the HSI in the transformer encoder to achieve better generalization. The concept of tokenization is used to generate CLS and HSI patch tokens, helping to learn a distinctive representation in a reduced and hierarchical feature space. Extensive experiments are carried out on widely used benchmark datasets, i.e., the University of Houston (UH), Trento, University of Southern Mississippi Gulfpark (MUUFL), and Augsburg. We compare the results of the proposed MFT model with other state-of-the-art transformers, classical CNNs, and conventional classifier models. The superior performance achieved by the proposed model is due to the use of mCrossPA. The source code will be made available publicly at https://github.com/AnkurDeria/MFT.
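To make the cross-patch attention idea in the abstract more concrete, the following is a minimal sketch and not the authors' implementation (their code is at the repository linked above). It assumes that the CLS token is derived from the complementary modality (e.g., LiDAR), that this single CLS token acts as the query, and that the HSI patch tokens act as keys and values in standard scaled dot-product attention with a residual connection. The function name `cross_patch_attention`, the weight matrices `w_q`, `w_k`, `w_v`, and all sizes are hypothetical and chosen only for illustration. Plain Python lists are used to keep the example dependency-free.

```python
import math
import random


def matvec(weight, vector):
    """Multiply a (rows x cols) matrix, stored as nested lists, by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_patch_attention(cls_token, patch_tokens, w_q, w_k, w_v, num_heads):
    """Hypothetical sketch: one CLS query attends over HSI patch tokens.

    ASSUMPTION: the CLS token comes from the other modality (e.g., LiDAR)
    and is the only query; patches supply keys and values.
    Returns the updated CLS token and the per-head attention weights.
    """
    dim = len(cls_token)
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads

    query = matvec(w_q, cls_token)
    keys = [matvec(w_k, patch) for patch in patch_tokens]
    values = [matvec(w_v, patch) for patch in patch_tokens]

    attended = []
    weights_per_head = []
    for head in range(num_heads):
        lo, hi = head * head_dim, (head + 1) * head_dim
        # Scaled dot product between the CLS query and each patch key.
        scores = [
            sum(q * k for q, k in zip(query[lo:hi], key[lo:hi])) / math.sqrt(head_dim)
            for key in keys
        ]
        weights = softmax(scores)
        weights_per_head.append(weights)
        # Weighted sum of patch values for this head's slice of the features.
        for j in range(lo, hi):
            attended.append(sum(w * value[j] for w, value in zip(weights, values)))

    # Residual connection: the CLS token is updated with HSI information.
    fused = [c + a for c, a in zip(cls_token, attended)]
    return fused, weights_per_head


# Toy example with random inputs (illustrative sizes only).
random.seed(0)
dim, num_heads, num_patches = 8, 2, 5


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


cls_token = [random.gauss(0.0, 1.0) for _ in range(dim)]
patch_tokens = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_patches)]
w_q, w_k, w_v = rand_matrix(dim, dim), rand_matrix(dim, dim), rand_matrix(dim, dim)

fused_cls, attn = cross_patch_attention(cls_token, patch_tokens, w_q, w_k, w_v, num_heads)
print(len(fused_cls), [len(w) for w in attn])
```

Because only the CLS token is used as a query, the attention cost in this sketch grows linearly with the number of patches rather than quadratically, which is one plausible reason such a design would keep the parameter and compute budget small.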
Pages: 20