Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing

Cited: 5
Authors:
Siebert, Tim [1]
Clasen, Kai Norman [1]
Ravanbakhsh, Mahdyar [1]
Demir, Begüm [1]
Affiliation:
[1] Tech Univ Berlin, Einsteinufer 17, D-10587 Berlin, Germany
Funding:
European Research Council
Keywords:
Multi-modal transformer; visual question answering; deep learning; remote sensing
DOI:
10.1117/12.2636276
CLC Classification:
TP18 [Artificial Intelligence Theory]
Subject Classification Codes:
081104; 0812; 0835; 1405
Abstract:
With the new generation of satellite technologies, the archives of remote sensing (RS) images are growing rapidly. To make the intrinsic information of each RS image easily accessible, visual question answering (VQA) has been introduced in RS. VQA allows a user to formulate a free-form question concerning the content of RS images to extract generic information. It has been shown that the fusion of the input modalities (i.e., image and text) is crucial for the performance of VQA systems. Most of the current fusion approaches use modality-specific representations in their fusion modules instead of joint representation learning. However, to discover the underlying relation between the image and question modalities, the model is required to learn a joint representation instead of simply combining (e.g., concatenating, adding, or multiplying) the modality-specific representations. We propose a multi-modal transformer-based architecture to overcome this issue. Our proposed architecture consists of three main modules: i) the feature extraction module for extracting the modality-specific features; ii) the fusion module, which leverages a user-defined number of multi-modal transformer layers of the VisualBERT model (VB); and iii) the classification module to obtain the answer. In contrast to recently proposed transformer-based models in RS VQA, the presented architecture (called VBFusion) is not limited to specific questions, e.g., questions concerning pre-defined objects. Experimental results obtained on the RSVQAxBEN and RSVQA-LR datasets (which are made up of RGB bands of Sentinel-2 images) demonstrate the effectiveness of VBFusion for VQA tasks in RS. To analyze the importance of using other spectral bands for the description of the complex content of RS images in the framework of VQA, we extend the RSVQAxBEN dataset to include all the spectral bands of Sentinel-2 images with 10 m and 20 m spatial resolution. Experimental results show the importance of utilizing these bands to characterize the land-use/land-cover classes present in the images in the framework of VQA. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/multimodal-fusion-transformer-for-vqa-in-rs.
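The three-module pipeline described in the abstract can be summarized with a short sketch. The following Python/PyTorch snippet is a hypothetical illustration, not the authors' released code (see the linked repository for that): a ResNet backbone and a token embedding stand in for the feature extraction module, a stack of standard transformer encoder layers stands in for the VisualBERT fusion layers, and a linear head forms the classification module. All names, dimensions, and hyperparameters are illustrative assumptions.

# Hypothetical sketch of the VBFusion pipeline (not the authors' code):
# i) modality-specific feature extraction, ii) joint multi-modal
# transformer fusion, iii) classification over candidate answers.
import torch
import torch.nn as nn
import torchvision.models as models

class VBFusionSketch(nn.Module):
    def __init__(self, vocab_size, num_answers, dim=768, num_fusion_layers=4):
        super().__init__()
        # i) Feature extraction: a ResNet-18 backbone for the image and an
        # embedding table for question tokens (both are stand-ins).
        backbone = models.resnet18(weights=None)
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.image_proj = nn.Linear(512, dim)
        self.text_embed = nn.Embedding(vocab_size, dim)
        # ii) Fusion: generic transformer encoder layers self-attend over the
        # concatenated token sequence, standing in for the VisualBERT layers.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_fusion_layers)
        # iii) Classification head over a fixed answer vocabulary.
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image, question_ids):
        # Image -> grid of visual tokens: (B, C, H, W) -> (B, H*W, dim).
        feat = self.image_encoder(image)
        visual_tokens = self.image_proj(feat.flatten(2).transpose(1, 2))
        text_tokens = self.text_embed(question_ids)  # (B, L, dim)
        # Joint representation: attention runs over both modalities at once,
        # rather than combining two separately pooled modality vectors.
        fused = self.fusion(torch.cat([text_tokens, visual_tokens], dim=1))
        return self.classifier(fused.mean(dim=1))  # answer logits

# Example forward pass on RGB input (batch of 2, 120x120 pixels).
model = VBFusionSketch(vocab_size=30000, num_answers=9)
logits = model(torch.randn(2, 3, 120, 120), torch.randint(0, 30000, (2, 16)))

For the multi-spectral extension of RSVQAxBEN mentioned in the abstract, the backbone's first convolution would have to accept all Sentinel-2 bands at 10 m and 20 m spatial resolution rather than three RGB channels; everything downstream of the feature extraction module would be unchanged.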
Pages: 9
Related Papers (50 total):
  • [21] Multi-level, multi-modal interactions for visual question answering over text in images
    Chen, Jincai
    Zhang, Sheng
    Zeng, Jiangfeng
    Zou, Fuhao
    Li, Yuan-Fang
    Liu, Tao
    Lu, Ping
    [J]. World Wide Web, 2022, 25 (04) : 1607 - 1623
  • [24] A Survey of Multi-modal Question Answering Systems for Robotics
    Liu, Xiaomeng
    Long, Fei
    [J]. 2017 2ND INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM), 2017, : 189 - 194
  • [25] Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism
    Xia, Qihao
    Yu, Chao
    Hou, Yinong
    Peng, Pingping
    Zheng, Zhengqi
    Chen, Wen
    [J]. ELECTRONICS, 2022, 11 (11)
  • [26] Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering
    Qian, Zi
    Wang, Xin
    Duan, Xuguang
    Qin, Pengda
    Li, Yuhong
    Zhu, Wenwu
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2941 - 2950
  • [27] Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph
    Jiang, Lei
    Meng, Zuqiang
    [J]. ELECTRONICS, 2023, 12 (06)
  • [28] MFTransNet: A Multi-Modal Fusion with CNN-Transformer Network for Semantic Segmentation of HSR Remote Sensing Images
    He, Shumeng
    Yang, Houqun
    Zhang, Xiaoying
    Li, Xuanyu
    [J]. MATHEMATICS, 2023, 11 (03)
  • [29] Contrasting Dual Transformer Architectures for Multi-Modal Remote Sensing Image Retrieval
    Al Rahhal, Mohamad M.
    Bencherif, Mohamed Abdelkader
    Bazi, Yakoub
    Alharbi, Abdullah
    Mekhalfi, Mohamed Lamine
[J]. APPLIED SCIENCES-BASEL, 2023, 13 (01)
  • [30] VISUAL QUESTION ANSWERING FROM REMOTE SENSING IMAGES
    Lobry, Sylvain
    Murray, Jesse
    Marcos, Diego
    Tuia, Devis
    [J]. 2019 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2019), 2019, : 4951 - 4954