Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing

Cited: 5
Authors:
Siebert, Tim [1]
Clasen, Kai Norman [1]
Ravanbakhsh, Mahdyar [1]
Demir, Begüm [1]
Affiliation:
[1] Tech Univ Berlin, Einsteinufer 17, D-10587 Berlin, Germany
Funding:
European Research Council
Keywords:
Multi-modal transformer; visual question answering; deep learning; remote sensing
DOI:
10.1117/12.2636276
CLC Classification:
TP18 [Artificial Intelligence Theory]
Subject Classification Codes:
081104; 0812; 0835; 1405
Abstract:
With the new generation of satellite technologies, the archives of remote sensing (RS) images are growing rapidly. To make the intrinsic information of each RS image easily accessible, visual question answering (VQA) has been introduced in RS. VQA allows a user to formulate a free-form question concerning the content of RS images to extract generic information. It has been shown that the fusion of the input modalities (i.e., image and text) is crucial for the performance of VQA systems. Most of the current fusion approaches use modality-specific representations in their fusion modules instead of joint representation learning. However, to discover the underlying relation between the image and question modalities, the model is required to learn a joint representation instead of simply combining (e.g., concatenating, adding, or multiplying) the modality-specific representations. We propose a multi-modal transformer-based architecture to overcome this issue. Our proposed architecture consists of three main modules: i) the feature extraction module for extracting the modality-specific features; ii) the fusion module, which leverages a user-defined number of multi-modal transformer layers of the VisualBERT model (VB); and iii) the classification module to obtain the answer. In contrast to recently proposed transformer-based models in RS VQA, the presented architecture (called VBFusion) is not limited to specific questions, e.g., questions concerning pre-defined objects. Experimental results obtained on the RSVQAxBEN and RSVQA-LR datasets (which are made up of RGB bands of Sentinel-2 images) demonstrate the effectiveness of VBFusion for VQA tasks in RS. To analyze the importance of using other spectral bands for the description of the complex content of RS images in the framework of VQA, we extend the RSVQAxBEN dataset to include all the spectral bands of Sentinel-2 images with 10 m and 20 m spatial resolution. Experimental results show the importance of utilizing these bands to characterize the land-use/land-cover classes present in the images in the framework of VQA. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/multimodal-fusion-transformer-for-vqa-in-rs.
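The three-module pipeline described in the abstract can be summarized with a short sketch. The following Python/PyTorch snippet is a hypothetical illustration, not the authors' released code (see the linked repository for that): a ResNet backbone and a token embedding stand in for the feature extraction module, a stack of standard transformer encoder layers stands in for the VisualBERT fusion layers, and a linear head forms the classification module. All names, dimensions, and hyperparameters are illustrative assumptions.

# Hypothetical sketch of the VBFusion pipeline (not the authors' code):
# i) modality-specific feature extraction, ii) joint multi-modal
# transformer fusion, iii) classification over candidate answers.
import torch
import torch.nn as nn
import torchvision.models as models

class VBFusionSketch(nn.Module):
    def __init__(self, vocab_size, num_answers, dim=768, num_fusion_layers=4):
        super().__init__()
        # i) Feature extraction: a ResNet-18 backbone for the image and an
        # embedding table for question tokens (both are stand-ins).
        backbone = models.resnet18(weights=None)
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.image_proj = nn.Linear(512, dim)
        self.text_embed = nn.Embedding(vocab_size, dim)
        # ii) Fusion: generic transformer encoder layers self-attend over the
        # concatenated token sequence, standing in for the VisualBERT layers.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_fusion_layers)
        # iii) Classification head over a fixed answer vocabulary.
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image, question_ids):
        # Image -> grid of visual tokens: (B, C, H, W) -> (B, H*W, dim).
        feat = self.image_encoder(image)
        visual_tokens = self.image_proj(feat.flatten(2).transpose(1, 2))
        text_tokens = self.text_embed(question_ids)  # (B, L, dim)
        # Joint representation: attention runs over both modalities at once,
        # rather than combining two separately pooled modality vectors.
        fused = self.fusion(torch.cat([text_tokens, visual_tokens], dim=1))
        return self.classifier(fused.mean(dim=1))  # answer logits

# Example forward pass on RGB input (batch of 2, 120x120 pixels).
model = VBFusionSketch(vocab_size=30000, num_answers=9)
logits = model(torch.randn(2, 3, 120, 120), torch.randint(0, 30000, (2, 16)))

For the multi-spectral extension of RSVQAxBEN mentioned in the abstract, the backbone's first convolution would have to accept all Sentinel-2 bands at 10 m and 20 m spatial resolution rather than three RGB channels; everything downstream of the feature extraction module would be unchanged.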
Pages: 9
Related Papers (50 total):
  • [21] Multi-level, multi-modal interactions for visual question answering over text in images
    Chen, Jincai
    Zhang, Sheng
    Zeng, Jiangfeng
    Zou, Fuhao
    Li, Yuan-Fang
    Liu, Tao
    Lu, Ping
    [J]. World Wide Web, 2022, 25 (04) : 1607 - 1623
  • [24] A Survey of Multi-modal Question Answering Systems for Robotics
    Liu, Xiaomeng
    Long, Fei
    [J]. 2017 2ND INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM), 2017, : 189 - 194
  • [25] Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism
    Xia, Qihao
    Yu, Chao
    Hou, Yinong
    Peng, Pingping
    Zheng, Zhengqi
    Chen, Wen
    [J]. ELECTRONICS, 2022, 11 (11)
  • [26] Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering
    Qian, Zi
    Wang, Xin
    Duan, Xuguang
    Qin, Pengda
    Li, Yuhong
    Zhu, Wenwu
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2941 - 2950
  • [27] Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph
    Jiang, Lei
    Meng, Zuqiang
    [J]. ELECTRONICS, 2023, 12 (06)
  • [28] MFTransNet: A Multi-Modal Fusion with CNN-Transformer Network for Semantic Segmentation of HSR Remote Sensing Images
    He, Shumeng
    Yang, Houqun
    Zhang, Xiaoying
    Li, Xuanyu
    [J]. MATHEMATICS, 2023, 11 (03)
  • [29] Contrasting Dual Transformer Architectures for Multi-Modal Remote Sensing Image Retrieval
    Al Rahhal, Mohamad M.
    Bencherif, Mohamed Abdelkader
    Bazi, Yakoub
    Alharbi, Abdullah
    Mekhalfi, Mohamed Lamine
[J]. APPLIED SCIENCES-BASEL, 2023, 13 (01)
  • [30] VISUAL QUESTION ANSWERING FROM REMOTE SENSING IMAGES
    Lobry, Sylvain
    Murray, Jesse
    Marcos, Diego
    Tuia, Devis
    [J]. 2019 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2019), 2019, : 4951 - 4954