Unifying Text, Tables, and Images for Multimodal Question Answering

Cited: 0
Authors
Luo, Haohao [1]
Shen, Ying [1]
Deng, Yang [2]
Affiliations
[1] Sun Yat Sen Univ, Sch Intelligent Syst Engn, Guangzhou, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal question answering (MMQA), which aims to derive the answer from multiple knowledge modalities (e.g., text, tables, and images), has received increasing attention due to its broad applications. Current approaches to MMQA often rely on single-modal or bimodal QA models, which limits their ability to integrate information effectively across all modalities and to leverage the power of pretrained language models. To address these limitations, we propose a novel framework called UniMMQA, which unifies three different input modalities into a text-to-text format by employing position-enhanced table linearization and diversified image captioning techniques. Additionally, we enhance cross-modal reasoning by incorporating a multimodal rationale generator, which produces textual descriptions of cross-modal relations for adaptation into the text-to-text generation process. Experimental results on three MMQA benchmark datasets show the superiority of UniMMQA in both supervised and unsupervised settings.
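As a purely illustrative sketch of the table-linearization idea described in the abstract (the paper's actual marker format, prompt layout, and backbone model are not specified here, so every token pattern, name, and example value below is an assumption), one might prefix each cell with its row and column index and concatenate the result with the question and a generated image caption into a single text-to-text input:

    # Hypothetical sketch: position-enhanced table linearization for a text-to-text QA model.
    # The "[row r, col c]" markers and the overall input layout are assumptions, not the paper's format.
    def linearize_table(header, rows):
        """Flatten a table into text, tagging every cell with its row/column position."""
        parts = [f"[col {c}] {name}" for c, name in enumerate(header)]
        for r, row in enumerate(rows):
            for c, cell in enumerate(row):
                parts.append(f"[row {r}, col {c}] {cell}")
        return " ".join(parts)

    if __name__ == "__main__":
        header = ["Country", "Gold medals"]
        rows = [["Norway", "16"], ["Germany", "12"]]
        question = "Which country won the most gold medals?"
        caption = "Athletes standing on a podium."  # stand-in for a generated image caption
        # All three modalities end up in one plain-text sequence for a seq2seq model.
        print(f"question: {question} table: {linearize_table(header, rows)} image: {caption}")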
Pages: 9355 - 9367
Page count: 13
Related papers
50 entries in total
  • [41] Towards Visual Question Answering on Pathology Images
    He, Xuehai
    Cai, Zhuo
    Wei, Wenlan
    Zhang, Yichen
    Mou, Luntian
    Xing, Eric
    Xie, Pengtao
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021, : 708 - 718
  • [42] Visual Question Answering with Textual Representations for Images
    Hirota, Yusuke
    Garcia, Noa
    Otani, Mayu
    Chu, Chenhui
    Nakashima, Yuta
    Taniguchi, Ittetsu
    Onoye, Takao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3147 - 3150
  • [43] A Question Answering System for Unstructured Table Images
    Xue, Wenyuan
    Cai, Siqi
    Wang, Wen
    Li, Qingyong
    Yu, Baosheng
    Zhan, Yibing
    Tao, Dacheng
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2783 - 2785
  • [44] A Multilingual Approach to Scene Text Visual Question Answering
    Brugues i Pujolras, Josep
    Gomez i Bigorda, Lluis
    Karatzas, Dimosthenis
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 65 - 79
  • [45] On the Impact of Semantic Roles on Text Comprehension for Question Answering
    Marginean, Anca
    Pricop, Gabriela
    MINING INTELLIGENCE AND KNOWLEDGE EXPLORATION, MIKE 2018, 2018, 11308 : 53 - 63
  • [46] IMPLICIT KNOWLEDGE, QUESTION ANSWERING, AND THE REPRESENTATION OF EXPOSITORY TEXT
    GRAESSER, AC
    GOODMAN, SM
    PSYCHOLOGY OF READING AND READING INSTRUCTION, 1985, : 109 - 171
  • [47] Experiments on applying a text summarization system for question answering
    Balage Filho, Pedro Paulo
    de Uzeda, Vinicius Rodrigues
    Pardo, Thiago Alexandre Salgueiro
    Nunes, Maria das Gracas Volpe
    EVALUATION OF MULTILINGUAL AND MULTI-MODAL INFORMATION RETRIEVAL, 2007, 4730 : 372 - +
  • [48] Using machine learning and text mining in question answering
    Juarez-Gonzalez, Antonio
    Tellez-Valero, Alberto
    Denicia-Carral, Claudia
    Montes-y-Gomez, Manuel
    Villasenor-Pineda, Luis
    EVALUATION OF MULTILINGUAL AND MULTI-MODAL INFORMATION RETRIEVAL, 2007, 4730 : 415 - 423
  • [50] Multi-level, multi-modal interactions for visual question answering over text in images
    Chen, Jincai
    Zhang, Sheng
    Zeng, Jiangfeng
    Zou, Fuhao
    Li, Yuan-Fang
    Liu, Tao
    Lu, Ping
    WORLD WIDE WEB, 2022, 25 (04) : 1607 - 1623