Unifying Text, Tables, and Images for Multimodal Question Answering

Cited: 0
Authors
Luo, Haohao [1]
Shen, Ying [1]
Deng, Yang [2]
Affiliations
[1] Sun Yat Sen Univ, Sch Intelligent Syst Engn, Guangzhou, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal question answering (MMQA), which aims to derive the answer from multiple knowledge modalities (e.g., text, tables, and images), has received increasing attention due to its broad applications. Current approaches to MMQA often rely on single-modal or bimodal QA models, which limits their ability to integrate information effectively across all modalities and to leverage the power of pretrained language models. To address these limitations, we propose a novel framework called UniMMQA, which unifies three different input modalities into a text-to-text format by employing position-enhanced table linearization and diversified image captioning techniques. Additionally, we enhance cross-modal reasoning by incorporating a multimodal rationale generator, which produces textual descriptions of cross-modal relations for adaptation into the text-to-text generation process. Experimental results on three MMQA benchmark datasets show the superiority of UniMMQA in both supervised and unsupervised settings.
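As a purely illustrative sketch of the table-linearization idea described in the abstract (the paper's actual marker format, prompt layout, and backbone model are not specified here, so every token pattern, name, and example value below is an assumption), one might prefix each cell with its row and column index and concatenate the result with the question and a generated image caption into a single text-to-text input:

    # Hypothetical sketch: position-enhanced table linearization for a text-to-text QA model.
    # The "[row r, col c]" markers and the overall input layout are assumptions, not the paper's format.
    def linearize_table(header, rows):
        """Flatten a table into text, tagging every cell with its row/column position."""
        parts = [f"[col {c}] {name}" for c, name in enumerate(header)]
        for r, row in enumerate(rows):
            for c, cell in enumerate(row):
                parts.append(f"[row {r}, col {c}] {cell}")
        return " ".join(parts)

    if __name__ == "__main__":
        header = ["Country", "Gold medals"]
        rows = [["Norway", "16"], ["Germany", "12"]]
        question = "Which country won the most gold medals?"
        caption = "Athletes standing on a podium."  # stand-in for a generated image caption
        # All three modalities end up in one plain-text sequence for a seq2seq model.
        print(f"question: {question} table: {linearize_table(header, rows)} image: {caption}")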
Pages: 9355 - 9367
Page count: 13
Related papers
50 entries in total
  • [41] Towards Visual Question Answering on Pathology Images
    He, Xuehai
    Cai, Zhuo
    Wei, Wenlan
    Zhang, Yichen
    Mou, Luntian
    Xing, Eric
    Xie, Pengtao
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021, : 708 - 718
  • [42] Visual Question Answering with Textual Representations for Images
    Hirota, Yusuke
    Garcia, Noa
    Otani, Mayu
    Chu, Chenhui
    Nakashima, Yuta
    Taniguchi, Ittetsu
    Onoye, Takao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3147 - 3150
  • [43] A Question Answering System for Unstructured Table Images
    Xue, Wenyuan
    Cai, Siqi
    Wang, Wen
    Li, Qingyong
    Yu, Baosheng
    Zhan, Yibing
    Tao, Dacheng
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2783 - 2785
  • [44] A Multilingual Approach to Scene Text Visual Question Answering
    Brugues i Pujolras, Josep
    Gomez i Bigorda, Lluis
    Karatzas, Dimosthenis
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 65 - 79
  • [45] On the Impact of Semantic Roles on Text Comprehension for Question Answering
    Marginean, Anca
    Pricop, Gabriela
    MINING INTELLIGENCE AND KNOWLEDGE EXPLORATION, MIKE 2018, 2018, 11308 : 53 - 63
  • [46] IMPLICIT KNOWLEDGE, QUESTION ANSWERING, AND THE REPRESENTATION OF EXPOSITORY TEXT
    GRAESSER, AC
    GOODMAN, SM
    PSYCHOLOGY OF READING AND READING INSTRUCTION, 1985, : 109 - 171
  • [47] Experiments on applying a text summarization system for question answering
    Balage Filho, Pedro Paulo
    de Uzeda, Vinicius Rodrigues
    Pardo, Thiago Alexandre Salgueiro
    Nunes, Maria das Gracas Volpe
    EVALUATION OF MULTILINGUAL AND MULTI-MODAL INFORMATION RETRIEVAL, 2007, 4730 : 372 - +
  • [48] Using machine learning and text mining in question answering
    Juarez-Gonzalez, Antonio
    Tellez-Valero, Alberto
    Denicia-Carral, Claudia
    Montes-y-Gomez, Manuel
    Villasenor-Pineda, Luis
    EVALUATION OF MULTILINGUAL AND MULTI-MODAL INFORMATION RETRIEVAL, 2007, 4730 : 415 - 423
  • [50] Multi-level, multi-modal interactions for visual question answering over text in images
    Chen, Jincai
    Zhang, Sheng
    Zeng, Jiangfeng
    Zou, Fuhao
    Li, Yuan-Fang
    Liu, Tao
    Lu, Ping
    WORLD WIDE WEB, 2022, 25 (04) : 1607 - 1623