MULTI-MODAL LEARNING WITH TEXT MERGING FOR TEXTVQA

Cited by: 0
Authors
Xu, Changsheng [1 ]
Xu, Zhenlong [1 ]
He, Yifan [1 ]
Zhou, Shuigeng [1 ]
Guan, Jihong [2 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai 200438, Peoples R China
[2] Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China
Keywords
Visual text understanding; Text visual question answering; Multi-modal learning; Text merging
DOI
10.1109/ICASSP43922.2022.9746969
CLC Classification Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Text visual question answering (TextVQA) is an important visual text understanding task that requires understanding the text produced by a text recognition module and providing correct answers to specific questions. Recent TextVQA works have tried to combine text recognition with multi-modal learning. However, due to the lack of effective preprocessing of the text recognition output, existing approaches suffer from serious loss of contextual information, which leads to unsatisfactory performance. In this work, we propose a Multi-Modal Learning framework with Text Merging (MML&TM for short) for TextVQA. We develop a text merging (TM) algorithm that effectively merges the word-level text obtained from the text recognition module into line-level and paragraph-level texts, enriching the semantic context that is crucial to visual text understanding. The TM module can be easily incorporated into the multi-modal learning framework to generate more comprehensive answers for TextVQA. We evaluate our method on the public STVQA dataset. Experimental results show that the TM algorithm recovers complete semantic information, which in turn helps MML&TM generate better answers for TextVQA.
Pages: 1985-1989
Page count: 5
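
The abstract describes the TM algorithm only at a high level. As a rough illustration, the Python sketch below shows how word-level OCR boxes might be merged into line-level and paragraph-level texts using bounding-box geometry. The data structure, thresholds, and names (Box, merge_words_to_lines, merge_lines_to_paragraphs) are assumptions made for illustration, not the paper's actual algorithm.

from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    # A recognized text fragment with its bounding box (hypothetical structure).
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

def merge_words_to_lines(words: List[Box], overlap_ratio: float = 0.5) -> List[Box]:
    # Group word boxes whose vertical extents overlap into line-level boxes.
    lines: List[List[Box]] = []
    for word in sorted(words, key=lambda b: (b.y, b.x)):
        for line in lines:
            ref = line[-1]
            overlap = min(word.y + word.h, ref.y + ref.h) - max(word.y, ref.y)
            if overlap > overlap_ratio * min(word.h, ref.h):
                line.append(word)
                break
        else:  # no vertically overlapping line found: start a new one
            lines.append([word])
    merged = []
    for line in lines:
        line.sort(key=lambda b: b.x)  # left-to-right reading order
        x = min(b.x for b in line)
        y = min(b.y for b in line)
        merged.append(Box(text=" ".join(b.text for b in line), x=x, y=y,
                          w=max(b.x + b.w for b in line) - x,
                          h=max(b.y + b.h for b in line) - y))
    return merged

def merge_lines_to_paragraphs(lines: List[Box], gap_factor: float = 1.5) -> List[str]:
    # Join consecutive lines whose vertical gap is small relative to line height.
    if not lines:
        return []
    lines = sorted(lines, key=lambda b: b.y)
    paragraphs, current = [], [lines[0].text]
    for prev, cur in zip(lines, lines[1:]):
        if cur.y - (prev.y + prev.h) <= gap_factor * prev.h:
            current.append(cur.text)
        else:
            paragraphs.append(" ".join(current))
            current = [cur.text]
    paragraphs.append(" ".join(current))
    return paragraphs

# Example: two words on one line, plus a second line far below.
words = [Box("HELLO", 0, 0, 50, 10), Box("WORLD", 55, 1, 50, 10),
         Box("Next", 0, 40, 30, 10)]
print(merge_lines_to_paragraphs(merge_words_to_lines(words)))
# -> ['HELLO WORLD', 'Next']

Vertical-overlap grouping followed by a gap test is a common heuristic for reading-order reconstruction from OCR output; the paper's actual merging criteria may differ.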