ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition

Cited by: 0
Authors
Wang, Xinyu [1 ,2 ,6 ]
Gui, Min [4 ,6 ]
Jiang, Yong [3 ]
Jia, Zixia [1 ,2 ]
Bach, Nguyen [5 ,6 ]
Wang, Tao
Huang, Zhongqiang [3 ]
Huang, Fei [3 ]
Tu, Kewei [1 ,2 ]
Affiliations
[1] ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai, Peoples R China
[2] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
[3] Alibaba Grp, ADAM Acad, Hangzhou, Peoples R China
[4] Shopee, Singapore, Singapore
[5] Microsoft, Redmond, WA USA
[6] Alibaba Grp, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Recently, Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention. Most existing work utilizes image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions because image and text representations are trained separately on data from their respective modalities and are not aligned in the same space. As text representations play the most important role in MNER, in this paper we propose Image-text Alignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first aligns the image into regional object tags, image-level captions, and optical characters as visual contexts, concatenates them with the input text as a new cross-modal input, and then feeds it into a pretrained textual embedding model. This makes it easier for the attention module of a pretrained textual embedding model to model the interaction between the two modalities, since both are represented in the textual space. ITA further aligns the output distributions predicted from the cross-modal and textual input views, so that the MNER model is more practical in dealing with text-only inputs and robust to noise from images. In our experiments, we show that ITA models achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information.
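The two alignments described in the abstract can be sketched in a few lines: turning the image into textual contexts (object tags, caption, OCR text) that are concatenated after the sentence, and a KL-style divergence between the output distributions of the cross-modal and text-only views. This is an illustrative sketch only, not the paper's implementation; the separator token, example tokens, and function names are assumptions, and a real model would apply the divergence per token over label distributions.

```python
import math

def build_cross_modal_input(text_tokens, object_tags, caption_tokens,
                            ocr_tokens, sep="[SEP]"):
    """Concatenate visual contexts, already rendered as text, after the
    input sentence so a pretrained textual embedding model sees both
    modalities in one textual space (the core ITA idea)."""
    return (text_tokens + [sep] + object_tags + [sep]
            + caption_tokens + [sep] + ocr_tokens)

def view_alignment_kl(p, q, eps=1e-12):
    """KL(p || q) between two label distributions, used in spirit to pull
    the cross-modal view's predictions toward the text-only view's."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical example: sentence plus visual contexts extracted elsewhere.
tokens = build_cross_modal_input(
    ["Obama", "visits", "Paris"],          # input text
    ["person", "tower"],                   # regional object tags
    ["a", "man", "near", "a", "tower"],    # image-level caption
    ["EIFFEL"],                            # optical characters (OCR)
)
```

Minimizing this divergence during training lets the same model handle text-only inputs at test time, which is why the abstract notes strong accuracy even without image information.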
Pages: 3176-3189
Page count: 14