ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition

Cited by: 0
Authors
Wang, Xinyu [1, 2, 6]
Gui, Min [4, 6]
Jiang, Yong [3]
Jia, Zixia [1, 2]
Bach, Nguyen [5, 6]
Wang, Tao
Huang, Zhongqiang [3]
Huang, Fei [3]
Tu, Kewei [1, 2]
Affiliations
[1] ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai, Peoples R China
[2] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
[3] Alibaba Grp, ADAM Acad, Hangzhou, Peoples R China
[4] Shopee, Singapore, Singapore
[5] Microsoft, Redmond, WA USA
[6] Alibaba Grp, Hangzhou, Peoples R China
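Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract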
Recently, Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention. Most of the work utilizes image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions, as image and text representations are trained separately on the data of their respective modalities and are not aligned in the same space. As text representations play the most important role in MNER, in this paper we propose Image-text Alignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first aligns the image into regional object tags, image-level captions and optical characters as visual contexts, concatenates them with the input texts as a new cross-modal input, and then feeds it into a pretrained textual embedding model. This makes it easier for the attention module of a pretrained textual embedding model to model the interaction between the two modalities, since they are both represented in the textual space. ITA further aligns the output distributions predicted from the cross-modal input and textual input views, so that the MNER model can be more practical in dealing with text-only inputs and more robust to noise from images. In our experiments, we show that ITA models can achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information.
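The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `[SEP]` separator, the whitespace tokenization, and the choice of KL divergence as the measure for comparing the two output distributions are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def build_cross_modal_input(text_tokens, object_tags, captions, ocr_texts, sep="[SEP]"):
    """Concatenate a sentence with textual visual contexts (hypothetical helper).

    Object tags, captions, and OCR strings are split on whitespace and
    appended after a separator token, so one text-only encoder could
    attend over both the sentence and the visual contexts.
    """
    visual_context = []
    for group in (object_tags, captions, ocr_texts):
        for item in group:
            visual_context.extend(item.split())
    return list(text_tokens) + [sep] + visual_context


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete label distributions.

    Assumed here as one possible way to compare the distribution predicted
    from the cross-modal view (p) with the one from the text-only view (q).
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Illustrative values only; these are not data from the paper.
tokens = ["Messi", "scores"]
cross_modal = build_cross_modal_input(
    tokens,
    object_tags=["person", "ball"],
    captions=["a man kicking a ball"],
    ocr_texts=["FCB"],
)
print(cross_modal)

# Toy per-token label distributions from the two input views.
p_cross_modal = [0.9, 0.1]
p_text_only = [0.5, 0.5]
print(kl_divergence(p_cross_modal, p_text_only))
```

In a training setup along these lines, the divergence term would be added to the usual tagging loss so that the text-only view is pulled toward the cross-modal view.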
Pages: 3176-3189
Page count: 14