ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition

Cited by: 0

Authors:

Wang, Xinyu [1,2,6]

Gui, Min [4,6]

Jiang, Yong [3]

Jia, Zixia [1,2]

Bach, Nguyen [5,6]

Wang, Tao

Huang, Zhongqiang [3]

Huang, Fei [3]

Tu, Kewei [1,2]

Affiliations:

[1] ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai, Peoples R China

[2] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China

[3] Alibaba Grp, ADAM Acad, Hangzhou, Peoples R China

[4] Shopee, Singapore, Singapore

[5] Microsoft, Redmond, WA USA

[6] Alibaba Grp, Hangzhou, Peoples R China

Source:

NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES | 2022

Funding:

National Natural Science Foundation of China;

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recently, Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention. Most of the work utilizes image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions, as image and text representations are trained separately on the data of their respective modality and are not aligned in the same space. As text representations play the most important role in MNER, in this paper, we propose Image-text Alignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first aligns the image into regional object tags, image-level captions and optical characters as visual contexts, concatenates them with the input texts as a new cross-modal input, and then feeds it into a pretrained textual embedding model. This makes it easier for the attention module of a pretrained textual embedding model to model the interaction between the two modalities, since they are both represented in the textual space. ITA further aligns the output distributions predicted from the cross-modal input and textual input views so that the MNER model can be more practical in dealing with text-only inputs and more robust to noise from images. In our experiments, we show that ITA models can achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information.
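
The two ideas described in the abstract (turning the image into text and concatenating it with the input, then aligning the predictions of the cross-modal and text-only views) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `[SEP]` separator, the order of the visual contexts, the example sentence and logits, and the choice of KL divergence as the alignment measure are all assumptions made for this example; the abstract does not specify them.

```python
import math

def build_cross_modal_input(text_tokens, object_tags, caption, ocr_text, sep="[SEP]"):
    """Append textual visual contexts to the input tokens (hypothetical layout)."""
    visual_context = object_tags + caption.split() + ocr_text.split()
    return text_tokens + [sep] + visual_context

def softmax(logits):
    """Numerically stable softmax over a list of label scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up example: a sentence plus object tags, a caption and OCR text
# that hypothetical upstream models might produce for its image.
tokens = ["Kevin", "Durant", "enters", "Oracle", "Arena"]
cross_modal = build_cross_modal_input(
    tokens,
    object_tags=["person", "ball"],
    caption="a man holding a basketball",
    ocr_text="WARRIORS",
)

# Made-up label scores for one token from the two input views.
cross_modal_probs = softmax([2.0, 0.5, 0.1])
text_only_probs = softmax([1.5, 0.7, 0.2])

# Assumed alignment term: small when both views predict similar labels.
alignment_loss = kl_divergence(cross_modal_probs, text_only_probs)
```

In a real system the concatenated sequence would be passed to a pretrained textual embedding model, and an alignment term of this kind would be added to the tagging loss so that the text-only view learns to match the cross-modal view.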


Pages: 3176 - 3189

Page count: 14
