ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition

Cited by: 0
Authors
Wang, Xinyu [1 ,2 ,6 ]
Gui, Min [4 ,6 ]
Jiang, Yong [3 ]
Jia, Zixia [1 ,2 ]
Bach, Nguyen [5 ,6 ]
Wang, Tao
Huang, Zhongqiang [3 ]
Huang, Fei [3 ]
Tu, Kewei [1 ,2 ]
Affiliations
[1] ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai, Peoples R China
[2] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
[3] Alibaba Grp, ADAM Acad, Hangzhou, Peoples R China
[4] Shopee, Singapore, Singapore
[5] Microsoft, Redmond, WA USA
[6] Alibaba Grp, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Recently, Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention. Most existing work utilizes image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions because image and text representations are trained separately on data from their respective modalities and are not aligned in the same space. As text representations play the most important role in MNER, in this paper we propose Image-text Alignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first aligns the image into regional object tags, image-level captions, and optical characters as visual contexts, concatenates them with the input text as a new cross-modal input, and then feeds it into a pretrained textual embedding model. This makes it easier for the attention module of a pretrained textual embedding model to model the interaction between the two modalities, since both are represented in the textual space. ITA further aligns the output distributions predicted from the cross-modal and textual input views, so that the MNER model is more practical in dealing with text-only inputs and robust to noise from images. In our experiments, we show that ITA models achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information.
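The two alignments described in the abstract can be sketched in a few lines: turning the image into textual contexts (object tags, caption, OCR text) that are concatenated after the sentence, and a KL-style divergence between the output distributions of the cross-modal and text-only views. This is an illustrative sketch only, not the paper's implementation; the separator token, example tokens, and function names are assumptions, and a real model would apply the divergence per token over label distributions.

```python
import math

def build_cross_modal_input(text_tokens, object_tags, caption_tokens,
                            ocr_tokens, sep="[SEP]"):
    """Concatenate visual contexts, already rendered as text, after the
    input sentence so a pretrained textual embedding model sees both
    modalities in one textual space (the core ITA idea)."""
    return (text_tokens + [sep] + object_tags + [sep]
            + caption_tokens + [sep] + ocr_tokens)

def view_alignment_kl(p, q, eps=1e-12):
    """KL(p || q) between two label distributions, used in spirit to pull
    the cross-modal view's predictions toward the text-only view's."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical example: sentence plus visual contexts extracted elsewhere.
tokens = build_cross_modal_input(
    ["Obama", "visits", "Paris"],          # input text
    ["person", "tower"],                   # regional object tags
    ["a", "man", "near", "a", "tower"],    # image-level caption
    ["EIFFEL"],                            # optical characters (OCR)
)
```

Minimizing this divergence during training lets the same model handle text-only inputs at test time, which is why the abstract notes strong accuracy even without image information.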
Pages: 3176-3189
Page count: 14