Review of Multimodal Named Entity Recognition Studies

Cited by: 0
Authors
Han P. [1 ,2 ]
Chen W. [1 ]
Affiliations
[1] School of Management, Nanjing University of Posts and Telecommunications, Nanjing
[2] Provincial Key Laboratory of Data Engineering and Knowledge Service, Nanjing University, Nanjing
Keywords
Feature Representation; Multimodal Fusion; Multimodal Named Entity Recognition; Multimodal Pre-training;
DOI
10.11925/infotech.2096-3467.2023.0488
Abstract
[Objective] This paper reviews multimodal named entity recognition research to provide references for future studies. [Coverage] We selected 83 representative papers using "multimodal named entity recognition", "multimodal information extraction", and "multimodal knowledge graph" as search terms in the Web of Science, IEEE Xplore, ACM Digital Library, and CNKI databases. [Methods] We summarized multimodal named entity recognition research in four aspects: concepts, feature representation, fusion strategies, and pre-trained models. We also identified existing problems and future research directions. [Results] Multimodal named entity recognition studies focus on modal feature representation and fusion, and have made some progress in the field of social media. They need to improve multimodal fine-grained feature extraction and semantic association mapping methods to enhance the models' generalization and interpretability. [Limitations] There is insufficient literature that directly takes multimodal named entity recognition as its research topic. [Conclusions] Our study provides new ideas for expanding the applications of multimodal learning, breaking modal barriers, and bridging semantic gaps. © 2024 Chinese Academy of Sciences. All rights reserved.
Pages: 50-63 (13 pages)
References
83 items in total
  • [41] Ren S Q, He K M, Girshick R, et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 6, pp. 1137-1149, (2017)
  • [42] Kiela D, Bottou L., Learning Image Embeddings Using Convolutional Neural Networks for Improved Multi-Modal Semantics, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 36-45, (2014)
  • [43] Anderson P, He X D, Buehler C, et al., Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6077-6086, (2018)
  • [44] Dong L H, Xu S, Xu B., Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition, Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5884-5888, (2018)
  • [45] Purwins H, Li B, Virtanen T, et al., Deep Learning for Audio Signal Processing, IEEE Journal of Selected Topics in Signal Processing, 13, 2, pp. 206-219, (2019)
  • [46] Hu Fengsong, Zhang Xuan, Speaker Recognition Method Based on Mel Frequency Cepstrum Coefficient and Inverted Mel Frequency Cepstrum Coefficient, Journal of Computer Applications, 32, 9, pp. 2542-2544, (2012)
  • [47] Zhang X, Yuan J L, Li L, et al., Reducing the Bias of Visual Objects in Multimodal Named Entity Recognition, Proceedings of the 16th ACM International Conference on Web Search and Data Mining, pp. 958-966, (2023)
  • [48] Liu P P, Li H, Ren Y M, et al., A Novel Framework for Multimodal Named Entity Recognition with Multi-level Alignments
  • [49] Khare Y, Bagal V, Mathew M, et al., MMBERT: Multimodal BERT Pretraining for Improved Medical VQA, Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging, pp. 1033-1036, (2021)
  • [50] Jiang Y G, Wu Z X, Wang J, et al., Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 2, pp. 352-364, (2018)