Multimodal image retrieval model based on semantic-enhanced feature fusion

Cited by: 0
Authors
Yang F. [1 ]
Ning B. [1 ]
Li H.-Q. [1 ]
Zhou X. [1 ]
Li G.-Y. [1 ]
Affiliations
[1] School of Information Science and Technology, Dalian Maritime University, Dalian
Keywords
attention mechanism; feature fusion; image retrieval; multimodality; semantic enhancement;
DOI
10.3785/j.issn.1008-973X.2023.02.005
CLC number
TP3 [computing technology, computer technology];
Subject classification code
0812;
Abstract
A multimodal image retrieval model based on semantic-enhanced feature fusion (SEFM) was proposed to establish the correlation between text features and image features in multimodal image retrieval tasks. Semantic enhancement was conducted on the combined features during feature fusion by two proposed modules: the text semantic enhancement module and the image semantic enhancement module. First, to enhance the text semantics, a multimodal dual attention mechanism was established in the text semantic enhancement module, which models the multimodal correlation between text and image. Second, to enhance the image semantics, retain intensity and update intensity were introduced in the image semantic enhancement module to control the degree to which the query image features are retained and updated in the combined features. With these two modules, the combined features can be optimized and brought closer to the target image features. The SEFM model was evaluated on the MIT-States and Fashion IQ datasets; experimental results show that the proposed model outperforms existing works on recall and precision metrics. © 2023 Zhejiang University. All rights reserved.
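The abstract does not give the paper's equations, so the two modules can only be illustrated schematically. The sketch below is a hypothetical NumPy rendering under two assumptions: that "multimodal dual attention" means text-to-image and image-to-text cross-attention whose context vectors are merged, and that "retain intensity" and "update intensity" are sigmoid gates (in the spirit of GRU-style gating) that blend the query image feature with the combined feature. All function and parameter names (`dual_attention`, `gated_fusion`, `W_r`, `W_u`) are illustrative, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention(text_feat, image_feat):
    """Hypothetical multimodal dual attention.

    text_feat:  (n_tokens, d)  token-level text features
    image_feat: (n_regions, d) region-level image features
    Returns a combined feature of shape (d,): the average of the
    text->image and image->text attention context vectors.
    """
    scores = text_feat @ image_feat.T               # (n_tokens, n_regions)
    t2i = softmax(scores, axis=1) @ image_feat      # text attends to image
    i2t = softmax(scores.T, axis=1) @ text_feat     # image attends to text
    return 0.5 * (t2i.mean(axis=0) + i2t.mean(axis=0))

def gated_fusion(query_image_vec, combined_vec, W_r, W_u):
    """Hypothetical retain/update gating for the image semantic
    enhancement module: sigmoid intensities control how much of the
    query image feature is kept vs. overwritten by the combined one."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    both = np.concatenate([query_image_vec, combined_vec])  # (2d,)
    retain = sigmoid(W_r @ both)                            # (d,) in (0, 1)
    update = sigmoid(W_u @ both)                            # (d,) in (0, 1)
    return retain * query_image_vec + update * combined_vec
```

In a trained model the gate weights `W_r` and `W_u` would be learned so that the fused output moves toward the target image feature; here they simply illustrate the data flow the abstract describes.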
Pages: 252-258
Page count: 6
References
6 items
  • [1] DUBEY S R. A decade survey of content-based image retrieval using deep learning [J]. IEEE Transactions on Circuits and Systems for Video Technology, 32, 5, pp. 2687-2704, (2022)
  • [2] PANG K T, LI K, YANG Y X, et al. Generalising fine-grained sketch-based image retrieval [C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 677-686, (2019)
  • [3] LIN T Y, CUI Y, BELONGIE S, et al. Learning deep representations for ground-to-aerial geolocalization [C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp. 5007-5015, (2015)
  • [4] ZHANG M, MAIDMENT T, DIAB A, et al. Domain-robust VQA with diverse datasets and methods but no target labels [C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7046-7056, (2021)
  • [5] CHEN L, JIANG Z, XIAO J, et al. Human-like controllable image captioning with verb-specific semantic roles [C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16846-16856, (2021)
  • [6] SANTORO A, RAPOSO D, BARRETT D G T, et al. A simple neural network module for relational reasoning [C]. Advances in Neural Information Processing Systems, 30, pp. 4967-4976, (2017)