MVItem: A Benchmark for Multi-View Cross-Modal Item Retrieval

Cited: 0
Authors
Li, Bo [1 ]
Zhu, Jiansheng [2 ]
Dai, Linlin [3 ]
Jing, Hui [3 ]
Huang, Zhizheng [3 ]
Sui, Yuteng [1 ]
Affiliations
[1] China Acad Railway Sci, Postgrad Dept, Beijing 100081, Peoples R China
[2] China Railway, Dept Sci Technol & Informat, Beijing 100844, Peoples R China
[3] China Acad Railway Sci Corp Ltd, Inst Comp Technol, Beijing 100081, Peoples R China
Source
IEEE ACCESS | 2024 / Vol. 12
Keywords
Annotations; Benchmark testing; Text to image; Deep learning; Contrastive learning; Open source software; Cross-modal retrieval; deep learning; item retrieval; contrastive text-image pre-training model; multi-view;
DOI
10.1109/ACCESS.2024.3447872
CLC Number
TP [automation technology, computer technology];
Subject Classification Code
0812;
Abstract
Existing text-image pre-training models have demonstrated strong generalization capabilities; however, their item-retrieval performance in real-world scenarios still falls short of expectations. To optimize the performance of text-image pre-training models for item retrieval in real scenarios, we present a benchmark called MVItem for exploring multi-view item retrieval, built on the open-source dataset MVImgNet. First, we evenly sample the items in MVImgNet to obtain 5 images from different views and automatically annotate these images with MiniGPT-4. Then, through manual cleaning and comparison, we provide a high-quality textual description for each sample. Next, to investigate the spatial misalignment problem of item retrieval in real-world scenarios and mitigate its impact on retrieval, we devise a multi-view feature fusion strategy and propose a cosine-distance balancing method based on Sequential Least Squares Programming (SLSQP) to fuse multiple view vectors, namely balancing cosine distance (BCD). On this basis, we select representative state-of-the-art text-image pre-training retrieval models as baselines and establish multiple test groups to examine how effectively multi-view information eases potential spatial misalignment in item retrieval. The experimental results show that retrieval with fused multi-view features is generally better than the baselines, indicating that multi-view feature fusion helps alleviate the impact of spatial misalignment on item retrieval. Moreover, the proposed fusion, balancing cosine distance (BCD), generally outperforms feature averaging, denoted as balancing Euclidean distance (BED) in this work.
From the results, we find that fusing multiple images with different views is more helpful for text-to-image (T2I) retrieval, while fusing a small number of images with large differences in view is more helpful for image-to-image (I2I) retrieval.
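The abstract does not spell out the exact BCD objective. As a hypothetical sketch only, assuming BCD seeks fusion weights that balance (equalize) the fused vector's cosine distance to every view embedding, an SLSQP-based implementation in the spirit described could look like the following; the variance-of-distances objective and the `bcd_fuse` helper are illustrative assumptions, not the authors' published formulation:

```python
# Hypothetical sketch of "balancing cosine distance" (BCD) fusion.
# Assumption: choose convex weights over the view embeddings, via SLSQP,
# so that the fused vector sits at (approximately) equal cosine distance
# from every view; minimizing the variance of those distances is one way
# to express "balancing". Plain averaging corresponds to BED in the paper.
import numpy as np
from scipy.optimize import minimize

def bcd_fuse(views: np.ndarray) -> np.ndarray:
    """views: (k, d) array of L2-normalized view embeddings."""
    k = views.shape[0]

    def cos_dists(w):
        f = w @ views                            # weighted fusion of views
        f = f / (np.linalg.norm(f) + 1e-12)      # re-normalize fused vector
        return 1.0 - views @ f                   # cosine distance to each view

    def objective(w):
        return np.var(cos_dists(w))              # "balance" the distances

    w0 = np.full(k, 1.0 / k)                     # start from plain averaging
    res = minimize(objective, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{"type": "eq",
                                 "fun": lambda w: w.sum() - 1.0}])
    f = res.x @ views
    return f / (np.linalg.norm(f) + 1e-12)

# Example: fuse 5 random unit vectors standing in for per-view embeddings.
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 512))
V /= np.linalg.norm(V, axis=1, keepdims=True)
fused = bcd_fuse(V)
```

Starting SLSQP from uniform weights makes plain averaging the fallback: the optimizer can only move away from it if doing so makes the view distances more balanced.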
Pages: 119563 - 119576
Page count: 14
Related Papers
50 items in total
  • [1] Robust Multi-View Hashing for Cross-Modal Retrieval
    Wang, Haitao
    Chen, Hui
    Meng, Min
    Wu, JiGang
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1012 - 1017
  • [2] Multi-View Fusion Through Cross-Modal Retrieval
    Cui, Limeng
    Chen, Zhensong
    Zhang, Jiawei
    He, Lifang
    Shi, Yong
    Yu, Philip S.
    2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2018, : 1977 - 1981
  • [3] Generalized Multi-View Embedding for Visual Recognition and Cross-Modal Retrieval
    Cao, Guanqun
    Iosifidis, Alexandros
    Chen, Ke
    Gabbouj, Moncef
    IEEE TRANSACTIONS ON CYBERNETICS, 2018, 48 (09) : 2542 - 2555
  • [4] Multi-view visual semantic embedding for cross-modal image–text retrieval
    Li, Zheng
    Guo, Caili
    Wang, Xin
    Zhang, Hao
    Hu, Lin
    PATTERN RECOGNITION, 2025, 159
  • [5] Multi-view Multi-label Canonical Correlation Analysis for Cross-modal Matching and Retrieval
    Sanghavi, Rushil
    Verma, Yashaswi
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4700 - 4709
  • [6] Learning discriminative hashing codes for cross-modal retrieval based on multi-view features
    Yu, Jun
    Wu, Xiao-Jun
    Kittler, Josef
    PATTERN ANALYSIS AND APPLICATIONS, 2020, 23 (03) : 1421 - 1438
  • [7] Multi-view collective tensor decomposition for cross-modal hashing
    Cui, Limeng
    Zhang, Jiawei
    He, Lifang
    Yu, Philip S.
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2019, 8 (01) : 47 - 59
  • [8] Multi-view Collective Tensor Decomposition for Cross-modal Hashing
    Cui, Limeng
    Chen, Zhensong
    Zhang, Jiawei
    He, Lifang
    Shi, Yong
    Yu, Philip S.
    ICMR '18: PROCEEDINGS OF THE 2018 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2018, : 73 - 81