FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback

Cited by: 28

Authors:

Goenka, Sonam [1]

Zheng, Zhaoheng [2]

Jaiswal, Ayush [1]

Chada, Rakesh [1]

Wu, Yue [1]

Hedau, Varsha [1]

Natarajan, Pradeep [1]

Affiliations:

[1] Amazon Alexa Nat Understanding, Berkeley, CA 94720 USA

[2] USC Viterbi Sch Engn, Los Angeles, CA USA

Source:

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2022

DOI:

10.1109/CVPR52688.2022.01371

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Fashion image retrieval based on a query pair of reference image and natural language feedback is a challenging task that requires models to assess fashion-related information from visual and textual modalities simultaneously. We propose a new vision-language transformer-based model, FashionVLP, that brings the prior knowledge contained in large image-text corpora to the domain of fashion image retrieval, and combines visual information from multiple levels of context to effectively capture fashion-related information. While queries are encoded through the transformer layers, our asymmetric design adopts a novel attention-based approach for fusing target image features without involving text or transformer layers in the process. Extensive results show that FashionVLP achieves state-of-the-art performance on benchmark datasets, with a large 23% relative improvement on the challenging FashionIQ dataset, which contains complex natural language feedback.

Pages: 14085-14095

Number of pages: 11
