FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback

Cited by: 28
Authors
Goenka, Sonam [1 ]
Zheng, Zhaoheng [2 ]
Jaiswal, Ayush [1 ]
Chada, Rakesh [1 ]
Wu, Yue [1 ]
Hedau, Varsha [1 ]
Natarajan, Pradeep [1 ]
Affiliations
[1] Amazon Alexa Nat Understanding, Berkeley, CA 94720 USA
[2] USC Viterbi Sch Engn, Los Angeles, CA USA
DOI
10.1109/CVPR52688.2022.01371
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Fashion image retrieval based on a query pair of reference image and natural language feedback is a challenging task that requires models to assess fashion-related information from visual and textual modalities simultaneously. We propose a new vision-language transformer based model, FashionVLP, that brings the prior knowledge contained in large image-text corpora to the domain of fashion image retrieval, and combines visual information from multiple levels of context to effectively capture fashion-related information. While queries are encoded through the transformer layers, our asymmetric design adopts a novel attention-based approach for fusing target image features without involving text or transformer layers in the process. Extensive results show that FashionVLP achieves state-of-the-art performance on benchmark datasets, with a large 23% relative improvement on the challenging FashionIQ dataset, which contains complex natural language feedback.
Pages: 14085-14095
Page count: 11
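
Illustrative sketch
The abstract describes fusing target image features with attention, without text or transformer layers, and then retrieving against an encoded query. The Python below is only a minimal sketch of that general idea, not the authors' implementation. The names `attention_pool`, `rank_targets`, and `scoring_vector`, the dot-product scoring, and the cosine ranking are all assumptions introduced for illustration; the record does not specify any of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scoring_vector):
    """Fuse several equal-length feature vectors into one vector.

    `features` stands in for target image features taken at different
    levels of context (an assumption). `scoring_vector` is a hypothetical
    learned vector; each feature's attention score is its dot product
    with that vector.
    """
    scores = [sum(x * w for x, w in zip(f, scoring_vector)) for f in features]
    alphas = softmax(scores)
    dim = len(features[0])
    return [sum(a * f[d] for a, f in zip(alphas, features)) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_targets(query_embedding, candidates, scoring_vector):
    """Return candidate names sorted by similarity to the query, best first.

    `query_embedding` is assumed to come from a separate query encoder
    (reference image plus feedback text), which is not modeled here.
    """
    fused = {
        name: attention_pool(feats, scoring_vector)
        for name, feats in candidates.items()
    }
    return sorted(
        fused,
        key=lambda name: cosine(query_embedding, fused[name]),
        reverse=True,
    )
```

In this sketch the query side and the target side are handled by different code paths, which mirrors the asymmetry the abstract mentions: only the target features pass through the lightweight attention pooling.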