Masked Vision-language Transformer in Fashion

Cited by: 6
Authors
Ji, Ge-Peng [1]
Zhuge, Mingchen [1]
Gao, Dehong [1]
Fan, Deng-Ping [2]
Sakaridis, Christos [2]
Van Gool, Luc [2]
Affiliations
[1] Alibaba Group, International Core Business Unit, Hangzhou 310051, China
[2] Swiss Federal Institute of Technology, Computer Vision Lab, CH-8092 Zurich, Switzerland
Keywords
Vision-language; masked image reconstruction; transformer; fashion; e-commercial
DOI
10.1007/s11633-022-1394-4
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that accepts raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT generalizes easily to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
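The rank@5 figure in the abstract is a retrieval metric. As a rough illustration only, assuming the common definition (a query counts as a hit when its ground-truth match is among the K highest-scoring candidates), rank@K could be computed along these lines. The function name `rank_at_k`, the tie handling, and the toy scores are hypothetical and are not taken from the MVLT code or paper.

```python
from typing import Sequence


def rank_at_k(
    scores: Sequence[Sequence[float]],
    positives: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose ground-truth candidate is in the top k.

    scores[i][j] is a (hypothetical) matching score between query i and
    candidate j; positives[i] is the index of the correct candidate.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(scores) != len(positives):
        raise ValueError("scores and positives must have the same length")
    if not scores:
        return 0.0

    hits = 0
    for row, pos in zip(scores, positives):
        # Zero-based rank of the positive: how many candidates score
        # strictly higher. Ties are resolved in favour of the positive,
        # which is an assumption of this sketch.
        rank = sum(1 for s in row if s > row[pos])
        if rank < k:
            hits += 1
    return hits / len(scores)


# Toy data (made up): three queries, three candidates each.
toy_scores = [
    [0.9, 0.1, 0.3],  # positive (index 0) is ranked first
    [0.2, 0.8, 0.5],  # positive (index 2) is ranked second
    [0.7, 0.6, 0.1],  # positive (index 2) is ranked third
]
toy_positives = [0, 2, 2]

for k in (1, 2, 3):
    print(f"rank@{k} = {rank_at_k(toy_scores, toy_positives, k):.3f}")
```

On the toy data, one of three queries is a hit at K=1, two at K=2, and all three at K=3. The abstract does not state the candidate-set size or tie handling used in the reported evaluation, so those details should be taken from the paper or its repository rather than from this sketch.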
Pages: 421-434
Page count: 14
Related papers
50 in total
  • [1] Masked Vision-language Transformer in Fashion
    Ji, Ge-Peng
    Zhuge, Mingchen
    Gao, Dehong
    Fan, Deng-Ping
    Sakaridis, Christos
    Van Gool, Luc
    [J]. Machine Intelligence Research, 2023, 20 : 421 - 434
  • [2] TVLT: Textless Vision-Language Transformer
    Tang, Zineng
    Cho, Jaemin
    Nie, Yixin
    Bansal, Mohit
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [3] Vision-Language Transformer and Query Generation for Referring Segmentation
    Ding, Henghui
    Liu, Chang
    Wang, Suchen
    Jiang, Xudong
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 16301 - 16310
  • [4] Vision-Language Transformer for Interpretable Pathology Visual Question Answering
    Naseem, Usman
    Khushi, Matloob
    Kim, Jinman
    [J]. IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2023, 27 (04) : 1681 - 1690
  • [5] VLT: Vision-Language Transformer and Query Generation for Referring Segmentation
    Ding, Henghui
    Liu, Chang
    Wang, Suchen
    Jiang, Xudong
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7900 - 7916
  • [6] MAGVLT: Masked Generative Vision-and-Language Transformer
    Kim, Sungwoong
    Jo, Daejin
    Lee, Donghoon
    Kim, Jongmin
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23338 - 23348
  • [7] FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback
    Goenka, Sonam
    Zheng, Zhaoheng
    Jaiswal, Ayush
    Chada, Rakesh
    Wu, Yue
    Hedau, Varsha
    Natarajan, Pradeep
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 14085 - 14095
  • [8] Target-Driven Structured Transformer Planner for Vision-Language Navigation
    Zhao, Yusheng
    Chen, Jinyu
    Gao, Chen
    Wang, Wenguan
    Yang, Lirong
    Ren, Haibing
    Xia, Huaxia
    Liu, Si
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4194 - 4203
  • [9] Unifying Vision-Language Representation Space with Single-Tower Transformer
    Jang, Jiho
    Kong, Chaerin
    Jeon, Donghyeon
    Kim, Seonhoon
    Kwak, Nojun
    [J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 980 - 988
  • [10] Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
    Zhuge, Mingchen
    Gao, Dehong
    Fan, Deng-Ping
    Jin, Linbo
    Chen, Ben
    Zhou, Haoming
    Qiu, Minghui
    Shao, Ling
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 12642 - 12652