Masked Vision-language Transformer in Fashion

Cited: 6
Authors
Ji, Ge-Peng [1 ]
Zhuge, Mingchen [1 ]
Gao, Dehong [1 ]
Fan, Deng-Ping [2 ]
Sakaridis, Christos [2 ]
Gool, Luc Van [2 ]
Affiliations
[1] Alibaba Grp, Int Core Business Unit, Hangzhou 310051, Peoples R China
[2] Swiss Fed Inst Technol, Comp Vis Lab, CH-8092 Zurich, Switzerland
Keywords
Vision-language; masked image reconstruction; transformer; fashion; e-commerce
DOI
10.1007/s11633-022-1394-4
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that accepts raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT generalizes easily to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
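For readers who want a concrete picture of the masked image reconstruction (MIR) objective named in the abstract, below is a minimal, hypothetical PyTorch sketch: it splits an image into patches, masks a random subset of patch tokens, encodes them with a small transformer, and regresses the raw pixels of the masked patches. All class names, dimensions, and the 25% mask ratio are illustrative assumptions, not MVLT's actual configuration; see the repository above for the real implementation.

```python
import torch
import torch.nn as nn

class MaskedImageReconstruction(nn.Module):
    """Hypothetical sketch of a MIR-style objective (not MVLT's exact model)."""

    def __init__(self, img_size=224, patch_size=16, dim=256,
                 depth=4, heads=8, mask_ratio=0.25):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size
        self.to_embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_pixels = nn.Linear(dim, patch_dim)
        self.mask_ratio = mask_ratio

    def patchify(self, imgs):
        # (B, 3, H, W) -> (B, N, patch_dim): flatten non-overlapping patches.
        p = self.patch_size
        B, C, H, W = imgs.shape
        x = imgs.reshape(B, C, H // p, p, W // p, p)
        return x.permute(0, 2, 4, 3, 5, 1).reshape(B, -1, p * p * C)

    def forward(self, imgs):
        patches = self.patchify(imgs)                     # (B, N, patch_dim)
        tokens = self.to_embed(patches) + self.pos_embed  # (B, N, dim)
        # Randomly replace a subset of patch tokens with the mask token.
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand_as(tokens), tokens)
        decoded = self.to_pixels(self.encoder(tokens))    # (B, N, patch_dim)
        # Pixel-regression loss computed only on the masked patches.
        return ((decoded - patches) ** 2)[mask].mean()

# Usage: one training step on a dummy batch.
model = MaskedImageReconstruction()
loss = model(torch.randn(2, 3, 224, 224))
loss.backward()
```

In a full vision-language setup, the visible patch tokens would be encoded jointly with text tokens so that reconstruction also exploits the caption; the sketch keeps only the image branch for brevity.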
Pages: 421 - 434
Page count: 14
Related Papers
50 records in total
  • [41] Core Challenges in Embodied Vision-Language Planning
    Francis, Jonathan
    Kitamura, Nariaki
    Labelle, Felix
    Lu, Xiaopeng
    Navarro, Ingrid
    Oh, Jean
    [J]. Journal of Artificial Intelligence Research, 2022, 74 : 459 - 515
  • [42] Vision-Language Models for Robot Success Detection
    Luo, Fiona
[J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23750 - 23752
  • [43] Learning to Prompt for Vision-Language Emotion Recognition
    Xie, Hongxia
    Chung, Hua
    Shuai, Hong-Han
    Cheng, Wen-Huang
    [J]. 2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS, ACIIW, 2023,
  • [44] Exploring Vision-Language Models for Imbalanced Learning
Wang, Y.
Yu, Z.
Wang, J.
Heng, Q.
Chen, H.
Ye, W.
Xie, R.
Xie, X.
Zhang, S.
    [J]. International Journal of Computer Vision, 2024, 132 (1) : 224 - 237
  • [45] HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification
    Ouyang, Shuyi
    Wang, Hongyi
    Niu, Ziwei
    Bai, Zhenjia
    Xie, Shiao
    Xu, Yingying
    Tong, Ruofeng
    Chen, Yen-Wei
    Lin, Lanfen
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4768 - 4777
  • [46] Vision-Language Navigation Policy Learning and Adaptation
    Wang, Xin
    Huang, Qiuyuan
    Celikyilmaz, Asli
    Gao, Jianfeng
    Shen, Dinghan
    Wang, Yuan-Fang
    Wang, William Yang
    Zhang, Lei
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, 43 (12) : 4205 - 4216
  • [47] Structured Scene Memory for Vision-Language Navigation
    Wang, Hanqing
    Wang, Wenguan
    Liang, Wei
    Xiong, Caiming
    Shen, Jianbing
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 8451 - 8460
  • [48] Survey on Vision-language Pre-training
    Yin, Jiong
    Zhang, Zhe-Dong
    Gao, Yu-Han
    Yang, Zhi-Wen
    Li, Liang
    Xiao, Mang
    Sun, Yao-Qi
    Yan, Cheng-Gang
    [J]. Ruan Jian Xue Bao/Journal of Software, 2023, 34 (05): : 2000 - 2023
  • [49] Task Residual for Tuning Vision-Language Models
    Yu, Tao
    Lu, Zhihe
    Jin, Xin
    Chen, Zhibo
    Wang, Xinchao
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10899 - 10909
  • [50] Perceptual Grouping in Contrastive Vision-Language Models
    Ranasinghe, Kanchana
    McKinzie, Brandon
    Ravi, Sachin
    Yang, Yinfei
    Toshev, Alexander
    Shlens, Jonathon
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 5548 - 5561