STAIR: Learning Sparse Text and Image Representation in Grounded Tokens

Cited by: 0
Authors
Chen, Chen [1]
Zhang, Bowen [1]
Cao, Liangliang [1]
Shen, Jiguang [1]
Gunter, Tom [1]
Jose, Albin Madappally [1]
Toshev, Alexander [1]
Zheng, Yantao [1]
Shlens, Jonathon [1]
Pang, Ruoming [1]
Yang, Yinfei [1]
Affiliations
[1] Apple AI ML, Beijing, Peoples R China
Keywords
CLASSIFICATION;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image and text retrieval is one of the foundational tasks in the vision and language domain, with multiple real-world applications. State-of-the-art contrastive approaches, e.g., CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), represent images and texts as dense embeddings and compute the similarity in the dense embedding space as the matching score. Sparse semantic features such as bag-of-words models, on the other hand, are inherently more interpretable but are believed to be less accurate than dense representations. In this work, we show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), in which images and texts are mapped to a sparse token space. Each token in the space is a (sub-)word in the vocabulary, which makes the representation not only interpretable but also easy to integrate with existing information retrieval systems. The STAIR model significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements on COCO-5k text -> image and image -> text retrieval, respectively. It also achieves better performance than CLIP on both ImageNet zero-shot classification and linear probing.
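The matching scheme described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the token weights below are invented toy values, and the helper names (`sparse_dot`, `top_tokens`, `recall_at_1`) are hypothetical. It assumes only that each image or text is represented as a mapping from vocabulary tokens to non-negative weights, that the matching score is a dot product over shared tokens, and that Recall@1 counts how often the top-scoring candidate is the paired one.

```python
from typing import Dict, List

# A sparse representation: vocabulary token -> weight (absent tokens are zero).
SparseVec = Dict[str, float]


def sparse_dot(a: SparseVec, b: SparseVec) -> float:
    """Dot product over the tokens present in both sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b[t] for t, w in a.items() if t in b)


def top_tokens(v: SparseVec, k: int = 3) -> List[str]:
    """Highest-weighted tokens, a simple way to inspect a representation."""
    return [t for t, _ in sorted(v.items(), key=lambda kv: -kv[1])[:k]]


def recall_at_1(queries: List[SparseVec], candidates: List[SparseVec]) -> float:
    """Fraction of queries whose best-scoring candidate shares their index."""
    hits = 0
    for i, q in enumerate(queries):
        scores = [sparse_dot(q, c) for c in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(queries)


# Hypothetical toy data: texts[i] is assumed to be paired with images[i].
texts = [
    {"dog": 1.2, "park": 0.8},
    {"cat": 1.5, "sofa": 0.6},
]
images = [
    {"dog": 0.9, "grass": 0.7, "park": 0.4},
    {"cat": 1.1, "sofa": 0.9, "dog": 0.1},
]

if __name__ == "__main__":
    print(top_tokens(images[0], k=2))
    print(recall_at_1(texts, images))
```

Because each dimension is a vocabulary token, the same vectors could in principle be stored in an inverted index, which is the kind of integration with existing retrieval systems that the abstract alludes to.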
Pages: 15079 - 15094
Page count: 16