STAIR: Learning Sparse Text and Image Representation in Grounded Tokens

Cited by: 0
Authors
Chen, Chen [1]
Zhang, Bowen [1]
Cao, Liangliang [1]
Shen, Jiguang [1]
Gunter, Tom [1]
Jose, Albin Madappally [1]
Toshev, Alexander [1]
Zheng, Yantao [1]
Shlens, Jonathon [1]
Pang, Ruoming [1]
Yang, Yinfei [1]
Affiliations
[1] Apple AI ML, Beijing, Peoples R China
Keywords
CLASSIFICATION;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image and text retrieval is one of the foundational tasks in the vision and language domain, with multiple real-world applications. State-of-the-art contrastive approaches, e.g., CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), represent images and texts as dense embeddings and use the similarity in the dense embedding space as the matching score. Sparse semantic features such as bag-of-words models, on the other hand, are inherently more interpretable but are believed to be less accurate than dense representations. In this work, we show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), in which the image and text are mapped to a sparse token space. Each token in the space is a (sub-)word in the vocabulary, which makes the representation not only interpretable but also easy to integrate with existing information retrieval systems. The STAIR model significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvement on COCO-5k text -> image and image -> text retrieval, respectively. It also achieves better performance than CLIP on both ImageNet zero-shot classification and linear probing.
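The idea of scoring images and texts in a shared sparse token space can be illustrated with a minimal sketch. This is not the STAIR model or its training procedure: the token weights below are hand-written, hypothetical values standing in for what a trained encoder would produce, and the helper names (`sparse_dot`, `build_inverted_index`, `search`, `recall_at_1`) are assumptions made for illustration. The sketch only shows why such a representation is interpretable (each dimension is a vocabulary token) and why it fits an inverted index, the data structure used by classic retrieval systems.

```python
from collections import defaultdict


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b[t] for t, w in a.items() if t in b)


def build_inverted_index(docs):
    """Map each token to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for token, weight in vec.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query):
    """Score only documents that share a token with the query, best first."""
    scores = defaultdict(float)
    for token, q_weight in query.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def recall_at_1(index, queries, gold):
    """Fraction of queries whose top-ranked document is the gold match."""
    hits = 0
    for query_id, query in queries.items():
        ranking = search(index, query)
        if ranking and ranking[0][0] == gold[query_id]:
            hits += 1
    return hits / len(queries)


# Hypothetical sparse representations (hand-written, not model outputs).
images = {
    "img_dog": {"dog": 2.0, "grass": 1.0, "ball": 0.5},
    "img_cat": {"cat": 2.0, "sofa": 1.5},
}
texts = {
    "a dog playing on grass": {"dog": 1.5, "grass": 1.0, "playing": 0.5},
    "a cat sleeping on a sofa": {"cat": 1.0, "sofa": 1.0, "sleeping": 0.5},
}
gold = {
    "a dog playing on grass": "img_dog",
    "a cat sleeping on a sofa": "img_cat",
}

index = build_inverted_index(images)
ranking = search(index, texts["a dog playing on grass"])
print(ranking)
print(recall_at_1(index, texts, gold))
```

Because the matching score is a sum over shared tokens, the tokens that contributed to a match ("dog" and "grass" in the first query) can be read off directly, which is the interpretability property the abstract refers to.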
Pages: 15079-15094
Page count: 16