STAIR: Learning Sparse Text and Image Representation in Grounded Tokens

Cited by: 0
Authors
Chen, Chen [1 ]
Zhang, Bowen [1 ]
Cao, Liangliang [1 ]
Shen, Jiguang [1 ]
Gunter, Tom [1 ]
Jose, Albin Madappally [1 ]
Toshev, Alexander [1 ]
Zheng, Yantao [1 ]
Shlens, Jonathon [1 ]
Pang, Ruoming [1 ]
Yang, Yinfei [1 ]
Affiliations
[1] Apple AI ML, Beijing, Peoples R China
Keywords
CLASSIFICATION;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image and text retrieval is one of the foundational tasks in the vision and language domain, with multiple real-world applications. State-of-the-art contrastive approaches, e.g. CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), represent images and texts as dense embeddings and calculate the similarity in the dense embedding space as the matching score. On the other hand, sparse semantic features such as bag-of-words models are inherently more interpretable, but are believed to be less accurate than dense representations. In this work, we show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space. Each token in the space is a (sub-)word in the vocabulary, which is not only interpretable but also easy to integrate with existing information retrieval systems. The STAIR model significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvement on COCO-5k text -> image and image -> text retrieval, respectively. It also achieves better performance than CLIP on both ImageNet zero-shot classification and linear probing.
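To make the idea of a sparse token space more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each image or text has already been encoded as a small mapping from vocabulary tokens to non-negative weights; the tokens, weights, document ids, and the helper names `sparse_dot`, `build_inverted_index`, and `search` are all hypothetical. It only illustrates why such a representation plugs naturally into an inverted index, the data structure used by classical information retrieval systems.

```python
from collections import defaultdict

# Hypothetical sparse embeddings: token -> non-negative weight.
# Tokens and weights are invented for illustration, not taken from the paper.
images = {
    "img_dog": {"dog": 1.5, "grass": 0.75, "ball": 0.5},
    "img_cat": {"cat": 1.25, "sofa": 1.0},
}


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b[t] for t, w in a.items() if t in b)


def build_inverted_index(docs):
    """Map each token to the (doc_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for token, weight in vec.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query, k=1):
    """Score only documents sharing at least one token with the query."""
    scores = defaultdict(float)
    for token, q_weight in query.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


index = build_inverted_index(images)
query = {"dog": 1.0, "ball": 0.5}  # hypothetical sparse text embedding
top = search(index, query, k=1)
print(top)
```

Because every dimension corresponds to a vocabulary token, the contribution of each token to the matching score can be read off directly, which is the interpretability property the abstract refers to.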
Pages: 15079-15094
Page count: 16