STAIR: Learning Sparse Text and Image Representation in Grounded Tokens

Cited by: 0

Authors:

Chen, Chen [1]
Zhang, Bowen [1]
Cao, Liangliang [1]
Shen, Jiguang [1]
Gunter, Tom [1]
Jose, Albin Madappally [1]
Toshev, Alexander [1]
Zheng, Yantao [1]
Shlens, Jonathon [1]
Pang, Ruoming [1]
Yang, Yinfei [1]

Affiliations:

[1] Apple AI/ML, Beijing, People's Republic of China

Source:

2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023) | 2023

Keywords:

CLASSIFICATION;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Image and text retrieval is one of the foundational tasks in the vision and language domain, with multiple real-world applications. State-of-the-art contrastive approaches, e.g. CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), represent images and texts as dense embeddings and use the similarity in the dense embedding space as the matching score. Sparse semantic features such as bag-of-words models, on the other hand, are inherently more interpretable but are believed to be less accurate than dense representations. In this work, we show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), in which images and texts are mapped to a sparse token space. Each token in the space is a (sub-)word in the vocabulary, which makes the representation not only interpretable but also easy to integrate with existing information retrieval systems. The STAIR model significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements on COCO-5k text -> image and image -> text retrieval respectively. It also achieves better performance than CLIP on both ImageNet zero-shot classification and linear probing.
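As a rough illustration of the idea in the abstract, the sketch below scores text and image representations that live in a shared sparse token space, then computes Recall@1 on toy data. It is not the STAIR implementation: the names `SparseVec`, `sparse_dot` and `recall_at_1`, the dictionary encoding, and all token weights are assumptions made up for this example, and the dot product is simply one plausible matching score for sparse vectors.

```python
from typing import Dict, List

# Hypothetical encoding: a sparse representation maps vocabulary tokens
# to non-negative weights; tokens that are absent have weight zero.
SparseVec = Dict[str, float]


def sparse_dot(a: SparseVec, b: SparseVec) -> float:
    """Dot product over the tokens that two sparse vectors share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[t] for t, w in a.items() if t in b)


def recall_at_1(queries: List[SparseVec],
                candidates: List[SparseVec],
                gold: List[int]) -> float:
    """Fraction of queries whose top-scoring candidate is the gold match."""
    hits = 0
    for query, gold_index in zip(queries, gold):
        scores = [sparse_dot(query, c) for c in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        hits += int(best == gold_index)
    return hits / len(queries)


# Toy data with invented weights (illustrative only, not model output).
images: List[SparseVec] = [
    {"dog": 1.2, "grass": 0.7, "ball": 0.4},
    {"cat": 1.1, "sofa": 0.9},
]
texts: List[SparseVec] = [
    {"dog": 1.0, "ball": 0.8},
    {"cat": 0.9, "sleeping": 0.5, "sofa": 0.6},
]

# Text -> image retrieval: text i should retrieve image i.
text_to_image_r1 = recall_at_1(texts, images, gold=[0, 1])
```

Because each dimension is a readable token, the tokens that contribute to a match (here "dog" and "ball" for the first pair) can be inspected directly, and vectors of this shape are the kind of input an inverted index can serve.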


Pages: 15079-15094

Page count: 16
