SINC: Self-Supervised In-Context Learning for Vision-Language Tasks

Cited: 0
Authors
Chen, Yi-Syuan [1 ]
Song, Yun-Zhu [1 ]
Yeo, Cheng Yu [1 ]
Liu, Bei [2 ]
Fu, Jianlong [2 ]
Shuai, Hong-Han [1 ]
Affiliations
[1] Natl Yang Ming Chiao Tung Univ, Hsinchu, Taiwan
[2] Microsoft Res Asia, Beijing, Peoples R China
Keywords
DOI
10.1109/ICCV51070.2023.01415
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Large pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in the input. Recent works extend this ability to the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods can inherit issues from the language domain, such as template sensitivity and hallucination. Moreover, the scale of these language models imposes a significant computational demand, making them resource-intensive to train and operate. To this end, we raise a question: "How can we enable in-context learning without relying on the intrinsic in-context ability of large language models?" To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning (SINC), which introduces a meta-model that learns from self-supervised prompts consisting of tailored demonstrations. The learned models can be transferred to downstream tasks to make in-context predictions on the fly. Extensive experiments show that SINC outperforms gradient-based methods on various vision-language tasks under few-shot settings. Furthermore, the design of SINC allows us to investigate the benefits of in-context learning across different tasks, and the analysis further reveals the essential components for the emergence of in-context learning in the vision-language domain.
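To make the abstract's core notion concrete, here is a minimal toy sketch of what "constructing a new predictor from demonstrations presented in the input, without gradient updates" can mean: the few-shot demonstrations arrive with the query, and a classifier is assembled from them on the fly. This is an illustrative nearest-neighbor stand-in only; it is not SINC's meta-model or architecture, and all names, features, and labels below are hypothetical.

```python
import math

def in_context_predict(demos, query):
    """Toy in-context 'predictor': the (feature, label) demonstrations are
    part of the input, and the classifier is built from them at inference
    time -- no parameter updates, no training loop. Illustrative only;
    not the SINC method from the paper."""
    def cosine(u, v):
        # cosine similarity between two feature vectors
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv + 1e-9)

    # return the label of the demonstration most similar to the query
    best_feature, best_label = max(demos, key=lambda d: cosine(d[0], query))
    return best_label

# two few-shot demonstrations (hypothetical image features) and one query
demos = [([1.0, 0.0], "cat"), ([0.0, 1.0], "dog")]
print(in_context_predict(demos, [0.9, 0.1]))  # prints "cat"
```

Swapping in a different set of demonstrations yields a different predictor immediately, which is the property that distinguishes in-context learning from gradient-based few-shot fine-tuning.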
Pages: 15384-15396
Page count: 13
Related Papers (50 total)
  • [1] MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models
    Monajatipoor, Masoud
    Li, Liunian Harold
    Rouhsedaghat, Mozhdeh
    Yang, Lin F.
    Chang, Kai-Wei
    61ST CONFERENCE OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023, : 495 - 508
  • [2] SELF-SUPERVISED VISION-LANGUAGE PRETRAINING FOR MEDICAL VISUAL QUESTION ANSWERING
    Li, Pengfei
    Liu, Gang
    Tan, Lin
    Liao, Jinying
    Zhong, Shenjun
    2023 IEEE 20TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING, ISBI, 2023,
  • [3] Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
    Wang, Xin
    Huang, Qiuyuan
    Celikyilmaz, Asli
    Gao, Jianfeng
    Shen, Dinghan
    Wang, Yuan-Fang
    Wang, William Yang
    Zhang, Lei
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 3622 - 6631
  • [4] Improving In-Context Few-Shot Learning via Self-Supervised Training
    Chen, Mingda
    Du, Jingfei
    Pasunuru, Ramakanth
    Mihaylov, Todor
    Iyer, Srini
    Stoyanov, Veselin
    Kozareva, Zornitsa
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 3558 - 3573
  • [5] Self-Supervised Domain Adaptation for Computer Vision Tasks
    Xu, Jiaolong
    Xiao, Liang
    Lopez, Antonio M.
    IEEE ACCESS, 2019, 7 : 156694 - 156706
  • [6] Vision-Language Models for Vision Tasks: A Survey
    Zhang, Jingyi
    Huang, Jiaxing
    Jin, Sheng
    Lu, Shijian
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (08) : 5625 - 5644
  • [7] Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
    Baevski, Alexei
    Babu, Arun
    Hsu, Wei-Ning
    Auli, Michael
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 202, 2023
  • [8] Causal Attention for Vision-Language Tasks
    Yang, Xu
    Zhang, Hanwang
    Qi, Guojun
    Cai, Jianfei
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 9842 - 9852
  • [9] Context Autoencoder for Self-supervised Representation Learning
    Chen, Xiaokang
    Ding, Mingyu
    Wang, Xiaodi
    Xin, Ying
    Mo, Shentong
    Wang, Yunhao
    Han, Shumin
    Luo, Ping
    Zeng, Gang
    Wang, Jingdong
    International Journal of Computer Vision, 2024, 132 : 208 - 223
  • [10] Improvements to context based self-supervised learning
    Mundhenk, T. Nathan
    Ho, Daniel
    Chen, Barry Y.
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 9339 - 9348