SINC: Self-Supervised In-Context Learning for Vision-Language Tasks

Authors
Chen, Yi-Syuan [1]
Song, Yun-Zhu [1]
Yeo, Cheng Yu [1]
Liu, Bei [2]
Fu, Jianlong [2]
Shuai, Hong-Han [1]
Affiliations
[1] National Yang Ming Chiao Tung University, Hsinchu, Taiwan
[2] Microsoft Research Asia, Beijing, People's Republic of China
DOI
10.1109/ICCV51070.2023.01415
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in the inputs. Recent works promote this ability in the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods could inherit issues from the language domain, such as template sensitivity and hallucination. Also, the scale of these language models raises a significant demand for computation, making learning and operating these models resource-intensive. To this end, we raise a question: "How can we enable in-context learning without relying on the intrinsic in-context ability of large language models?" To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning (SINC), that introduces a meta-model to learn on self-supervised prompts consisting of tailored demonstrations. The learned models can be transferred to downstream tasks for making in-context predictions on the fly. Extensive experiments show that SINC outperforms gradient-based methods in various vision-language tasks under few-shot settings. Furthermore, the designs of SINC help us investigate the benefits of in-context learning across different tasks, and the analysis further reveals the essential components for the emergence of in-context learning in the vision-language domain.
Pages: 15384-15396
Number of pages: 13