Towards More Unified In-context Visual Understanding

Cited by: 0
Authors
Sheng, Dianmo [1 ,2 ]
Chen, Dongdong [3 ]
Tan, Zhentao [1 ,2 ]
Liu, Qiankun [4 ]
Chu, Qi [1 ,2 ]
Bao, Jianmin [3 ]
Gong, Tao [1 ,2 ]
Liu, Bin [1 ,2 ]
Xu, Shengwei [5 ]
Yu, Nenghai [1 ,2 ]
Affiliations
[1] Univ Sci & Technol China, Sch Cyber Sci & Technol, Hefei, Anhui, Peoples R China
[2] Anhui Prov Key Lab Digital Secur, Hefei, Anhui, Peoples R China
[3] Microsoft, Redmond, WA USA
[4] Beijing Inst Technol, Beijing, Peoples R China
[5] Beijing Elect Sci & Technol Inst, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
10.1109/CVPR52733.2024.01269
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The rapid advancement of large language models (LLMs) has accelerated the emergence of in-context learning (ICL) as a cutting-edge approach in the natural language processing domain. Recently, ICL has been employed in visual understanding tasks such as semantic segmentation and image captioning, yielding promising results. However, existing visual ICL frameworks cannot produce content across multiple modalities, which limits their potential usage scenarios. To address this issue, we present a new ICL framework for visual understanding with multi-modal output enabled. First, we quantize and embed both text and visual prompts into a unified representational space, structured as interleaved in-context sequences. Then, a decoder-only sparse transformer architecture is employed to perform generative modeling on them, facilitating in-context learning. Thanks to this design, the model is capable of handling in-context vision understanding tasks with multimodal output in a unified pipeline. Experimental results demonstrate that our model achieves competitive performance compared with specialized models and previous ICL baselines. Overall, our research takes a further step toward unified multimodal in-context learning.
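A minimal sketch of the pipeline the abstract describes: text tokens and quantized visual tokens share one unified vocabulary, are interleaved into a single in-context sequence, and are modeled autoregressively by a decoder-only transformer whose feed-forward layers are sparsely routed in a mixture-of-experts style. All class names, vocabulary sizes, layer counts, and the top-1 routing rule below are illustrative assumptions, not the authors' released code.

```python
# Sketch only: stand-in for the "unified token space + decoder-only sparse
# transformer" idea in the abstract; sizes and routing are assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB, VISUAL_VOCAB, D_MODEL = 32000, 8192, 512  # hypothetical sizes


class SparseFFN(nn.Module):
    """Toy mixture-of-experts feed-forward: each token is routed to its top-1 expert."""
    def __init__(self, d_model, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                       # x: (B, T, D)
        top = self.router(x).softmax(-1).argmax(-1)   # (B, T) expert index per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (top == i).unsqueeze(-1)     # keep only tokens routed to expert i
            out = out + mask * expert(x)
        return out


class DecoderBlock(nn.Module):
    """Pre-norm causal self-attention block with a sparse feed-forward."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = SparseFFN(d_model)

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))


class UnifiedICLModel(nn.Module):
    """Decoder-only model over a shared vocabulary of text + quantized-image tokens."""
    def __init__(self, n_layers=4, max_len=1024):
        super().__init__()
        vocab = TEXT_VOCAB + VISUAL_VOCAB       # unified token space
        self.tok = nn.Embedding(vocab, D_MODEL)
        self.pos = nn.Embedding(max_len, D_MODEL)
        self.blocks = nn.ModuleList(DecoderBlock(D_MODEL) for _ in range(n_layers))
        self.head = nn.Linear(D_MODEL, vocab)   # next token may be text or visual

    def forward(self, ids):                     # ids: (B, T) interleaved sequence
        pos = torch.arange(ids.size(1), device=ids.device)
        x = self.tok(ids) + self.pos(pos)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)                     # logits over both modalities


if __name__ == "__main__":
    # Usage: an interleaved prompt of (image tokens, text tokens) example pairs plus a query.
    # Image tokens are assumed to come from a frozen vector-quantized encoder,
    # offset into the shared vocabulary after the text ids.
    model = UnifiedICLModel()
    img_tokens = torch.randint(TEXT_VOCAB, TEXT_VOCAB + VISUAL_VOCAB, (1, 64))
    txt_tokens = torch.randint(0, TEXT_VOCAB, (1, 16))
    sequence = torch.cat([img_tokens, txt_tokens, img_tokens], dim=1)
    logits = model(sequence)
    print(logits.shape)                         # (1, 144, TEXT_VOCAB + VISUAL_VOCAB)
```

Because every prediction is just a next-token choice over the unified vocabulary, the same decoding loop can emit a caption (text tokens) or a segmentation map (visual tokens to be decoded by the quantizer), which is what enables the multimodal output described above.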
Pages: 13362 - 13372
Number of pages: 11
Related Papers
50 records
  • [1] Towards In-context Scene Understanding
    Balazevic, Ivana
    Steiner, David
    Parthasarathy, Nikhil
    Arandjelovic, Relja
    Henaff, Olivier J.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [2] Instruct Me More! Random Prompting for Visual In-Context Learning
    Zhang, Jiahao
    Wang, Bowen
    Li, Liangzhi
    Nakashima, Yuta
    Nagahara, Hajime
    arXiv, 2023,
  • [3] Unified Demonstration Retriever for In-Context Learning
    Li, Xiaonan
    Lv, Kai
    Yan, Hang
    Lin, Tianyang
    Wei, Zhu
    Ni, Yuan
    Xie, Guotong
    Wang, Xiaoling
    Qiu, Xipeng
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 4644 - 4668
  • [4] Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning
    Wang, Xinshun
    Fang, Zhongbin
    Li, Xia
    Li, Xiangtai
    Chen, Chen
    Liu, Mengyuan
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 2436 - 2446
  • [5] What Makes Good Examples for Visual In-Context Learning?
    Zhang, Yuanhan
    Zhou, Kaiyang
    Liu, Ziwei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [6] In-Context In-Context Learning with Transformer Neural Processes
    Ashman, Matthew
    Diaconu, Cristiana
    Weller, Adrian
    Turner, Richard E.
    SYMPOSIUM ON ADVANCES IN APPROXIMATE BAYESIAN INFERENCE, 2024, 253 : 1 - 29
  • [7] Understanding In-Context Learning via Supportive Pretraining Data
    Han, Xiaochuang
    Simig, Daniel
    Mihaylov, Todor
    Tsvetkov, Yulia
    Celikyilmaz, Asli
    Wang, Tianlu
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 12660 - 12673
  • [8] Exploring Effective Factors for Improving Visual In-Context Learning
    Sun, Yanpeng
    Chen, Qiang
    Wang, Jian
    Wang, Jingdong
    Li, Zechao
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 2147 - 2160
  • [9] An In-Context Schema Understanding Method for Knowledge Base Question Answering
    Liu, Yantao
    Li, Zixuan
    Jin, Xiaolong
    Guo, Yucan
    Bai, Long
    Guan, Saiping
    Guo, Jiafeng
    Cheng, Xueqi
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT I, KSEM 2024, 2024, 14884 : 419 - 434
  • [10] Understanding in-context interaction: An investigation into on-the-go mobile search
    Harvey, Morgan
    Pointon, Matthew
    INFORMATION PROCESSING & MANAGEMENT, 2019, 56 (06)