Towards More Unified In-context Visual Understanding

Cited by: 0
Authors
Sheng, Dianmo [1 ,2 ]
Chen, Dongdong [3 ]
Tan, Zhentao [1 ,2 ]
Liu, Qiankun [4 ]
Chu, Qi [1 ,2 ]
Bao, Jianmin [3 ]
Gong, Tao [1 ,2 ]
Liu, Bin [1 ,2 ]
Xu, Shengwei [5 ]
Yu, Nenghai [1 ,2 ]
Affiliations
[1] Univ Sci & Technol China, Sch Cyber Sci & Technol, Hefei, Anhui, Peoples R China
[2] Anhui Prov Key Lab Digital Secur, Hefei, Anhui, Peoples R China
[3] Microsoft, Redmond, WA USA
[4] Beijing Inst Technol, Beijing, Peoples R China
[5] Beijing Elect Sci & Technol Inst, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
10.1109/CVPR52733.2024.01269
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The rapid advancement of large language models (LLMs) has accelerated the emergence of in-context learning (ICL) as a cutting-edge approach in the natural language processing domain. Recently, ICL has been employed in visual understanding tasks such as semantic segmentation and image captioning, yielding promising results. However, existing visual ICL frameworks cannot produce content across multiple modalities, which limits their potential usage scenarios. To address this issue, we present a new ICL framework for visual understanding with multi-modal output enabled. First, we quantize and embed both text and visual prompts into a unified representational space, structured as interleaved in-context sequences. Then, a decoder-only sparse transformer architecture is employed to perform generative modeling on them, facilitating in-context learning. Thanks to this design, the model is capable of handling in-context vision understanding tasks with multimodal output in a unified pipeline. Experimental results demonstrate that our model achieves competitive performance compared with specialized models and previous ICL baselines. Overall, our research takes a further step toward unified multimodal in-context learning.
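A minimal sketch of the pipeline the abstract describes: text tokens and quantized visual tokens share one unified vocabulary, are interleaved into a single in-context sequence, and are modeled autoregressively by a decoder-only transformer whose feed-forward layers are sparsely routed in a mixture-of-experts style. All class names, vocabulary sizes, layer counts, and the top-1 routing rule below are illustrative assumptions, not the authors' released code.

```python
# Sketch only: stand-in for the "unified token space + decoder-only sparse
# transformer" idea in the abstract; sizes and routing are assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB, VISUAL_VOCAB, D_MODEL = 32000, 8192, 512  # hypothetical sizes


class SparseFFN(nn.Module):
    """Toy mixture-of-experts feed-forward: each token is routed to its top-1 expert."""
    def __init__(self, d_model, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                       # x: (B, T, D)
        top = self.router(x).softmax(-1).argmax(-1)   # (B, T) expert index per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (top == i).unsqueeze(-1)     # keep only tokens routed to expert i
            out = out + mask * expert(x)
        return out


class DecoderBlock(nn.Module):
    """Pre-norm causal self-attention block with a sparse feed-forward."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = SparseFFN(d_model)

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))


class UnifiedICLModel(nn.Module):
    """Decoder-only model over a shared vocabulary of text + quantized-image tokens."""
    def __init__(self, n_layers=4, max_len=1024):
        super().__init__()
        vocab = TEXT_VOCAB + VISUAL_VOCAB       # unified token space
        self.tok = nn.Embedding(vocab, D_MODEL)
        self.pos = nn.Embedding(max_len, D_MODEL)
        self.blocks = nn.ModuleList(DecoderBlock(D_MODEL) for _ in range(n_layers))
        self.head = nn.Linear(D_MODEL, vocab)   # next token may be text or visual

    def forward(self, ids):                     # ids: (B, T) interleaved sequence
        pos = torch.arange(ids.size(1), device=ids.device)
        x = self.tok(ids) + self.pos(pos)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)                     # logits over both modalities


if __name__ == "__main__":
    # Usage: an interleaved prompt of (image tokens, text tokens) example pairs plus a query.
    # Image tokens are assumed to come from a frozen vector-quantized encoder,
    # offset into the shared vocabulary after the text ids.
    model = UnifiedICLModel()
    img_tokens = torch.randint(TEXT_VOCAB, TEXT_VOCAB + VISUAL_VOCAB, (1, 64))
    txt_tokens = torch.randint(0, TEXT_VOCAB, (1, 16))
    sequence = torch.cat([img_tokens, txt_tokens, img_tokens], dim=1)
    logits = model(sequence)
    print(logits.shape)                         # (1, 144, TEXT_VOCAB + VISUAL_VOCAB)
```

Because every prediction is just a next-token choice over the unified vocabulary, the same decoding loop can emit a caption (text tokens) or a segmentation map (visual tokens to be decoded by the quantizer), which is what enables the multimodal output described above.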
Pages: 13362 - 13372
Number of pages: 11
Related Papers
50 records
  • [1] Towards In-context Scene Understanding
    Balazevic, Ivana
    Steiner, David
    Parthasarathy, Nikhil
    Arandjelovic, Relja
    Henaff, Olivier J.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [2] Instruct Me More! Random Prompting for Visual In-Context Learning
    Zhang, Jiahao
    Wang, Bowen
    Li, Liangzhi
    Nakashima, Yuta
    Nagahara, Hajime
    arXiv, 2023,
  • [3] Unified Demonstration Retriever for In-Context Learning
    Li, Xiaonan
    Lv, Kai
    Yan, Hang
    Lin, Tianyang
    Wei, Zhu
    Ni, Yuan
    Xie, Guotong
    Wang, Xiaoling
    Qiu, Xipeng
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 4644 - 4668
  • [4] Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning
    Wang, Xinshun
    Fang, Zhongbin
    Li, Xia
    Li, Xiangtai
    Chen, Chen
    Liu, Mengyuan
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 2436 - 2446
  • [5] What Makes Good Examples for Visual In-Context Learning?
    Zhang, Yuanhan
    Zhou, Kaiyang
    Liu, Ziwei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [6] In-Context In-Context Learning with Transformer Neural Processes
    Ashman, Matthew
    Diaconu, Cristiana
    Weller, Adrian
    Turner, Richard E.
    SYMPOSIUM ON ADVANCES IN APPROXIMATE BAYESIAN INFERENCE, 2024, 253 : 1 - 29
  • [7] Understanding In-Context Learning via Supportive Pretraining Data
    Han, Xiaochuang
    Simig, Daniel
    Mihaylov, Todor
    Tsvetkov, Yulia
    Celikyilmaz, Asli
    Wang, Tianlu
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 12660 - 12673
  • [8] Exploring Effective Factors for Improving Visual In-Context Learning
    Sun, Yanpeng
    Chen, Qiang
    Wang, Jian
    Wang, Jingdong
    Li, Zechao
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 2147 - 2160
  • [9] An In-Context Schema Understanding Method for Knowledge Base Question Answering
    Liu, Yantao
    Li, Zixuan
    Jin, Xiaolong
    Guo, Yucan
    Bai, Long
    Guan, Saiping
    Guo, Jiafeng
    Cheng, Xueqi
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT I, KSEM 2024, 2024, 14884 : 419 - 434
  • [10] Understanding in-context interaction: An investigation into on-the-go mobile search
    Harvey, Morgan
    Pointon, Matthew
    INFORMATION PROCESSING & MANAGEMENT, 2019, 56 (06)