UniCode: Learning a Unified Codebook for Multimodal Large Language Models

Cited by: 0
Authors
Zheng, Sipeng [1]
Zhou, Bohan [2]
Feng, Yicheng [2]
Wang, Ye [1]
Lu, Zongqing [1,2]
Affiliations
[1] Beijing Academy of Artificial Intelligence (BAAI), Beijing, People's Republic of China
[2] Peking University, School of Computer Science, Beijing, People's Republic of China
Source
Keywords
Multimodal Learning; Large Model; Visual Generation
DOI
10.1007/978-3-031-73242-3_24
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose UniCode, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals. This innovation addresses a critical limitation in existing MLLMs: their reliance on a text-only codebook, which restricts MLLMs' ability to generate images and texts in a multimodal context. Towards this end, we propose a language-driven iterative training paradigm, coupled with an in-context pre-training task we term "image decompression", enabling our model to interpret compressed visual data and generate high-quality images. The unified codebook empowers our model to extend visual instruction tuning to non-linguistic generation tasks. Moreover, UniCode is adaptable to diverse stacked quantization approaches in order to compress visual signals into a more compact token representation. Despite using significantly fewer parameters and less data during training, UniCode demonstrates promising capabilities in visual reconstruction and generation. It also achieves performance comparable to leading MLLMs across a spectrum of VQA benchmarks.
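The sketch below is purely illustrative of the general idea a unified codebook implies: a single embedding table whose rows serve both as text-token embeddings and as vector-quantization codes for visual features, so image content and text share one discrete vocabulary. It is a minimal PyTorch sketch under that assumption; the class and method names (UnifiedCodebook, quantize_visual) are hypothetical and it omits the paper's language-driven iterative training, "image decompression" pre-training, and stacked quantization.

# Illustrative sketch only; not the UniCode implementation.
import torch
import torch.nn as nn

class UnifiedCodebook(nn.Module):
    """One embedding table shared by text tokens and quantized visual tokens."""

    def __init__(self, text_vocab_size: int, visual_vocab_size: int, dim: int):
        super().__init__()
        self.text_vocab_size = text_vocab_size
        # Rows [0, text_vocab) hold text-token embeddings;
        # rows [text_vocab, text_vocab + visual_vocab) hold visual codes.
        self.embed = nn.Embedding(text_vocab_size + visual_vocab_size, dim)

    def quantize_visual(self, feats: torch.Tensor) -> torch.Tensor:
        """Nearest-neighbour lookup of continuous visual features (B, N, D)
        against the visual rows, returning ids offset into the shared vocabulary."""
        codes = self.embed.weight[self.text_vocab_size:]                    # (Kv, D)
        dists = torch.cdist(feats, codes.unsqueeze(0).expand(feats.size(0), -1, -1))
        return dists.argmin(dim=-1) + self.text_vocab_size                  # (B, N)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Text ids and quantized visual ids go through the same lookup.
        return self.embed(token_ids)

if __name__ == "__main__":
    codebook = UnifiedCodebook(text_vocab_size=32000, visual_vocab_size=8192, dim=64)
    image_feats = torch.randn(2, 256, 64)          # e.g. 16x16 patch features from a vision encoder
    visual_ids = codebook.quantize_visual(image_feats)
    text_ids = torch.randint(0, 32000, (2, 16))
    mixed = torch.cat([text_ids, visual_ids], dim=1)
    print(codebook(mixed).shape)                   # (2, 272, 64)

Because text and visual ids index the same table, a language model over this vocabulary can in principle emit visual tokens (for image generation) as readily as text tokens, which is the limitation of text-only codebooks that the abstract highlights.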
Pages: 426-443
Number of pages: 18