LMEye: An Interactive Perception Network for Large Language Models

Cited by: 4
Authors
Li, Yunxin [1 ]
Hu, Baotian [1 ]
Chen, Xinyu [1 ]
Ma, Lin [2 ]
Xu, Yong [1 ]
Zhang, Min [1 ]
Affiliations
[1] Harbin Inst Technol, Dept Comp Sci & Technol, Shenzhen 518000, Peoples R China
[2] Meituan, Beijing 100102, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Visualization; Task analysis; Data models; Tuning; Large language models; Training; Cognition; Multimodal large language models (MLLMs); visual-language learning; interactive perception network;
DOI
10.1109/TMM.2024.3428317
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Current efficient approaches to building Multimodal Large Language Models (MLLMs) mainly incorporate visual information into LLMs through a simple visual mapping network such as a linear projection layer, a multilayer perceptron (MLP), or the Q-Former from BLIP-2. Such networks project the image features once and do not model the interaction between the image and the human input. Hence, the obtained visual information, being disconnected from human intention, may be inadequate for LLMs to generate intention-following responses; we refer to it as static visual information. To alleviate this issue, this paper introduces LMEye, a human-like eye with a play-and-plug interactive perception network designed to enable dynamic interaction between LLMs and external visual information. It allows the LLM to request the desired visual information aligned with various human instructions, which we term dynamic visual information acquisition. Specifically, LMEye consists of a simple visual mapping network that provides the basic perception of an image for LLMs, together with additional modules responsible for acquiring requests from LLMs, performing request-based visual information seeking, and transmitting the resulting interacted visual information back to LLMs. In this way, the LLM understands the human query, delivers the corresponding request to the request-based visual information interaction module, and generates a response based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal benchmarks, demonstrating that it significantly improves zero-shot performance on various multimodal tasks compared with previous methods while using fewer parameters. Moreover, we verify its effectiveness and scalability across various language models and on video understanding.
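The distinction the abstract draws between static and dynamic visual information can be illustrated with a minimal sketch: a one-shot linear projection that ignores the query versus a request-conditioned attention pooling over image region features. This is an illustrative toy in pure Python, not the paper's actual implementation; the function names and the use of dot-product attention here are assumptions for exposition only.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear_project(image_feat, weights):
    # "Static" visual mapping: a single matrix-vector product applied once,
    # independent of the human query (e.g. a linear layer or MLP).
    return [sum(w * f for w, f in zip(row, image_feat)) for row in weights]

def request_based_attention(request, region_feats):
    # "Dynamic" acquisition (hypothetical sketch): score each image region
    # by its dot-product similarity to the LLM's request vector, then
    # return the attention-weighted combination of region features.
    scores = [sum(r * v for r, v in zip(request, feat)) for feat in region_feats]
    attn = softmax(scores)
    dim = len(region_feats[0])
    return [sum(a * feat[d] for a, feat in zip(attn, region_feats))
            for d in range(dim)]
```

With two orthogonal region features and a request aligned to the first region, the attended output leans toward that region, whereas the static projection returns the same vector regardless of the request; this is the core contrast between the two pathways.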
Pages: 10952-10964
Page count: 13