Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Cited by: 19
Authors
Li, Hao [1 ]
Zhu, Jinguo [2 ]
Jiang, Xiaohu [3 ]
Zhu, Xizhou [4 ,6 ]
Li, Hongsheng [1 ]
Yuan, Chun [3 ]
Wang, Xiaohua [2 ]
Qiao, Yu [6 ]
Wang, Xiaogang [1 ]
Wang, Wenhai [6 ]
Dai, Jifeng [5 ,6 ]
Affiliations
[1] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Hong Kong, Peoples R China
[2] Xi An Jiao Tong Univ, Xian, Peoples R China
[3] Tsinghua Univ, SIGS, Shenzhen, Peoples R China
[4] SenseTime Res, Beijing, Peoples R China
[5] Tsinghua Univ, Beijing, Peoples R China
[6] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
Funding
National Key R&D Program of China;
DOI
10.1109/CVPR52729.2023.00264
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an effective optimization technique named Task-Balanced Gradient Normalization to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large-batch training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with commonly recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
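The abstract compresses two technical ideas worth unpacking. First, the unified task formulation: every task supplies an encoded input x and a set of candidate targets, and both training and inference reduce to maximum likelihood estimation over the shared encoder-decoder. A rough sketch in our own notation (not taken from the paper):

```latex
% Unified MLE formulation, notation ours: \mathcal{Y} is the
% task-specific candidate target set, y^{*} the ground truth.
\hat{y} = \operatorname*{arg\,max}_{y \in \mathcal{Y}} P_{\theta}(y \mid x),
\qquad
\mathcal{L}(\theta) = -\log P_{\theta}\!\left(y^{*} \mid x\right)
```

Second, Task-Balanced Gradient Normalization is described only at a high level; the sketch below is one plausible reading in PyTorch-style Python, assuming unmixed sampling (each iteration draws a batch from a single task) and a running per-task gradient-norm estimate. All names here (train_step, ema_norms, the (loss_fn, batch) task layout) are ours for illustration, not the authors' implementation.

```python
import torch

def grad_norm(params):
    """L2 norm of the current gradients across all parameters."""
    norms = [p.grad.detach().norm() for p in params if p.grad is not None]
    return torch.norm(torch.stack(norms)).item() if norms else 0.0

def train_step(model, optimizer, tasks, ema_norms, momentum=0.9, eps=1e-8):
    """One unmixed-sampling step: the batch comes from a single task, and
    gradients are rescaled by a running per-task norm estimate so that no
    task dominates the shared parameters (hypothetical reconstruction)."""
    # Unmixed sampling: draw exactly one task for this iteration.
    t = torch.randint(len(tasks), (1,)).item()
    loss_fn, batch = tasks[t]  # assumed task layout: (loss_fn, batch)

    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()

    # Exponential moving average of this task's gradient norm.
    g = grad_norm(model.parameters())
    ema_norms[t] = momentum * ema_norms[t] + (1 - momentum) * g

    # Balancing step: rescale toward the cross-task average norm.
    target = sum(ema_norms) / len(ema_norms)
    scale = target / (ema_norms[t] + eps)
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(scale)

    optimizer.step()
    return loss.item()
```

The motivation for the balancing step follows from the abstract itself: under unmixed sampling each update sees only one task's gradients, so a task with systematically larger gradient norms would otherwise dominate the shared weights; normalizing per-task norms keeps multi-task training stable while preserving the large per-task batches that some tasks require.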
Pages: 2691-2700
Page count: 10
Related Papers
50 records in total
  • [31] 3D Vision and Language Pretraining with Large-Scale Synthetic Data
    Yang, Dejie
    Xu, Zhu
    Mo, Wentao
    Chen, Qingchao
    Huang, Siyuan
    Liu, Yang
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1552 - 1560
  • [32] A large-scale neurocomputational model of spatial cognition integrating memory with vision
    Burkhardt, Micha
    Bergelt, Julia
    Goenner, Lorenz
    Dinkelbach, Helge Ülo
    Beuth, Frederik
    Schwarz, Alex
    Bicanski, Andrej
    Burgess, Neil
    Hamker, Fred H.
    NEURAL NETWORKS, 2023, 167 : 473 - 488
  • [33] Free-Form Instruction Guided Robotic Navigation Path Planning with Large Vision-Language Model
    Du, Yuhao
    Wu, Chengzhong
    Feng, Mingtao
    Luo, Jianqiao
    Zhong, Hang
    Miao, Zhiqiang
    Wang, Yaonan
    INTELLIGENT ROBOTICS AND APPLICATIONS, ICIRA 2024, PT IX, 2025, 15209 : 381 - 396
  • [34] SPREAD: A large-scale, high-fidelity synthetic dataset for multiple forest vision tasks
    Feng, Zhengpeng
    She, Yihang
    Keshav, Srinivasan
    ECOLOGICAL INFORMATICS, 2025, 87
  • [35] Understanding Contexts Inside Robot and Human Manipulation Tasks through Vision-Language Model and Ontology System in Video Streams
    Jiang, Chen
    Dehghan, Masood
    Jagersand, Martin
    2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 8366 - 8372
  • [36] Perceiving the fine-scale urban poverty using street view images through a vision-language model
    Wu, Chao
    Liang, Yongxiang
    Zhao, Minwei
    Teng, Mingda
    Yue, Han
    Ye, Yu
    SUSTAINABLE CITIES AND SOCIETY, 2025, 123
  • [37] A joint reconstruction and model selection approach for large-scale linear inverse modeling (msHyBR v2)
    Landman, Malena Sabate
    Chung, Julianne
    Jiang, Jiahua
    Miller, Scot M.
    Saibaba, Arvind K.
    GEOSCIENTIFIC MODEL DEVELOPMENT, 2024, 17 (23) : 8853 - 8872
  • [38] RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery
    Bazi, Yakoub
    Bashmal, Laila
    Al Rahhal, Mohamad Mahmoud
    Ricci, Riccardo
    Melgani, Farid
    REMOTE SENSING, 2024, 16 (09)
  • [39] Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images
    Khan, Zaid
    Kumar, Vijay B. G.
    Schulter, Samuel
    Yu, Xiang
    Fu, Yun
    Chandraker, Manmohan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 15005 - 15015
  • [40] Vision with Equiluminant Color Contrast. 2. A Large-Scale Technique and Observations
    Cavanagh, P
    Adelson, EH
    Heard, P
    PERCEPTION, 1992, 21 (02) : 219 - 226