Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Cited by: 19
Authors
Li, Hao [1 ]
Zhu, Jinguo [2 ]
Jiang, Xiaohu [3 ]
Zhu, Xizhou [4 ,6 ]
Li, Hongsheng [1 ]
Yuan, Chun [3 ]
Wang, Xiaohua [2 ]
Qiao, Yu [6 ]
Wang, Xiaogang [1 ]
Wang, Wenhai [6 ]
Dai, Jifeng [5 ,6 ]
Affiliations
[1] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Hong Kong, Peoples R China
[2] Xi An Jiao Tong Univ, Xian, Peoples R China
[3] Tsinghua Univ, SIGS, Beijing, Peoples R China
[4] SenseTime Res, Beijing, Peoples R China
[5] Tsinghua Univ, Beijing, Peoples R China
[6] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
Funding
National Key R&D Program of China;
Keywords
DOI
10.1109/CVPR52729.2023.00264
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an effective optimization technique named Task-Balanced Gradient Normalization to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large-batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with commonly-recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
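The abstract names Task-Balanced Gradient Normalization but gives no formula. As a rough illustration only, not the paper's actual algorithm, the sketch below shows the general idea of balancing per-task gradient magnitudes under unmixed (one-task-per-step) sampling: each task's gradient is rescaled toward a shared target norm before the update, so no single task's loss scale dominates. The helper `task_balanced_grad_norm` and its weighting scheme are hypothetical.

```python
import numpy as np

def task_balanced_grad_norm(task_grads, task_weights, eps=1e-12):
    """Illustrative gradient balancing across tasks (not the paper's exact method).

    task_grads:   dict of task name -> flattened gradient vector (np.ndarray)
    task_weights: dict of task name -> relative task weight (should sum to 1)
    Returns a combined gradient in which each task contributes a gradient of
    norm task_weights[t] * target, where target is the weighted mean norm.
    """
    # Per-task gradient norms.
    norms = {t: np.linalg.norm(g) for t, g in task_grads.items()}
    # Shared target norm: weighted average of the per-task norms.
    target = sum(task_weights[t] * norms[t] for t in task_grads)
    # Rescale every task gradient to the target norm, weighted by its task weight.
    combined = np.zeros_like(next(iter(task_grads.values())))
    for t, g in task_grads.items():
        scale = task_weights[t] * target / (norms[t] + eps)
        combined += scale * g
    return combined
```

With two equally weighted tasks whose raw gradient norms differ by 10x, both end up contributing equally to the combined update, which is the stabilizing effect the abstract attributes to the technique.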
Pages: 2691-2700
Page count: 10
Related Papers
50 records in total
  • [1] A generalist vision-language foundation model for diverse biomedical tasks
    Zhang, Kai
    Zhou, Rong
    Adhikarla, Eashan
    Yan, Zhiling
    Liu, Yixin
    Yu, Jun
    Liu, Zhengliang
    Chen, Xun
    Davison, Brian D.
    Ren, Hui
    Huang, Jing
    Chen, Chen
    Zhou, Yuyin
    Fu, Sunyang
    Liu, Wei
    Liu, Tianming
    Li, Xiang
    Chen, Yong
    He, Lifang
    Zou, James
    Li, Quanzheng
    Liu, Hongfang
    Sun, Lichao
    NATURE MEDICINE, 2024, 30 (11) : 3129 - 3141
  • [2] Automated Quality Evaluation of Large-Scale Benchmark Datasets for Vision-Language Tasks
    Zhao, Ruibin
    Xie, Zhiwei
    Zhuang, Yipeng
    Yu, Philip L. H.
    INTERNATIONAL JOURNAL OF NEURAL SYSTEMS, 2024, 34 (03)
  • [3] Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
    Sammani, Fawaz
    Deligiannis, Nikos
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 4636 - 4641
  • [4] RS5M and GeoRSCLIP: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing
    Zhang, Zilun
    Zhao, Tiancheng
    Guo, Yulong
    Yin, Jianwei
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [5] NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
    Sammani, Fawaz
    Mukherjee, Tanmoy
    Deligiannis, Nikos
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 8312 - 8322
  • [6] Vary: Scaling up the Vision Vocabulary for Large Vision-Language Model
    Wei, Haoran
    Kong, Lingyu
    Chen, Jinyue
    Zhao, Liang
    Ge, Zheng
    Yang, Jinrong
    Sun, Jianjian
    Han, Chunrui
    Zhang, Xiangyu
    COMPUTER VISION-ECCV 2024, PT IV, 2025, 15062 : 408 - 424
  • [7] Stable and low-precision training for large-scale vision-language models
    Wortsman, Mitchell
    Dettmers, Tim
    Zettlemoyer, Luke
    Morcos, Ari
    Farhadi, Ali
    Schmidt, Ludwig
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [8] SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model
    Zhan, Yang
    Xiong, Zhitong
    Yuan, Yuan
    ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 2025, 221 : 64 - 77
  • [9] e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce
    Shin, Wonyoung
    Park, Jonghun
    Woo, Taekang
    Cho, Yongwoo
    Oh, Kwangjin
    Song, Hwanjun
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 3484 - 3494
  • [10] Pathologyvlm: a large vision-language model for pathology image understanding
    Dai, Dawei
    Zhang, Yuanhui
    Yang, Qianlan
    Xu, Long
    Shen, Xiaojing
    Xia, Shuyin
    Wang, Guoyin
    ARTIFICIAL INTELLIGENCE REVIEW, 58 (6)