Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Cited by: 19
Authors
Li, Hao [1]
Zhu, Jinguo [2]
Jiang, Xiaohu [3]
Zhu, Xizhou [4,6]
Li, Hongsheng [1]
Yuan, Chun [3]
Wang, Xiaohua [2]
Qiao, Yu [6]
Wang, Xiaogang [1]
Wang, Wenhai [6]
Dai, Jifeng [5,6]
Affiliations
[1] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Hong Kong, Peoples R China
[2] Xi An Jiao Tong Univ, Xian, Peoples R China
[3] Tsinghua Univ, SIGS, Shenzhen, Peoples R China
[4] SenseTime Res, Beijing, Peoples R China
[5] Tsinghua Univ, Beijing, Peoples R China
[6] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
Funding
National Key R&D Program of China
Keywords
DOI
10.1109/CVPR52729.2023.00264
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are then transformed by a task-agnostic decoder, and different tasks are formulated as a unified maximum-likelihood estimation problem. We further propose an effective optimization technique named Task-Balanced Gradient Normalization to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks that require large-batch training. After joint training on various tasks, Uni-Perceiver v2 can directly handle downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with commonly recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
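The abstract names Task-Balanced Gradient Normalization only at a high level. Below is a minimal, hypothetical PyTorch sketch of one way such a scheme could work under unmixed sampling (a single task per iteration): each task's gradients are rescaled by a running estimate of that task's gradient norm so that no task dominates the shared weights. The class name TBGN, the EMA momentum, the placeholder loss, and the exact rescaling rule are illustrative assumptions, not the paper's verbatim algorithm.

```python
# Hypothetical sketch of task-balanced gradient normalization with
# unmixed sampling; details are assumptions, not the paper's algorithm.
import random
import torch


class TBGN:
    """Tracks a per-task running gradient norm and rescales gradients."""

    def __init__(self, task_names, momentum=0.9, eps=1e-8):
        self.momentum = momentum
        self.eps = eps
        self.running_norm = {t: None for t in task_names}

    def normalize_(self, params, task):
        # Global L2 norm of the current task's gradients.
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads]))
        # Update this task's running gradient-norm estimate (EMA).
        prev = self.running_norm[task]
        ema = norm if prev is None else self.momentum * prev + (1 - self.momentum) * norm
        self.running_norm[task] = ema.detach()
        # Rescale so each task contributes a comparable gradient magnitude.
        scale = 1.0 / (ema.detach() + self.eps)
        for g in grads:
            g.mul_(scale)


# Usage with unmixed sampling: one task per iteration, shared weights.
model = torch.nn.Linear(16, 16)  # stand-in for the shared encoder-decoder
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
tbgn = TBGN(["detection", "captioning", "retrieval"])

for step in range(100):
    task = random.choice(list(tbgn.running_norm))  # unmixed: single task
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()  # placeholder for a per-task MLE loss
    opt.zero_grad()
    loss.backward()
    tbgn.normalize_(model.parameters(), task)
    opt.step()
```

Normalizing by a running per-task gradient norm keeps tasks with very different loss scales (e.g., detection versus retrieval) from dominating the shared parameters when batches are never mixed across tasks, which is the failure mode the abstract's unmixed sampling strategy would otherwise expose.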
Pages: 2691-2700
Number of pages: 10