Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Cited by: 19

Authors:

Li, Hao [1]

Zhu, Jinguo [2]

Jiang, Xiaohu [3]

Zhu, Xizhou [4,6]

Li, Hongsheng [1]

Yuan, Chun [3]

Wang, Xiaohua [2]

Qiao, Yu [6]

Wang, Xiaogang [1]

Wang, Wenhai [6]

Dai, Jifeng [5,6]

Affiliations:

[1] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Hong Kong, Peoples R China

[2] Xi An Jiao Tong Univ, Xian, Peoples R China

[3] Tsinghua Univ, SIGS, Beijing, Peoples R China

[4] SenseTime Res, Beijing, Peoples R China

[5] Tsinghua Univ, Beijing, Peoples R China

[6] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023

Funding:

National Key Research and Development Program of China;

Keywords:

DOI:

10.1109/CVPR52729.2023.00264

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an effective optimization technique named Task-Balanced Gradient Normalization to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with the commonly recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
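
The abstract names Task-Balanced Gradient Normalization and an unmixed sampling strategy but gives no formulas. The sketch below is one plausible reading, not the authors' implementation: it assumes that "unmixed sampling" means each iteration draws a single task, and that the gradient from that task is rescaled to unit L2 norm and multiplied by a per-task weight before the update. The toy quadratic losses and the names `make_quadratic_grad`, `sample_task`, `normalize_gradient`, `train_step`, and `task_weights` are all hypothetical and exist only for illustration.

```python
import math
import random


def make_quadratic_grad(scale, target):
    """Gradient of the toy loss 0.5 * scale * ||params - target||^2 (illustrative only)."""
    def grad_fn(params):
        return [scale * (p - t) for p, t in zip(params, target)]
    return grad_fn


def sample_task(task_probs, rng):
    """Unmixed sampling (assumed meaning): pick exactly one task for this iteration."""
    tasks = list(task_probs)
    weights = [task_probs[t] for t in tasks]
    return rng.choices(tasks, weights=weights, k=1)[0]


def normalize_gradient(grad, eps=1e-12):
    """Rescale a gradient vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / (norm + eps) for g in grad]


def train_step(params, task, grad_fns, task_weights, lr):
    """One update from a single task, using its normalized, task-weighted gradient."""
    grad = grad_fns[task](params)
    unit = normalize_gradient(grad)
    step = lr * task_weights[task]
    return [p - step * u for p, u in zip(params, unit)]


def run_demo(steps=100, seed=0):
    """Alternate between two tasks whose raw gradient scales differ by a factor of 1e6."""
    rng = random.Random(seed)
    grad_fns = {
        "task_a": make_quadratic_grad(1000.0, [1.0, 0.0]),
        "task_b": make_quadratic_grad(0.001, [0.0, 1.0]),
    }
    task_probs = {"task_a": 0.5, "task_b": 0.5}
    task_weights = {"task_a": 1.0, "task_b": 1.0}
    params = [0.0, 0.0]
    for _ in range(steps):
        task = sample_task(task_probs, rng)
        params = train_step(params, task, grad_fns, task_weights, lr=0.01)
    return params
```

Under these assumptions, the size of each update depends only on the learning rate and the task weight, not on the raw gradient magnitude, so a task with very large gradients cannot dominate the shared parameters on the iterations where it is sampled.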


Pages: 2691-2700

Number of pages: 10

