Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Cited by: 19

Authors:

Li, Hao ^{[1]}

Zhu, Jinguo ^{[2]}

Jiang, Xiaohu ^{[3]}

Zhu, Xizhou ^{[4,6]}

Li, Hongsheng ^{[1]}

Yuan, Chun ^{[3]}

Wang, Xiaohua ^{[2]}

Qiao, Yu ^{[6]}

Wang, Xiaogang ^{[1]}

Wang, Wenhai ^{[6]}

Dai, Jifeng ^{[5,6]}

Affiliations:

[1] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Hong Kong, Peoples R China

[2] Xi An Jiao Tong Univ, Xian, Peoples R China

[3] Tsinghua Univ, SIGS, Beijing, Peoples R China

[4] SenseTime Res, Beijing, Peoples R China

[5] Tsinghua Univ, Beijing, Peoples R China

[6] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023

Funding:

National Key Research and Development Program of China;

Keywords:

DOI:

10.1109/CVPR52729.2023.00264

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an effective optimization technique named Task-Balanced Gradient Normalization to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with the commonly recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.


Pages: 2691-2700

Page count: 10

50 items in total

[31] 3D Vision and Language Pretraining with Large-Scale Synthetic Data
Yang, Dejie
Xu, Zhu
Mo, Wentao
Chen, Qingchao
Huang, Siyuan
Liu, Yang
PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1552 - 1560
[32] A large-scale neurocomputational model of spatial cognition integrating memory with vision
Burkhardt, Micha
Bergelt, Julia
Goenner, Lorenz
Dinkelbach, Helge Ulo
Beuth, Frederik
Schwarz, Alex
Bicanski, Andrej
Burgess, Neil
Hamker, Fred H.
NEURAL NETWORKS, 2023, 167 : 473 - 488
[33] Free-Form Instruction Guided Robotic Navigation Path Planning with Large Vision-Language Model
Du, Yuhao
Wu, Chengzhong
Feng, Mingtao
Luo, Jianqiao
Zhong, Hang
Miao, Zhiqiang
Wang, Yaonan
INTELLIGENT ROBOTICS AND APPLICATIONS, ICIRA 2024, PT IX, 2025, 15209 : 381 - 396
[34] SPREAD: A large-scale, high-fidelity synthetic dataset for multiple forest vision tasks
Feng, Zhengpeng
She, Yihang
Keshav, Srinivasan
ECOLOGICAL INFORMATICS, 2025, 87
[35] Understanding Contexts Inside Robot and Human Manipulation Tasks through Vision-Language Model and Ontology System in Video Streams
Jiang, Chen
Dehghan, Masood
Jagersand, Martin
2020 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2020, : 8366 - 8372
[36] Perceiving the fine-scale urban poverty using street view images through a vision-language model
Wu, Chao
Liang, Yongxiang
Zhao, Minwei
Teng, Mingda
Yue, Han
Ye, Yu
SUSTAINABLE CITIES AND SOCIETY, 2025, 123
[37] A joint reconstruction and model selection approach for large-scale linear inverse modeling (msHyBR v2)
Landman, Malena Sabate
Chung, Julianne
Jiang, Jiahua
Miller, Scot M.
Saibaba, Arvind K.
GEOSCIENTIFIC MODEL DEVELOPMENT, 2024, 17 (23) : 8853 - 8872
[38] RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery
Bazi, Yakoub
Bashmal, Laila
Al Rahhal, Mohamad Mahmoud
Ricci, Riccardo
Melgani, Farid
REMOTE SENSING, 2024, 16 (09)
[39] Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images
Khan, Zaid
Kumar, Vijay B. G.
Schulter, Samuel
Yu, Xiang
Fu, Yun
Chandraker, Manmohan
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 15005 - 15015
[40] VISION WITH EQUILUMINANT COLOR CONTRAST .2. A LARGE-SCALE TECHNIQUE AND OBSERVATIONS
CAVANAGH, P
ADELSON, EH
HEARD, P
PERCEPTION, 1992, 21 (02) : 219 - 226
