Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Cited by: 19
Authors
Li, Hao [1]
Zhu, Jinguo [2]
Jiang, Xiaohu [3]
Zhu, Xizhou [4,6]
Li, Hongsheng [1]
Yuan, Chun [3]
Wang, Xiaohua [2]
Qiao, Yu [6]
Wang, Xiaogang [1]
Wang, Wenhai [6]
Dai, Jifeng [5,6]
Affiliations
[1] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Hong Kong, Peoples R China
[2] Xi An Jiao Tong Univ, Xian, Peoples R China
[3] Tsinghua Univ, Shenzhen Int Grad Sch SIGS, Shenzhen, Peoples R China
[4] SenseTime Res, Beijing, Peoples R China
[5] Tsinghua Univ, Beijing, Peoples R China
[6] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
Funding
National Key Research and Development Program of China;
DOI
10.1109/CVPR52729.2023.00264
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder, and different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an effective optimization technique named Task-Balanced Gradient Normalization to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large-batch training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with commonly recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
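As a sketch of the unified maximum likelihood formulation mentioned in the abstract: each task is cast as selecting the most likely target among a set of candidates, scored by the similarity between the decoded input representation and the candidate target representations. The snippet below is a minimal illustration only, assuming cosine-similarity logits with a temperature; the function unified_nll and all shapes are hypothetical and not taken from the paper's code.

import torch
import torch.nn.functional as F

def unified_nll(input_repr, candidate_reprs, target_idx, temperature=0.07):
    # input_repr: (d,) decoded representation of the input produced by a
    # task-agnostic decoder. candidate_reprs: (num_candidates, d) encoded
    # representations of the possible targets: class names for
    # classification, vocabulary tokens for a captioning step, or the
    # paired modality for retrieval.
    logits = F.cosine_similarity(input_repr.unsqueeze(0),
                                 candidate_reprs, dim=-1) / temperature
    # Negative log-likelihood of the correct candidate: the same loss form
    # applies regardless of which task produced the candidate set.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_idx]))

Under this formulation, classification, generation, and retrieval differ only in how the candidate set is built, which is what lets a single decoder and loss serve all tasks.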
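The abstract names Task-Balanced Gradient Normalization but gives no algorithmic detail. The sketch below shows one plausible realization under unmixed sampling (each iteration draws data from a single task): track an exponential moving average of each task's gradient norm and rescale the current gradient toward a common target magnitude. The class TaskBalancedGradNorm, its EMA bookkeeping, and the mean-norm target are assumptions for illustration, not the paper's exact method.

import torch

class TaskBalancedGradNorm:
    # Rescales each iteration's gradient so that no single task dominates
    # training when every iteration samples data from only one task.
    def __init__(self, task_names, momentum=0.9, eps=1e-8):
        self.momentum = momentum
        self.eps = eps
        self.ema_norm = {t: None for t in task_names}  # per-task norm EMA

    def rescale_(self, params, task):
        grads = [p.grad for p in params if p.grad is not None]
        if not grads:
            return
        # Global L2 norm of the current iteration's gradient.
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        # Update this task's running estimate of its typical gradient norm.
        prev = self.ema_norm[task]
        self.ema_norm[task] = (norm if prev is None
                               else self.momentum * prev
                               + (1.0 - self.momentum) * norm)
        # Rescale toward the mean EMA norm across tasks seen so far, so
        # every task contributes updates of comparable magnitude.
        seen = [v for v in self.ema_norm.values() if v is not None]
        target = sum(seen) / len(seen)
        scale = target / (norm + self.eps)
        for g in grads:
            g.mul_(scale)

In a training loop, one would call loss.backward() for the sampled task and then balancer.rescale_(model.parameters(), task) before optimizer.step(), keeping per-task update magnitudes comparable even when tasks have very different loss scales.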
Pages: 2691-2700
Number of pages: 10