Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Cited by: 19
Authors
Li, Hao [1]
Zhu, Jinguo [2]
Jiang, Xiaohu [3]
Zhu, Xizhou [4,6]
Li, Hongsheng [1]
Yuan, Chun [3]
Wang, Xiaohua [2]
Qiao, Yu [6]
Wang, Xiaogang [1]
Wang, Wenhai [6]
Dai, Jifeng [5,6]
Affiliations
[1] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Hong Kong, Peoples R China
[2] Xi An Jiao Tong Univ, Xian, Peoples R China
[3] Tsinghua Univ, Shenzhen Int Grad Sch SIGS, Shenzhen, Peoples R China
[4] SenseTime Res, Beijing, Peoples R China
[5] Tsinghua Univ, Beijing, Peoples R China
[6] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
Funding
National Key Research and Development Program of China;
DOI
10.1109/CVPR52729.2023.00264
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder, and different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an effective optimization technique named Task-Balanced Gradient Normalization to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large-batch training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with commonly recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
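As a sketch of the unified maximum likelihood formulation mentioned in the abstract: each task is cast as selecting the most likely target among a set of candidates, scored by the similarity between the decoded input representation and the candidate target representations. The snippet below is a minimal illustration only, assuming cosine-similarity logits with a temperature; the function unified_nll and all shapes are hypothetical and not taken from the paper's code.

import torch
import torch.nn.functional as F

def unified_nll(input_repr, candidate_reprs, target_idx, temperature=0.07):
    # input_repr: (d,) decoded representation of the input produced by a
    # task-agnostic decoder. candidate_reprs: (num_candidates, d) encoded
    # representations of the possible targets: class names for
    # classification, vocabulary tokens for a captioning step, or the
    # paired modality for retrieval.
    logits = F.cosine_similarity(input_repr.unsqueeze(0),
                                 candidate_reprs, dim=-1) / temperature
    # Negative log-likelihood of the correct candidate: the same loss form
    # applies regardless of which task produced the candidate set.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_idx]))

Under this formulation, classification, generation, and retrieval differ only in how the candidate set is built, which is what lets a single decoder and loss serve all tasks.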
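The abstract names Task-Balanced Gradient Normalization but gives no algorithmic detail. The sketch below shows one plausible realization under unmixed sampling (each iteration draws data from a single task): track an exponential moving average of each task's gradient norm and rescale the current gradient toward a common target magnitude. The class TaskBalancedGradNorm, its EMA bookkeeping, and the mean-norm target are assumptions for illustration, not the paper's exact method.

import torch

class TaskBalancedGradNorm:
    # Rescales each iteration's gradient so that no single task dominates
    # training when every iteration samples data from only one task.
    def __init__(self, task_names, momentum=0.9, eps=1e-8):
        self.momentum = momentum
        self.eps = eps
        self.ema_norm = {t: None for t in task_names}  # per-task norm EMA

    def rescale_(self, params, task):
        grads = [p.grad for p in params if p.grad is not None]
        if not grads:
            return
        # Global L2 norm of the current iteration's gradient.
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        # Update this task's running estimate of its typical gradient norm.
        prev = self.ema_norm[task]
        self.ema_norm[task] = (norm if prev is None
                               else self.momentum * prev
                               + (1.0 - self.momentum) * norm)
        # Rescale toward the mean EMA norm across tasks seen so far, so
        # every task contributes updates of comparable magnitude.
        seen = [v for v in self.ema_norm.values() if v is not None]
        target = sum(seen) / len(seen)
        scale = target / (norm + self.eps)
        for g in grads:
            g.mul_(scale)

In a training loop, one would call loss.backward() for the sampled task and then balancer.rescale_(model.parameters(), task) before optimizer.step(), keeping per-task update magnitudes comparable even when tasks have very different loss scales.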
Pages: 2691-2700
Number of pages: 10