InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Cited by: 282
Authors
Wang, Wenhai [1]
Dai, Jifeng [1,2]
Chen, Zhe [1,3]
Huang, Zhenhang [1]
Li, Zhiqi [1,3]
Zhu, Xizhou [4]
Hu, Xiaowei [1]
Lu, Tong [3]
Lu, Lewei [4]
Li, Hongsheng [5]
Wang, Xiaogang [4,5]
Qiao, Yu [1]
Affiliations
[1] Shanghai AI Lab, Shanghai, Peoples R China
[2] Tsinghua Univ, Beijing, Peoples R China
[3] Nanjing Univ, Nanjing, Peoples R China
[4] SenseTime Res, Hong Kong, Peoples R China
[5] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1109/CVPR52729.2023.01385
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs.
Pages: 14408-14419
Page count: 12