InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Cited by: 282
Authors
Wang, Wenhai [1]
Dai, Jifeng [1,2]
Chen, Zhe [1,3]
Huang, Zhenhang [1]
Li, Zhiqi [1,3]
Zhu, Xizhou [4]
Hu, Xiaowei [1]
Lu, Tong [3]
Lu, Lewei [4]
Li, Hongsheng [5]
Wang, Xiaogang [4,5]
Qiao, Yu [1]
Affiliations
[1] Shanghai AI Lab, Shanghai, Peoples R China
[2] Tsinghua Univ, Beijing, Peoples R China
[3] Nanjing Univ, Nanjing, Peoples R China
[4] SenseTime Res, Hong Kong, Peoples R China
[5] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1109/CVPR52729.2023.01385
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs.
Pages: 14408-14419
Page count: 12