Green Hierarchical Vision Transformer for Masked Image Modeling

Citations: 0
Authors
Huang, Lang [1]
You, Shan [2]
Zheng, Mingkai [3]
Wang, Fei [2]
Qian, Chen [2]
Yamasaki, Toshihiko [1]
Affiliations
[1] University of Tokyo, Tokyo, Japan
[2] SenseTime Research, Hong Kong, People's Republic of China
[3] University of Sydney, Sydney, NSW, Australia
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of three key designs. First, for window attention, we propose a Group Window Attention scheme following the divide-and-conquer strategy. To mitigate the quadratic complexity of self-attention w.r.t. the number of patches, group attention encourages a uniform partition in which the visible patches within each local window of arbitrary size are gathered into groups of equal size; masked self-attention is then performed within each group. Second, we further improve the grouping strategy via a dynamic programming algorithm that minimizes the overall computation cost of attention on the grouped patches. Third, for the convolution layers, we convert them to sparse convolutions [25, 13], which work seamlessly with sparse data, i.e., the visible patches in MIM. As a result, MIM can now work on most, if not all, hierarchical ViTs in a green and efficient way. For example, we can train hierarchical ViTs, e.g., Swin Transformer [49] and Twins Transformer [14], about 2.7× faster and reduce GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and superior performance on downstream COCO object detection benchmarks.
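The grouping idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it swaps the paper's dynamic programming step for a simple first-fit-decreasing heuristic, and every name here (`pack_windows`, `attention_cost`, `best_group_size`, `group_mask`) is hypothetical. It assumes each local window reports how many visible patches it holds, packs windows into groups of a fixed capacity, estimates attention cost as (number of groups) × (group size)², and builds a mask so that each patch attends only to patches from its own window.

```python
from typing import List, Optional, Tuple


def pack_windows(visible_counts: List[int], group_size: int) -> List[List[int]]:
    """Pack window indices into groups holding at most `group_size` visible
    patches each (first-fit decreasing; a stand-in for the paper's DP)."""
    groups: List[List[int]] = []
    loads: List[int] = []
    order = sorted(range(len(visible_counts)), key=lambda i: -visible_counts[i])
    for w in order:
        n = visible_counts[w]
        if n == 0:
            continue  # fully masked window: nothing to attend over
        if n > group_size:
            raise ValueError("a window must fit inside a single group")
        for g, load in enumerate(loads):
            if load + n <= group_size:
                groups[g].append(w)
                loads[g] += n
                break
        else:
            groups.append([w])
            loads.append(n)
    return groups


def attention_cost(num_groups: int, group_size: int) -> int:
    """Self-attention cost is quadratic in the (padded) group size."""
    return num_groups * group_size ** 2


def best_group_size(
    visible_counts: List[int], candidates: List[int]
) -> Optional[Tuple[int, int]]:
    """Return (group_size, cost) with the lowest estimated attention cost."""
    best: Optional[Tuple[int, int]] = None
    for g in candidates:
        if g < max(visible_counts):
            continue  # too small to hold the largest window
        cost = attention_cost(len(pack_windows(visible_counts, g)), g)
        if best is None or cost < best[1]:
            best = (g, cost)
    return best


def group_mask(
    group: List[int], visible_counts: List[int], group_size: int
) -> List[List[bool]]:
    """Boolean attention mask for one group: True where query i may attend
    to key j, i.e. both come from the same window and neither is padding."""
    owner: List[int] = []
    for w in group:
        owner += [w] * visible_counts[w]
    owner += [-1] * (group_size - len(owner))  # -1 marks padding slots
    return [[(a == b and a != -1) for b in owner] for a in owner]


if __name__ == "__main__":
    counts = [3, 1, 2, 2]  # visible patches per window (made-up numbers)
    print(pack_windows(counts, 4))
    print(best_group_size(counts, [3, 4, 8]))
```

In this toy setting, a smaller group size can win even though it needs more groups, because the quadratic term dominates; this trade-off is what the paper's optimization step is described as resolving.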
Pages: 14