Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

Cited by: 97
Authors
Gu, Jiaqi [1 ]
Kwon, Hyoukjun [2 ]
Wang, Dilin [2 ]
Ye, Wei [2 ]
Li, Meng [2 ]
Chen, Yu-Hsin [2 ]
Lai, Liangzhen [2 ]
Chandra, Vikas [2 ]
Pan, David Z. [1 ]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Meta Platforms Inc, Menlo Pk, CA USA
DOI
10.1109/CVPR52688.2022.01178
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Vision Transformers (ViTs) have emerged with superior performance on computer vision tasks compared to convolutional neural network (CNN)-based models. However, ViTs are mainly designed for image classification and generate single-scale, low-resolution representations, which makes dense prediction tasks such as semantic segmentation challenging for them. Therefore, we propose HRViT, which enhances ViTs to learn semantically rich and spatially precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs. We balance the model performance and efficiency of HRViT through several branch-block co-optimization techniques. Specifically, we explore heterogeneous branch designs, reduce the redundancy in linear layers, and augment the attention block with enhanced expressiveness. These approaches enable HRViT to push the Pareto frontier of performance and efficiency on semantic segmentation to a new level, as our evaluation results on ADE20K and Cityscapes show. HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes, surpassing state-of-the-art MiT and CSWin backbones with an average improvement of +1.78 mIoU, 28% parameter saving, and 21% FLOPs reduction, demonstrating the potential of HRViT as a strong vision backbone for semantic segmentation. Our code is publicly available.
Pages: 12084-12093
Number of pages: 10
Related papers
50 in total
  • [1] Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
    Zhang, Pengchuan
    Dai, Xiyang
    Yang, Jianwei
    Xiao, Bin
    Yuan, Lu
    Zhang, Lei
    Gao, Jianfeng
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 2978 - 2988
  • [2] Local-enhanced multi-scale aggregation swin transformer for semantic segmentation of high-resolution remote sensing images
    Ren, Dong
    Li, Falin
    Sun, Hang
    Liu, Li
    Ren, Shun
    Yu, Mei
    [J]. INTERNATIONAL JOURNAL OF REMOTE SENSING, 2024, 45 (01) : 101 - 120
  • [3] Learning Dual Multi-Scale Manifold Ranking for Semantic Segmentation of High-Resolution Images
    Zhang, Mi
    Hu, Xiangyun
    Zhao, Like
    Lv, Ye
    Luo, Min
    Pang, Shiyan
    [J]. REMOTE SENSING, 2017, 9 (05)
  • [4] Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection
    Deng, Xinhao
    Zhang, Pingping
    Liu, Wei
    Lu, Huchuan
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 7413 - 7423
  • [5] Unsupervised Multi-Scale Hybrid Feature Extraction Network for Semantic Segmentation of High-Resolution Remote Sensing Images
    Song, Wanying
    Nie, Fangxin
    Wang, Chi
    Jiang, Yinyin
    Wu, Yan
    [J]. Remote Sensing, 2024, 16 (20)
  • [6] MUSTER: A Multi-Scale Transformer-Based Decoder for Semantic Segmentation
    Xu, Jing
    Shi, Wentao
    Gao, Pan
    Li, Qizhu
    Wang, Zhengwei
    [J]. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024
  • [7] A mean shift multi-scale segmentation for high-resolution remote sensing images
    Shen, Zhanfeng
    Luo, Jiancheng
    Hu, Xiaodong
    Sun, Weigang
    [J]. Wuhan Daxue Xuebao (Xinxi Kexue Ban)/Geomatics and Information Science of Wuhan University, 2010, 35 (03) : 313 - 316
  • [8] Crop classification in high-resolution remote sensing images based on multi-scale feature fusion semantic segmentation model
    Lu, Tingyu
    Gao, Meixiang
    Wang, Lei
    [J]. FRONTIERS IN PLANT SCIENCE, 2023, 14
  • [9] ASPP+-LANet: A Multi-Scale Context Extraction Network for Semantic Segmentation of High-Resolution Remote Sensing Images
    Hu, Lei
    Zhou, Xun
    Ruan, Jiachen
    Li, Supeng
    [J]. REMOTE SENSING, 2024, 16 (06)
  • [10] Multi-scale Feature Fusion and Transformer Network for urban green space segmentation from high-resolution remote sensing images
    Cheng, Yong
    Wang, Wei
    Ren, Zhoupeng
    Zhao, Yingfen
    Liao, Yilan
    Ge, Yong
    Wang, Jun
    He, Jiaxin
    Gu, Yakang
    Wang, Yixuan
    Zhang, Wenjie
    Zhang, Ce
    [J]. INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2023, 124