A Local-Global Interactive Vision Transformer for Aerial Scene Classification

Cited by: 6
Authors
Peng, Ting [1]
Yi, Jingjun [2]
Fang, Yuan [3]
Affiliations
[1] Ningbo Univ Finance & Econ, Coll Digital Technol & Engn, Ningbo 315175, Peoples R China
[2] Wuhan Univ, Sch Remote Sensing & Informat Engn, Wuhan 430079, Peoples R China
[3] Naval Univ Engn, Coll Power Engn, Wuhan 430033, Peoples R China
Keywords
Feature extraction; Semantics; Transformers; Task analysis; Remote sensing; Pipelines; Neural networks; Aerial scene classification; Feature interaction learning; Local-global representation; Semantic consistency loss; Vision transformer (ViT)
DOI
10.1109/LGRS.2023.3266008
Chinese Library Classification (CLC) number
P3 [Geophysics]; P59 [Geochemistry]
Discipline classification code
0708; 070902
Abstract
Generic image classification has been widely studied in the past decade. However, for the bird-view aerial images, aerial scene classification remains challenging due to the dramatic variation of the scale and object size. Existing methods usually learn the aerial scene representation from the convolutional neural networks (CNNs), which focus on the local response of an image. In contrast, the recently developed vision transformers (ViTs) can learn stronger global representation for aerial scenes, but are not qualified enough to highlight the key objects in an aerial scene due to the dramatic size and scale variation. To address this challenge, in this letter, we propose a local-global interactive ViT (LG-ViT) for this task. It is based on our deliberately designed local-global feature interactive learning scheme, which intends to jointly utilize the local-wise and global-wise feature representations. To realize the learning scheme in an end-to-end manner, the proposed LG-ViT consists of three key components, namely local-global feature extraction (LGFE), local-global feature interaction (LGFI), and local-global semantic constraints. Extensive experiments on three aerial scene classification benchmarks, namely UC Merced Land Use Dataset (UCM), Aerial Image Dataset (AID), and Northwestern Polytechnical University (NWPU), demonstrate the effectiveness of the proposed LG-ViT against the state-of-the-art methods. The effectiveness of each component and generalization capability are also validated.
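The abstract names a semantic consistency loss between the local and global representations but does not define it. As a rough illustration only, the sketch below assumes one common way to express such a constraint: a symmetric KL divergence between the class distributions predicted by a local branch and a global branch. The function names and the choice of symmetric KL are assumptions made for this example, not the formulation used in LG-ViT.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def semantic_consistency_loss(local_logits, global_logits):
    """Hypothetical consistency term: symmetric KL between the class
    distributions of a local branch and a global branch (an assumption,
    not the loss defined in the paper)."""
    p_local = softmax(local_logits)
    p_global = softmax(global_logits)
    return 0.5 * (kl_divergence(p_local, p_global)
                  + kl_divergence(p_global, p_local))


if __name__ == "__main__":
    # Branches that agree incur no penalty; branches that disagree do.
    print(semantic_consistency_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(semantic_consistency_loss([3.0, 0.0, 0.0], [0.0, 3.0, 0.0]))
```

Under this assumption the term is zero when both branches predict the same distribution and grows as their predictions diverge, so adding it to a classification loss would push the local and global predictions toward agreement.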
Pages: 5