Rotary Position Embedding for Vision Transformer

Cited by: 2
Authors:
Heo, Byeongho [1]
Park, Song [1]
Han, Dongyoon [1]
Yun, Sangdoo [1]
Affiliation:
[1] NAVER AI Lab, Seongnam, South Korea
DOI:
10.1007/978-3-031-72684-2_17
CLC Classification:
TP18 [Artificial Intelligence Theory]
Subject Classification Codes:
081104; 0812; 0835; 1405
Abstract:
Rotary Position Embedding (RoPE) performs remarkably well on language models, especially for length extrapolation of Transformers. However, the impact of RoPE on computer vision domains has been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It eventually leads to performance improvements on ImageNet-1k classification, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines for applying RoPE to ViTs, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit
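The abstract refers to practical 2D implementations of RoPE for vision data. As a rough illustration only (not the authors' released code), the sketch below shows one common "axial" 2D variant: half of each query/key vector is rotated by the patch's x-coordinate and the other half by its y-coordinate, so attention scores depend only on the relative 2D offset between patches. Function names and the frequency base are illustrative assumptions.

```python
import numpy as np

def rope_1d(x, pos, theta=100.0):
    """Apply 1D RoPE to features x (..., d) at positions pos (...,).
    Each channel pair (2i, 2i+1) is rotated by angle pos * theta**(-2i/d)."""
    d = x.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)   # (d/2,) rotation frequencies
    ang = pos[..., None] * freqs                  # (..., d/2) angles per pair
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * cos - x2 * sin          # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d_axial(x, xy):
    """Axial 2D RoPE: first half of channels rotated by the x-coordinate,
    second half by the y-coordinate. x: (n, d), xy: (n, 2) patch coords."""
    d = x.shape[-1]
    assert d % 4 == 0, "need an even number of channel pairs per axis"
    half = d // 2
    return np.concatenate([
        rope_1d(x[:, :half], xy[:, 0]),
        rope_1d(x[:, half:], xy[:, 1]),
    ], axis=-1)

# Relative-position property: q·k after rotation depends only on the
# coordinate offset (dx, dy), not on absolute patch positions.
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))
a = rope_2d_axial(q, np.array([[3, 5]])) @ rope_2d_axial(k, np.array([[1, 2]])).T
b = rope_2d_axial(q, np.array([[5, 7]])) @ rope_2d_axial(k, np.array([[3, 4]])).T
print(np.allclose(a, b))  # True: same (dx, dy) offset, same attention score
```

Because each rotation is orthogonal, feature norms are preserved, and extrapolating to larger grids at inference only requires feeding larger coordinate values; no extra parameters are learned.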
Pages: 289 - 305 (17 pages)
Related Papers (50 items)
  • [41] Sufficient Vision Transformer
    Cheng, Zhi
    Su, Xiu
    Wang, Xueyu
    You, Shan
    Xu, Chang
    PROCEEDINGS OF THE 28TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2022, 2022, : 190 - 200
  • [42] Design of a new separable rotary transformer
    Gong, X. F.
    Zhang, L.
    Feng, E. J.
    2017 3RD INTERNATIONAL CONFERENCE ON APPLIED MATERIALS AND MANUFACTURING TECHNOLOGY (ICAMMT 2017), 2017, 242
  • [43] A position on vision
    Yates, Darran
    NATURE REVIEWS NEUROSCIENCE, 2018, 19 (11) : 642 - 642
  • [44] LGViT: A Local and Global Vision Transformer with Dynamic Contextual Position Bias Using Overlapping Windows
    Zhou, Qian
    Zou, Hua
    Wu, Huanhuan
    APPLIED SCIENCES-BASEL, 2023, 13 (03):
  • [46] DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion
    Chu, Zhenzhen
    Chen, Jiayu
    Chen, Cen
    Wang, Chengyu
    Wu, Ziheng
    Huang, Jun
    Qian, Weining
    PROCEEDINGS OF THE 2024 SIAM INTERNATIONAL CONFERENCE ON DATA MINING, SDM, 2024, : 688 - 696
  • [47] ImplantFormer: vision transformer-based implant position regression using dental CBCT data
    Yang, Xinquan
    Li, Xuguang
    Li, Xuechen
    Wu, Peixi
    Shen, Linlin
    Deng, Yongqiang
    NEURAL COMPUTING & APPLICATIONS, 2024, 36 (12): : 6643 - 6658
  • [49] USE OF A CONTACTLESS ROTARY TRANSFORMER FOR AUTOMATIC RECORDING OF DEFORMATIONS WITH ROTARY RHEOVISCOSIMETERS
    PANOV, YN
SAMSONOV, TI
    FRENKEL, SY
    INDUSTRIAL LABORATORY, 1966, 32 (08): : 1254 - &
  • [50] Compensation Modeling and Optimization on Contactless Rotary Transformer in Rotary Ultrasonic Machining
    Zhang, Jianguo
    Long, Zhili
    Wang, Can
    Zhao, Heng
    Li, Yangmin
    JOURNAL OF MANUFACTURING SCIENCE AND ENGINEERING-TRANSACTIONS OF THE ASME, 2020, 142 (10):