Rotary Position Embedding for Vision Transformer

Cited by: 2
Authors
Heo, Byeongho [1 ]
Park, Song [1 ]
Han, Dongyoon [1 ]
Yun, Sangdoo [1 ]
Affiliations
[1] NAVER AI Lab, Seongnam, South Korea
DOI: 10.1007/978-3-031-72684-2_17
CLC classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Rotary Position Embedding (RoPE) performs remarkably well on language models, especially for length extrapolation in Transformers. However, the impact of RoPE on computer vision domains has been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance much as it does in the language domain. This study provides a comprehensive analysis of RoPE applied to ViTs, using practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., it maintains precision as image resolution increases at inference, which ultimately yields performance improvements on ImageNet-1k classification, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines for applying RoPE to ViTs, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit.
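The "practical implementations of RoPE for 2D vision data" mentioned in the abstract can be illustrated with a minimal axial sketch: standard 1D RoPE rotates consecutive channel pairs by position-dependent angles, and a common 2D extension applies it separately per axis, rotating half of each patch token's channels by the column index and the other half by the row index. This is an assumption-labeled illustration of the general technique, not the paper's exact implementation (function names and the `base` frequency are choices made here, not taken from the source):

```python
import numpy as np

def rope_1d(x, pos, base=100.0):
    """Standard RoPE: rotate consecutive channel pairs of x by angles pos * theta_i.

    x: (..., d) with d even; pos: scalar or array broadcastable over x's leading dims.
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)              # per-pair frequencies, shape (d/2,)
    ang = np.asarray(pos, dtype=float)[..., None] * theta  # rotation angle per pair
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * cos - x2 * sin                   # 2x2 rotation applied pairwise
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def axial_rope_2d(x, h, w):
    """Axial 2D RoPE for an (h*w, d) sequence of ViT patch queries/keys:
    the first d/2 channels encode the column index, the last d/2 the row index."""
    d = x.shape[-1]
    rows, cols = np.divmod(np.arange(h * w), w)            # patch (row, col) on the h x w grid
    return np.concatenate(
        [rope_1d(x[:, : d // 2], cols), rope_1d(x[:, d // 2:], rows)], axis=-1
    )
```

Because each pair is rotated by an angle linear in position, the attention score q·k after rotation depends only on the 2D offset between two patches, never on their absolute locations. This relative-only dependence is what makes the extrapolation behavior described in the abstract plausible: patches at resolutions unseen during training simply receive larger rotations along the same frequency spectrum.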
Pages: 289 - 305 (17 pages)
Related Papers (50 total)
  • [1] RoFormer: Enhanced transformer with Rotary Position Embedding
    Su, Jianlin; Ahmed, Murtadha; Lu, Yu; Pan, Shengfeng; Bo, Wen; Liu, Yunfeng
    NEUROCOMPUTING, 2024, 568
  • [2] An enhanced vision transformer with wavelet position embedding for histopathological image classification
    Ding, Meidan; Qu, Aiping; Zhong, Haiqin; Lai, Zhihui; Xiao, Shuomin; He, Penghui
    PATTERN RECOGNITION, 2023, 140
  • [3] Transformer-Based End-to-End Speech Translation With Rotary Position Embedding
    Li, Xueqing; Li, Shengqiang; Zhang, Xiao-Lei; Rahardja, Susanto
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 371 - 375
  • [4] Vision Transformer with pre-positional embedding
    Eguchi, Takuro; Kuroki, Yoshimitsu
    INTERNATIONAL WORKSHOP ON ADVANCED IMAGING TECHNOLOGY, IWAIT 2024, 2024, 13164
  • [5] Rethinking Position Embedding Methods in the Transformer Architecture
    Zhou, Xin; Ren, Zhaohui; Zhou, Shihua; Jiang, Zeyu; Yu, Tianzhuang; Luo, Hengfa
    NEURAL PROCESSING LETTERS, 2024, 56 (02)
  • [6] A CCD position servo system based on rotary transformer
    Wu, Shilin; Zhang, Qi; Zhu, Zhaoxuan
    ISTM/2007: 7TH INTERNATIONAL SYMPOSIUM ON TEST AND MEASUREMENT, VOLS 1-7, CONFERENCE PROCEEDINGS, 2007 : 3411 - 3414
  • [7] Convolutional Embedding Makes Hierarchical Vision Transformer Stronger
    Wang, Cong; Xu, Hongmin; Zhang, Xiong; Wang, Li; Zheng, Zhitong; Liu, Haifeng
    COMPUTER VISION, ECCV 2022, PT XX, 2022, 13680 : 739 - 756
  • [8] Position embedding fusion on transformer for dense video captioning
    Yang, Sixuan; Tang, Pengjie; Wang, Hanli; Li, Qinyu
    DEVELOPMENTS OF ARTIFICIAL INTELLIGENCE TECHNOLOGIES IN COMPUTATION AND ROBOTICS, 2020, 12 : 792 - 799
  • [9] The encoding method of position embeddings in vision transformer
    Jiang, Kai; Peng, Peng; Lian, Youzao; Xu, Weisheng
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2022, 89