Rotary Position Embedding for Vision Transformer

Cited by: 2
Authors
Heo, Byeongho [1 ]
Park, Song [1 ]
Han, Dongyoon [1 ]
Yun, Sangdoo [1 ]
Affiliations
[1] NAVER AI Lab, Seongnam, South Korea
DOI
10.1007/978-3-031-72684-2_17
CLC classification
TP18 [人工智能理论];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Rotary Position Embedding (RoPE) performs remarkably on language models, especially for length extrapolation of Transformers. However, the impact of RoPE on computer vision domains has been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance much as it does in the language domain. This study provides a comprehensive analysis of RoPE applied to ViTs, using practical implementations of RoPE for 2D vision data. The analysis shows that RoPE achieves impressive extrapolation performance, i.e., it maintains precision as image resolution increases at inference. This ultimately yields performance improvements on ImageNet-1k classification, COCO detection, and ADE20k segmentation. We believe this study provides thorough guidelines for applying RoPE to ViTs, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit.
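As a concrete illustration of applying RoPE to 2D vision data, below is a minimal NumPy sketch of an axial 2D variant: half of the channels are rotated by each token's x grid coordinate and half by its y coordinate. The function names (`rope_1d`, `rope_2d_axial`) and the frequency base are illustrative assumptions, not the paper's reference implementation (see the linked repository for that).

```python
import numpy as np

def rope_1d(x, pos, base=100.0):
    """Standard RoPE: rotate consecutive feature pairs of x by angles pos * theta_k.

    x: (n, d) features with d even; pos: (n,) scalar positions.
    The base is an illustrative choice; language models commonly use 10000.
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,) per-pair frequencies
    ang = pos[:, None] * theta[None, :]            # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # planar rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d_axial(x, h, w):
    """Axial 2D RoPE for an h*w patch grid flattened row-major into x of shape (h*w, d)."""
    n, d = x.shape
    assert n == h * w and d % 4 == 0
    ys, xs = np.divmod(np.arange(n), w)            # grid coordinates of each token
    out = x.copy()
    out[:, : d // 2] = rope_1d(x[:, : d // 2], xs.astype(float))   # first half: x-axis
    out[:, d // 2 :] = rope_1d(x[:, d // 2 :], ys.astype(float))   # second half: y-axis
    return out
```

Because each channel pair undergoes a pure rotation, the attention score between a query and key after applying RoPE to both depends only on their relative offset, which is the property behind the resolution extrapolation discussed in the abstract.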
Pages: 289-305 (17 pages)
Related papers (50 in total)
  • [21] Relative-position embedding based spatially and temporally decoupled Transformer for action recognition
    Ma, Yujun
    Wang, Ruili
    PATTERN RECOGNITION, 2024, 145
  • [22] Image Dehazing Transformer with Transmission-Aware 3D Position Embedding
    Guo, Chunle
    Yan, Qixin
    Anwar, Saeed
    Cong, Runmin
    Ren, Wenqi
    Li, Chongyi
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5802 - 5810
  • [23] Utilizing adaptive deformable convolution and position embedding for colon polyp segmentation with a visual transformer
    Sikkandar, Mohamed Yacin
    Sundaram, Sankar Ganesh
    Alassaf, Ahmad
    Almohimeed, Ibrahim
    Alhussaini, Khalid
    Aleid, Adham
    Alolayan, Salem Ali
    Ramkumar, P.
    Almutairi, Meshal Khalaf
    Begum, S. Sabarunisha
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [24] MolRoPE-BERT: An enhanced molecular representation with Rotary Position Embedding for molecular property prediction
    Liu, Yunwu
    Zhang, Ruisheng
    Li, Tongfeng
    Jiang, Jing
    Ma, Jun
    Wang, Ping
    JOURNAL OF MOLECULAR GRAPHICS & MODELLING, 2023, 118
  • [25] KILOWATT ROTARY POWER TRANSFORMER
    MARX, SH
    BOUNDS, RW
    IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS, 1971, AES7 (06) : 1157 - &
  • [26] Rotary Transformer for Image Captioning
    Qiu, Yile
    Zhu, Li
    SECOND INTERNATIONAL CONFERENCE ON OPTICS AND IMAGE PROCESSING (ICOIP 2022), 2022, 12328
  • [27] LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization
    Yu, Runyi
    Wang, Zhennan
    Wang, Yinhuai
    Li, Kehan
    Liu, Chang
    Duan, Haoyi
    Ji, Xiangyang
    Chen, Jie
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 5863 - 5873
  • [28] CONVOLUATIONAL TRANSFORMER WITH ADAPTIVE POSITION EMBEDDING FOR COVID-19 DETECTION FROM COUGH SOUNDS
    Yan, Tianhao
    Meng, Hao
    Liu, Shuo
    Parada-Cabaleiro, Emilia
    Ren, Zhao
    Schuller, Bjoern W.
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 9092 - 9096
  • [29] GFPE-ViT: vision transformer with geometric-fractal-based position encoding
    Wang, Lei
    Tang, Xue-song
    Hao, Kuangrong
    VISUAL COMPUTER, 2025, 41 (02): : 1021 - 1036
  • [30] Graph Evolving and Embedding in Transformer
    Chien, Jen-Tzung
    Tsao, Chia-Wei
    PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 538 - 545