UniFormer: Unifying Convolution and Self-Attention for Visual Recognition

Cited by: 99
Authors
Li, Kunchang [1 ,2 ]
Wang, Yali [1 ,4 ]
Zhang, Junhao [3 ]
Gao, Peng [4 ]
Song, Guanglu [5 ]
Liu, Yu [5 ]
Li, Hongsheng [6 ]
Qiao, Yu [1 ,4 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen Key Lab Comp Vis & Pattern Recognit, Shenzhen 518055, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Natl Univ Singapore, Singapore 119077, Singapore
[4] Shanghai Artificial Intelligence Lab, Shanghai 200232, Peoples R China
[5] SenseTime Res, Shanghai 200233, Peoples R China
[6] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
UniFormer; convolution neural network; transformer; self-attention; visual recognition;
DOI
10.1109/TPAMI.2023.3282631
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
It is a challenging task to learn discriminative representation from images and videos, due to the large local redundancy and complex global dependency in these visual data. Convolution neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. Though CNNs can efficiently decrease local redundancy by convolution within a small neighborhood, their limited receptive field makes it hard to capture global dependency. Alternatively, ViTs can effectively capture long-range dependency via self-attention, but blind similarity comparisons among all the tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of convolution and self-attention in a concise transformer format. Unlike typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local token affinity in shallow layers and global token affinity in deep layers, allowing it to tackle both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our blocks into a new powerful backbone and adopt it for various vision tasks, from the image to the video domain and from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3 top-1 accuracy on the ImageNet-1K classification task. With only ImageNet-1K pre-training, it achieves state-of-the-art performance on a broad range of downstream tasks: 82.9/84.8 top-1 accuracy on Kinetics-400/600 and 60.9/71.2 top-1 accuracy on Something-Something V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on COCO pose estimation. Moreover, we build an efficient UniFormer with a concise hourglass design of token shrinking and recovering, which achieves 2-4x higher throughput than recent lightweight models.
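For readers who want the core idea in code, below is a minimal PyTorch sketch of the two block variants the abstract describes: a shallow-stage block whose relation aggregator uses a depthwise convolution as a learnable local token affinity, and a deep-stage block whose aggregator is global multi-head self-attention. Class names, layer choices (kernel sizes, norms, MLP ratio), and the depthwise-conv position encoding are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class LocalUniFormerBlock(nn.Module):
    # Shallow-stage block (assumed layout): the relation aggregator is a
    # depthwise 5x5 convolution, so each token only aggregates a small
    # neighborhood, which suppresses local redundancy cheaply.
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        # Dynamic position encoding via depthwise 3x3 conv (assumption)
        self.pos_embed = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm1 = nn.BatchNorm2d(dim)
        # Local token affinity: depthwise conv instead of self-attention
        self.aggregator = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.norm2 = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1))

    def forward(self, x):                      # x: (B, C, H, W)
        x = x + self.pos_embed(x)
        x = x + self.aggregator(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

class GlobalUniFormerBlock(nn.Module):
    # Deep-stage block (assumed layout): tokens are flattened into a
    # sequence and multi-head self-attention captures long-range
    # dependency across all of them.
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.pos_embed = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                      # x: (B, C, H, W)
        x = x + self.pos_embed(x)
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)       # (B, H*W, C) token sequence
        n = self.norm1(t)
        t = t + self.attn(n, n, n)[0]          # global token affinity
        t = t + self.mlp(self.norm2(t))
        return t.transpose(1, 2).reshape(B, C, H, W)

# Shallow stages use the local block, deep stages the global one:
x = torch.randn(2, 64, 56, 56)
y = GlobalUniFormerBlock(64)(LocalUniFormerBlock(64)(x))
print(y.shape)                                 # torch.Size([2, 64, 56, 56])

Stacking such blocks with downsampling between stages gives the hybrid backbone the abstract outlines; the efficient hourglass variant additionally shrinks tokens before the attention stages and recovers them afterwards to raise throughput.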
Pages: 12581 - 12600
Page count: 20
Related Papers
50 records in total
  • [1] CMAT: Integrating Convolution Mixer and Self-Attention for Visual Tracking. Wang, Jun; Yin, Peng; Wang, Yuanyun; Yang, Wenhui. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 326-338
  • [2] Lightweight Smoke Recognition Based on Deep Convolution and Self-Attention. Zhao, Yang; Wang, Yigang; Jung, Hoi-Kyung; Jin, Yongqiang; Hua, Dan; Xu, Sen. MATHEMATICAL PROBLEMS IN ENGINEERING, 2022, 2022
  • [3] On the Integration of Self-Attention and Convolution. Pan, Xuran; Ge, Chunjiang; Lu, Rui; Song, Shiji; Chen, Guanfu; Huang, Zeyi; Huang, Gao. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 805-815
  • [4] A visual self-attention network for facial expression recognition. Yu, Naigong; Bai, Deguo. 2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [5] Masked face recognition with convolutional visual self-attention network. Ge, Yiming; Liu, Hui; Du, Junzhao; Li, Zehua; Wei, Yuheng. NEUROCOMPUTING, 2023, 518: 496-506
  • [6] Global Self-Attention as a Replacement for Graph Convolution. Hussain, Md Shamim; Zaki, Mohammed J.; Subramanian, Dharmashankar. PROCEEDINGS OF THE 28TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2022, 2022: 655-665
  • [7] SelfGCN: Graph Convolution Network With Self-Attention for Skeleton-Based Action Recognition. Wu, Zhize; Sun, Pengpeng; Chen, Xin; Tang, Keke; Xu, Tong; Zou, Le; Wang, Xiaofeng; Tan, Ming; Cheng, Fan; Weise, Thomas. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33: 4391-4403
  • [8] Stock Prediction Method Combining Graph Convolution and Convolution Self-Attention. Tian, Hongli; Cui, Yao; Yan, Huiqiang. Computer Engineering and Applications, 2024, 60 (04): 192-199
  • [9] Self-attention for Speech Emotion Recognition. Tarantino, Lorenzo; Garner, Philip N.; Lazaridis, Alexandros. INTERSPEECH 2019, 2019: 2578-2582
  • [10] CellCentroidFormer: Combining Self-attention and Convolution for Cell Detection. Wagner, Royden; Rohr, Karl. MEDICAL IMAGE UNDERSTANDING AND ANALYSIS, MIUA 2022, 2022, 13413: 212-222