In Defense of Image Pre-Training for Spatiotemporal Recognition

被引:0
|
作者
Li, Xianhang [1 ]
Wang, Huiyu [2 ]
Wei, Chen [2 ]
Mei, Jieru [2 ]
Yuille, Alan [2 ]
Zhou, Yuyin [1 ]
Xie, Cihang [1 ]
机构
[1] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
来源
关键词
Video classification; Imagenet pre-training; 3d convolution networks;
D O I
10.1007/978-3-031-19806-9_39
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Image pre-training, the current de-facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to directly train with spatiotemporal convolutional neural networks (CNNs) from scratch. Nonetheless, interestingly, by taking a closer look at these from-scratch learned CNNs, we note there exist certain 3D kernels that exhibit much stronger appearance modeling ability than others, arguably suggesting appearance information is already well disentangled in learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pretraining lies in the decomposition of learning spatial and temporal features, and revisiting image pre-training as the appearance prior to initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to further enable a more thorough decomposition of spatiotemporal features for fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs without increasing parameters and computation on both Kinetics-400 and Something-Something V2. Moreover, this new training pipeline consistently achieves better results on video recognition with significant speedup. For instance, we achieve +0.6% top-1 of Slowfast on Kinetics-400 over the strong 256-epoch 128-GPU baseline while fine-tuning for only 50 epochs with 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
引用
收藏
页码:675 / 691
页数:17
相关论文
共 50 条
  • [1] Image-Text Pre-Training for Logo Recognition
    Hubenthal, Mark
    Kumar, Suren
    [J]. 2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 1145 - 1154
  • [2] iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-training for Visual Recognition
    Wei, Yixuan
    Cao, Yue
    Zhang, Zheng
    Peng, Houwen
    Yao, Zhuliang
    Xie, Zhenda
    Hue, Han
    Guo, Baining
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2776 - 2786
  • [3] Sigmoid Loss for Language Image Pre-Training
    Zhai, Xiaohua
    Mustafa, Basil
    Kolesnikov, Alexander
    Beyer, Lucas
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11941 - 11952
  • [4] Grounded Language-Image Pre-training
    Li, Liunian Harold
    Zhang, Pengchuan
    Zhang, Haotian
    Yang, Jianwei
    Li, Chunyuan
    Zhong, Yiwu
    Wang, Lijuan
    Yuan, Lu
    Zhang, Lei
    Hwang, Jenq-Neng
    Chang, Kai-Wei
    Gao, Jianfeng
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10955 - 10965
  • [5] SELF-TRAINING AND PRE-TRAINING ARE COMPLEMENTARY FOR SPEECH RECOGNITION
    Xu, Qiantong
    Baevski, Alexei
    Likhomanenko, Tatiana
    Tomasello, Paden
    Conneau, Alexis
    Collobert, Ronan
    Synnaeve, Gabriel
    Auli, Michael
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3030 - 3034
  • [6] Image Difference Captioning with Pre-training and Contrastive Learning
    Yao, Linli
    Wang, Weiying
    Jin, Qin
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3108 - 3116
  • [7] FOOD IMAGE RECOGNITION USING DEEP CONVOLUTIONAL NETWORK WITH PRE-TRAINING AND FINE-TUNING
    Yanai, Keiji
    Kawano, Yoshiyuki
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2015,
  • [8] A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training
    Meng, Weijing
    Yolwas, Nurmemet
    [J]. SENSORS, 2023, 23 (02)
  • [9] LATTICEBART: LATTICE-TO-LATTICE PRE-TRAINING FOR SPEECH RECOGNITION
    Dai, Lingfeng
    Chen, Lu
    Zhou, Zhikai
    Yu, Kai
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6112 - 6116
  • [10] Coarse-to-Fine Pre-training for Named Entity Recognition
    Xue, Mengge
    Yu, Bowen
    Zhang, Zhenyu
    Liu, Tingwen
    Zhang, Yue
    Bin Wang
    [J]. PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 6345 - 6354