SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

Cited by: 3
Authors
Lin, Yuanze [1 ]
Wei, Chen [2 ]
Wang, Huiyu [2 ]
Yuille, Alan [2 ]
Xie, Cihang [3 ]
Affiliations
[1] Univ Washington, Seattle, WA 98195 USA
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
[3] UC Santa Cruz, Santa Cruz, CA USA
DOI
10.1109/ICCV51070.2023.00233
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Video-language pre-training is crucial for learning powerful multi-modal representations, but it typically requires a massive amount of computation. In this paper, we develop SMAUG, an efficient pre-training framework for video-language models. The foundational component of SMAUG is the masked autoencoder. Unlike prior works that mask only textual inputs, our masking strategy covers both the visual and textual modalities, yielding better cross-modal alignment and larger pre-training savings. On top of that, we introduce a space-time token sparsification module, which leverages context information to select only the "important" spatial regions and temporal frames for pre-training. Together, these designs allow our method to achieve competitive performance on text-to-video retrieval and video question answering while cutting pre-training costs by 1.9x or more. For example, SMAUG needs only ~50 NVIDIA A6000 GPU hours of pre-training to attain competitive performance on these two video-language tasks across six popular benchmarks.
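The abstract names two efficiency mechanisms: MAE-style random masking applied to the visual tokens (in addition to text), and a space-time sparsification step that keeps only salient tokens. The PyTorch sketch below is a minimal illustration of those two ideas, not the authors' released implementation; the 75% mask ratio, the random placeholder saliency scores, and the function names are assumptions made for the example.

    import torch

    def mask_tokens(tokens: torch.Tensor, mask_ratio: float):
        # MAE-style random masking: keep a random (1 - mask_ratio) subset
        # of tokens; tokens has shape (batch, num_tokens, dim).
        B, N, D = tokens.shape
        num_keep = int(N * (1 - mask_ratio))
        noise = torch.rand(B, N, device=tokens.device)      # random per-token scores
        ids_keep = noise.argsort(dim=1)[:, :num_keep]       # random subset of indices
        kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        return kept, ids_keep

    def sparsify_topk(tokens: torch.Tensor, saliency: torch.Tensor, k: int):
        # Keep only the k highest-saliency tokens; a stand-in for the
        # paper's space-time token sparsification. saliency: (batch, num_tokens),
        # e.g. attention weights from a [CLS] token in practice.
        B, N, D = tokens.shape
        topk = saliency.topk(k, dim=1).indices              # (B, k)
        return torch.gather(tokens, 1, topk.unsqueeze(-1).expand(-1, -1, D))

    # Toy usage: 8 frames x 196 patches per frame, ViT-B token dim 768.
    video_tokens = torch.randn(2, 8 * 196, 768)
    kept, _ = mask_tokens(video_tokens, mask_ratio=0.75)    # 75% visual masking
    saliency = torch.rand(kept.shape[0], kept.shape[1])     # placeholder scores
    sparse = sparsify_topk(kept, saliency, k=128)
    print(sparse.shape)                                     # torch.Size([2, 128, 768])

Because the encoder only ever sees the kept tokens, both steps shrink the transformer's input length, which is where the reported pre-training savings come from.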
Pages: 2459-2469 (11 pages)
Related Papers (10 of 50 shown)
  • [1] Survey: Transformer based video-language pre-training. Ruan, Ludan; Jin, Qin. AI OPEN, 2022, 3: 1-13.
  • [2] HiVLP: Hierarchical Interactive Video-Language Pre-Training. Shao, Bin; Liu, Jianzhuang; Pei, Renjing; Xu, Songcen; Dai, Peng; Lu, Juwei; Li, Weimian; Yan, Youliang. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 13710-13720.
  • [3] Object-aware Video-language Pre-training for Retrieval. Wang, Alex Jinpeng; Ge, Yixiao; Cai, Guanyu; Yan, Rui; Lin, Xudong; Shan, Ying; Qie, Xiaohu; Shou, Mike Zheng. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 3303-3312.
  • [4] All in One: Exploring Unified Video-Language Pre-training. Wang, Jinpeng; Ge, Yixiao; Yan, Rui; Ge, Yuying; Lin, Kevin Qinghong; Tsutsui, Satoshi; Lin, Xudong; Cai, Guanyu; Wu, Jianping; Shan, Ying; Qie, Xiaohu; Shou, Mike Zheng. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2023), 2023: 6598-6608.
  • [5] HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training. Ye, Qinghao; Xu, Guohai; Yan, Ming; Xu, Haiyang; Qian, Qi; Zhang, Ji; Huang, Fei. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 15359-15370.
  • [6] Focus and Align: Learning Tube Tokens for Video-Language Pre-Training. Zhu, Yongqing; Li, Xiangyang; Zheng, Mao; Yang, Jiahao; Wang, Zihan; Guo, Xiaoqian; Chai, Zifeng; Yuan, Yuchen; Jiang, Shuqiang. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25: 8036-8050.
  • [7] VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding. Xu, Hu; Ghosh, Gargi; Huang, Po-Yao; Arora, Prahal; Aminzadeh, Masoumeh; Feichtenhofer, Christoph; Metze, Florian; Zettlemoyer, Luke. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL-IJCNLP 2021), 2021: 4227-4239.
  • [8] EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone. Pramanick, Shraman; Song, Yale; Nag, Sayan; Lin, Kevin Qinghong; Shah, Hardik; Shou, Mike Zheng; Chellappa, Rama; Zhang, Pengchuan. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 5262-5274.
  • [9] Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning. Sun, Yuchong; Xue, Hongwei; Song, Ruihua; Liu, Bei; Yang, Huan; Fu, Jianlong. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022.
  • [10] Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling. Berend, Gabor. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023: 13949-13962.