Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers

Authors
Yoo, Jaehoon [1]
Kim, Semin [1]
Lee, Doyup [2]
Kim, Chiheon [2]
Hong, Seunghoon [1]
Affiliations
[1] Korea Adv Inst Sci & Technol, Daejeon, South Korea
[2] Kakao Brain, Seongnam, South Korea
Funding
National Research Foundation, Singapore
Keywords
DOI
10.1109/CVPR52729.2023.02192
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Autoregressive transformers have shown remarkable success in video generation. However, these transformers cannot directly learn long-term dependencies in videos because of the quadratic complexity of self-attention, and they inherently suffer from slow inference and error propagation due to the autoregressive process. In this paper, we propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependencies in videos and fast inference. Building on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves linear time complexity in both encoding and decoding by projecting observable context tokens into a fixed number of latent tokens and conditioning on them to decode the masked tokens through cross-attention. Empowered by linear complexity and bidirectional modeling, our method demonstrates significant improvement over autoregressive transformers in both quality and speed when generating moderately long videos. Videos and code are available at https://sites.google.com/view/mebt-cvpr2023.
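The latent-token idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head attention with no learned projections, and the function names (`cross_attention`, `latent_bottleneck_decode`, `attention_score_count`) and all sizes are hypothetical, chosen only for illustration. It shows context tokens being summarized into a fixed number of latent tokens, which the masked positions then query, so the number of attention scores grows linearly with the token count.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head attention without learned projections (a simplifying assumption).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # shape (n_queries, n_keys)
    return softmax(scores, axis=-1) @ keys_values   # shape (n_queries, d)

def latent_bottleneck_decode(context, masked_queries, latents):
    # Encode: a fixed set of latent tokens attends to the observed context tokens.
    latents = cross_attention(latents, context)
    # Decode: every masked position attends only to the latent tokens.
    return cross_attention(masked_queries, latents)

def attention_score_count(n_context, n_masked, n_latents):
    # Scores computed by the two cross-attention steps above.
    return n_latents * n_context + n_masked * n_latents

rng = np.random.default_rng(0)
d, n_latents = 16, 8                        # hypothetical sizes
context = rng.normal(size=(40, d))          # observed patch tokens
masked = rng.normal(size=(60, d))           # query embeddings for masked patches
latents = rng.normal(size=(n_latents, d))   # fixed number of latent tokens
decoded = latent_bottleneck_decode(context, masked, latents)
```

With the latent count held fixed, doubling both the context and masked token counts doubles the value returned by `attention_score_count`, whereas full self-attention over all 100 tokens in this toy setup would compute 100 × 100 scores and grow quadratically.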
Pages: 22888-22897
Number of pages: 10