A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing

Authors
Li, Maomao [1, 2]
Li, Yu [2]
Yang, Tianyu [2]
Liu, Yunfei [2]
Yue, Dongxu [3]
Lin, Zhihui [4]
Xu, Dong [1]
Affiliations
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Int Digital Econ Acad IDEA, Shenzhen, Peoples R China
[3] Peking Univ, Beijing, Peoples R China
[4] Tsinghua Univ, Beijing, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024 | 2024
Keywords
QUALITY
DOI
10.1109/CVPR52733.2024.00719
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents a video inversion approach for zero-shot video editing, which models the input video with a low-rank representation during the inversion process. Existing video editing methods usually apply the typical 2D DDIM inversion or a naive spatial-temporal DDIM inversion before editing, which leverages a time-varying representation for each frame to derive the noisy latent. Unlike most existing approaches, we propose a Spatial-Temporal Expectation-Maximization (STEM) inversion, which formulates the dense video feature in an expectation-maximization manner and iteratively estimates a more compact basis set to represent the whole video. Each frame applies the fixed, global representation for inversion, which is more favorable to temporal consistency during reconstruction and editing. Extensive qualitative and quantitative experiments demonstrate that our STEM inversion achieves consistent improvement on two state-of-the-art video editing methods. Project page: https://stem-inv.github.io/page/.
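The idea of iteratively estimating a compact basis set with expectation-maximization can be illustrated with a minimal sketch. The abstract does not give the update rules, so the code below assumes a generic soft-clustering form of EM (softmax responsibilities in the E-step, responsibility-weighted means in the M-step). The function names, the initialization, and the default iteration count are all hypothetical and are not taken from the authors' implementation.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def e_step(features, bases):
    """E-step (assumed form): soft-assign every feature to the bases."""
    return [_softmax([_dot(x, b) for b in bases]) for x in features]


def m_step(features, resp, num_bases):
    """M-step (assumed form): each basis is a responsibility-weighted mean."""
    dim = len(features[0])
    bases = []
    for k in range(num_bases):
        weight = sum(r[k] for r in resp) + 1e-8  # avoid division by zero
        bases.append([
            sum(r[k] * x[c] for r, x in zip(resp, features)) / weight
            for c in range(dim)
        ])
    return bases


def em_bases(features, num_bases, num_iters=3, seed=0):
    """Estimate `num_bases` vectors summarizing all N feature vectors.

    `features` stands in for the dense video feature, with the tokens of
    every frame flattened into one list (an illustrative assumption).
    Returns the bases and the final N x K responsibilities.
    """
    rng = random.Random(seed)
    # Hypothetical initialization: copy randomly chosen features.
    bases = [list(x) for x in rng.sample(features, num_bases)]
    for _ in range(num_iters):
        resp = e_step(features, bases)
        bases = m_step(features, resp, num_bases)
    return bases, e_step(features, bases)


def reconstruct(bases, resp):
    """Low-rank reconstruction: each feature as a mixture of the bases."""
    dim = len(bases[0])
    return [
        [sum(r[k] * bases[k][c] for k in range(len(bases))) for c in range(dim)]
        for r in resp
    ]
```

In this sketch the same small set of bases is shared by every token of every frame, which mirrors the abstract's point that a fixed, global representation replaces a per-frame, time-varying one. The number of bases is a free parameter here; the paper's title suggests 256.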
Pages: 7528-7537
Page count: 10