Towards Global Video Scene Segmentation with Context-Aware Transformer

Cited by: 0
Authors
Yang, Yang [1, 2, 3]
Huang, Yurui [1]
Guo, Weili [1]
Xu, Baohua [4]
Xia, Dingyin
Affiliations
[1] Nanjing Univ Sci & Technol, Nanjing, Peoples R China
[2] NUAA, MIIT Key Lab Pattern Anal & Machine Intelligence, Nanjing, Peoples R China
[3] NJU, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[4] HUAWEI CBG Edu Lab, Montreal, PQ, Canada
Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3 | 2023
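Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract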
Long videos such as movies or TV episodes usually need to be divided into cohesive units, i.e., scenes, to facilitate the understanding of video semantics. The key challenge lies in finding scene boundaries by comprehensively considering the complex temporal structure and semantic information. To this end, we introduce a novel Context-Aware Transformer (CAT) with a self-supervised learning framework to learn high-quality shot representations for generating well-bounded scenes. More specifically, we design CAT with local-global self-attention, which effectively considers both long-term and short-term context to improve shot encoding. To train CAT, we adopt a self-supervised learning scheme. First, we leverage shot-to-scene level pretext tasks to facilitate pre-training with pseudo boundaries, which guides CAT to learn discriminative shot representations that maximize intra-scene similarity and inter-scene discrimination in an unsupervised manner. Then, we transfer the contextual representations and fine-tune CAT with supervised data, which encourages CAT to accurately detect scene boundaries. As a result, CAT learns context-aware shot representations and provides global guidance for scene segmentation. Our empirical analyses show that CAT achieves state-of-the-art performance on the scene segmentation task on the MovieNet dataset, e.g., an improvement of 2.15 in AP.
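The idea of a pseudo boundary that maximizes intra-scene similarity and inter-scene discrimination can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cosine`, `mean_pairwise`, `pseudo_boundary`), the use of cosine similarity, the scoring rule, and the two-dimensional shot vectors are all assumptions made purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pairwise(group_a, group_b):
    """Average cosine similarity over all pairs, one shot from each group."""
    sims = [cosine(u, v) for u in group_a for v in group_b]
    return sum(sims) / len(sims)

def pseudo_boundary(shots):
    """Hypothetical scoring rule (an assumption, not taken from the paper):
    return the split index k for which shots[:k] and shots[k:] have the
    highest intra-group similarity minus inter-group similarity."""
    best_k, best_score = None, -math.inf
    for k in range(1, len(shots)):
        left, right = shots[:k], shots[k:]
        intra = (mean_pairwise(left, left) + mean_pairwise(right, right)) / 2
        inter = mean_pairwise(left, right)
        score = intra - inter
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Made-up shot embeddings: three similar shots followed by two different ones.
example_shots = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 0.9]]
boundary = pseudo_boundary(example_shots)
```

In this toy input the first three vectors point in nearly the same direction and the last two in another, so the selected split falls between the third and fourth shot. In a real setting the shot vectors would come from a learned encoder rather than being written by hand.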
Pages: 3206-3213
Number of pages: 8
Related papers
50 in total
  • [41] Context-aware local abnormality detection in crowded scene
    ZHU XiaoBin
    JIN Xin
    ZHANG XiaoYu
    LI ChangSheng
    HE FuGang
    WANG Lei
    Science China (Information Sciences), 2015, 58 (05) : 134 - 144
  • [42] CAP: Context-Aware Pruning for Semantic Segmentation
    He, Wei
    Wu, Meiqing
    Liang, Mingfu
    Lam, Siew-Kei
    2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2021), 2021, : 959 - 968
  • [43] Social Context-aware GCN for Video Character Search via Scene-prior Enhancement
    Peng, Wenjun
    He, Weidong
    Xu, Derong
    Xu, Tong
    Zhu, Chen
    Chen, Enhong
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 2609 - 2614
  • [44] Context-aware Horror Video Scene Recognition via Cost-sensitive Sparse Coding
    Ding, Xinmiao
    Li, Bing
    Hu, Weiming
    Xiong, Weihua
    Wang, Zhenchong
    2012 21ST INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR 2012), 2012, : 1904 - 1907
  • [45] Spatial Navigation for Context-Aware Video Surveillance
    de Haan, Gerwin
    Piguillet, Huib
    Post, Frits H.
    IEEE COMPUTER GRAPHICS AND APPLICATIONS, 2010, 30 (05) : 20 - 31
  • [46] Context-aware Synthesis for Video Frame Interpolation
    Niklaus, Simon
    Liu, Feng
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 1701 - 1710
  • [47] IntentQA: Context-aware Video Intent Reasoning
    Li, Jiapeng
    Wei, Ping
    Han, Wenjuan
    Fan, Lifeng
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11929 - 11940
  • [48] CCVS: Context-aware Controllable Video Synthesis
    Le Moing, Guillaume
    Ponce, Jean
    Schmid, Cordelia
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [49] ContextVP: Fully Context-Aware Video Prediction
    Byeon, Wonmin
    Wang, Qin
    Srivastava, Rupesh Kumar
    Koumoutsakos, Petros
    COMPUTER VISION - ECCV 2018, PT XVI, 2018, 11220 : 781 - 797
  • [50] Context-Aware Video Compression for Mobile Robots
    Lazewatsky, Daniel A.
    Giertler, Bogumil
    Witick, Martha
    Perlmutter, Leah
    Maxwell, Bruce A.
    Smart, William D.
    2011 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, 2011, : 4115 - 4120