Video Scene Parsing with Predictive Feature Learning

Cited by: 66
Authors
Jin, Xiaojie [1]
Li, Xin [2]
Xiao, Huaxin [2]
Shen, Xiaohui [3]
Lin, Zhe [3]
Yang, Jimei [3]
Chen, Yunpeng [2]
Dong, Jian [5]
Liu, Luoqi [4]
Jie, Zequn [4]
Feng, Jiashi [2]
Yan, Shuicheng [2, 5]
Affiliations
[1] NUS, NUS Grad Sch Integrat Sci & Engn NGS, Singapore, Singapore
[2] NUS, Dept ECE, Singapore, Singapore
[3] Adobe Res, San Jose, CA USA
[4] Tencent AI Lab, Seattle, WA USA
[5] 360 AI Inst, Ellicott City, MD USA
DOI
10.1109/ICCV.2017.595
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video scene parsing is challenging for two reasons: first, it is non-trivial to learn meaningful video representations that produce a temporally consistent labeling map; second, this learning process becomes more difficult when labeled video training data is insufficient. In this work, we propose a unified framework to address both problems, which is to our knowledge the first model to employ predictive feature learning in video scene parsing. The predictive feature learning is carried out through two predictive tasks: frame prediction and predictive parsing. Experiments show that the predictive features learned by our model significantly enhance video parsing performance when combined with a standard image parsing network. Interestingly, the performance gain brought by predictive learning is almost costless, as the features are learned from a large amount of unlabeled video data in an unsupervised way. Extensive experiments on two challenging datasets, Cityscapes and CamVid, demonstrate the effectiveness of our model, showing remarkable improvement over well-established baselines.
Pages: 5581 - 5589
Page count: 9
Related papers
50 in total
  • [21] Soft video parsing by label distribution learning
    Miaogen Ling
    Xin Geng
    Frontiers of Computer Science, 2019, 13 : 302 - 317
  • [22] Scene Consistency Representation Learning for Video Scene Segmentation
    Wu, Haoqian
    Chen, Keyu
    Luo, Yanan
    Qiao, Ruizhi
    Ren, Bo
    Liu, Haozhe
    Xie, Weicheng
    Shen, Linlin
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022 : 14001 - 14010
  • [23] Depth Embedded Recurrent Predictive Parsing Network for Video Scenes
    Zhou, Lingli
    Zhang, Haofeng
    Long, Yang
    Shao, Ling
    Yang, Jingyu
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2019, 20 (12) : 4643 - 4654
  • [24] A Survey on Algorithm Research of Scene Parsing Based on Deep Learning
    Zhang R.
    Li J.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2020, 57 (04) : 859 - 875
  • [25] Hierarchical Scene Parsing by Weakly Supervised Learning with Image Descriptions
    Zhang, Ruimao
    Lin, Liang
    Wang, Guangrun
    Wang, Meng
    Zuo, Wangmeng
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019, 41 (03) : 596 - 610
  • [26] EFRNet: Efficient Feature Reconstructing Network for Real-Time Scene Parsing
    Li, Xin
    Yang, Fan
    Luo, Ao
    Jiao, Zhicheng
    Cheng, Hong
    Liu, Zicheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 2852 - 2865
  • [27] FRNet: Feature Reconstruction Network for RGB-D Indoor Scene Parsing
    Zhou, Wujie
    Yang, Enquan
    Lei, Jingsheng
    Yu, Lu
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (04) : 677 - 687
  • [28] Learning to Exploit Stability for 3D Scene Parsing
    Du, Yilun
    Liu, Zhijian
    Basevi, Hector
    Leonardis, Ales
    Freeman, William T.
    Tenenbaum, Joshua B.
    Wu, Jiajun
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [29] Hierarchical Parsing Net: Semantic Scene Parsing From Global Scene to Objects
    Shi, Hengcan
    Li, Hongliang
    Meng, Fanman
    Wu, Qingbo
    Xu, Linfeng
    Ngan, King Ngi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (10) : 2670 - 2682
  • [30] Learning and parsing video events with goal and intent prediction
    Pei, Mingtao
    Si, Zhangzhang
    Yao, Benjamin Z.
    Zhu, Song-Chun
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2013, 117 (10) : 1369 - 1383