Learning hierarchical video representation for action recognition

Cited by: 15
Authors
Li Q. [1]
Qiu Z. [1]
Yao T. [2]
Mei T. [2]
Rui Y. [2]
Luo J. [3]
Affiliations
[1] University of Science and Technology of China, Hefei
[2] Microsoft Research, Beijing
[3] University of Rochester, New York
Keywords
Action recognition; Deep learning; Video representation learning
DOI
10.1007/s13735-016-0117-4
Abstract
Video analysis is an important branch of computer vision due to its wide range of applications, from video surveillance, video indexing, and retrieval to human-computer interaction. All of these applications depend on a good video representation, which encodes video content into a fixed-length feature vector. Most existing methods treat video as a flat image sequence, but from our observations we argue that video is an information-intensive medium with an intrinsic hierarchical structure, which previous approaches have largely ignored. Therefore, in this work, we represent the hierarchical structure of video with multiple granularities: from short to long, single frame, consecutive frames (motion), short clip, and the entire video. Furthermore, we propose a novel deep learning framework that models each granularity individually. Specifically, we model the frame and motion granularities with 2D convolutional neural networks and the clip and video granularities with 3D convolutional neural networks. Long Short-Term Memory networks are applied to the frame, motion, and clip granularities to further exploit long-term temporal cues. Consequently, the whole framework uses multi-stream CNNs to learn a hierarchical representation that captures both spatial and temporal information of video. To validate its effectiveness in video analysis, we apply this video representation to the action recognition task. We adopt a distribution-based fusion strategy to combine the decision scores from all the granularities, which are obtained by applying a softmax layer on top of each stream. We conduct extensive experiments on three action benchmarks (UCF101, HMDB51, and CCV) and achieve competitive performance against several state-of-the-art methods. © 2017, Springer-Verlag London.
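As an illustration of the score-level fusion mentioned in the abstract, the following is a minimal Python sketch of late fusion over per-stream softmax scores. The abstract does not specify how the distribution-based fusion strategy computes its weights, so this sketch substitutes a plain weighted average, with uniform weights by default. The function names (`softmax`, `fuse_scores`), the stream names, and the example logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(stream_logits, weights=None):
    """Weighted average of per-stream softmax scores.

    Assumption: a simple weighted average stands in for the paper's
    distribution-based fusion, whose details the abstract does not give.
    """
    names = list(stream_logits)
    if weights is None:
        # Uniform weights: every granularity contributes equally.
        weights = {name: 1.0 / len(names) for name in names}
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        for c, p in enumerate(probs):
            fused[c] += weights[name] * p
    return fused


# Hypothetical logits for 3 action classes from the four granularities.
stream_logits = {
    "frame": [2.0, 0.5, 0.1],
    "motion": [1.5, 1.7, 0.2],
    "clip": [2.2, 0.3, 0.4],
    "video": [1.0, 0.9, 0.8],
}

fused = fuse_scores(stream_logits)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

In this toy input the motion stream alone slightly prefers the second class, but the fused scores still select the first class because the other three streams agree on it. Passing a `weights` dictionary whose values sum to 1 lets one stream count more than another.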
Pages: 85-98
Page count: 13