Learning hierarchical video representation for action recognition

Cited by: 15

Authors:

Li Q. [1]

Qiu Z. [1]

Yao T. [2]

Mei T. [2]

Rui Y. [2]

Luo J. [3]

Affiliations:

[1] University of Science and Technology of China, Hefei

[2] Microsoft Research, Beijing

[3] University of Rochester, New York

Source:

International Journal of Multimedia Information Retrieval | 2017 / Vol. 6 / Issue 1

Keywords:

Action recognition; Deep learning; Video representation learning;

DOI:

10.1007/s13735-016-0117-4

Abstract:

Video analysis is an important branch of computer vision due to its wide applications, ranging from video surveillance and video indexing and retrieval to human-computer interaction. All of these applications depend on a good video representation, which encodes video content into a fixed-length feature vector. Most existing methods treat video as a flat image sequence, but from our observations we argue that video is an information-intensive medium with an intrinsic hierarchical structure, which is largely ignored by previous approaches. Therefore, in this work, we represent the hierarchical structure of video with multiple granularities including, from short to long, single frame, consecutive frames (motion), short clip, and the entire video. Furthermore, we propose a novel deep learning framework to model each granularity individually. Specifically, we model the frame and motion granularities with 2D convolutional neural networks and model the clip and video granularities with 3D convolutional neural networks. Long Short-Term Memory networks are applied to the frame, motion, and clip granularities to further exploit long-term temporal cues. Consequently, the whole framework utilizes multi-stream CNNs to learn a hierarchical representation that captures the spatial and temporal information of video. To validate its effectiveness in video analysis, we apply this video representation to the action recognition task. We adopt a distribution-based fusion strategy to combine the decision scores from all the granularities, which are obtained by using a softmax layer on top of each stream. We conduct extensive experiments on three action benchmarks (UCF101, HMDB51, and CCV) and achieve competitive performance against several state-of-the-art methods. © 2017, Springer-Verlag London.
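The abstract describes combining per-stream softmax scores across the four granularities but does not spell out the distribution-based fusion rule. As a rough illustration of the general idea of late score fusion only, the sketch below applies a softmax to each stream's class scores and takes a weighted average. The weighted average is a stand-in assumption, not the paper's fusion strategy, and the function names, stream weights, and example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(stream_logits, weights=None):
    """Weighted average of per-stream softmax scores.

    Illustrative late fusion only; the paper's distribution-based
    strategy is not specified in the abstract.
    """
    names = list(stream_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}  # assumption: equal weights
    norm = sum(weights[name] for name in names)
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        for c, p in enumerate(probs):
            fused[c] += weights[name] / norm * p
    return fused


# Hypothetical logits for 3 action classes, one list per granularity stream.
example = {
    "frame": [2.0, 1.0, 0.1],
    "motion": [0.5, 2.5, 0.2],
    "clip": [1.5, 1.4, 0.3],
    "video": [0.2, 3.0, 0.1],
}
fused = fuse_scores(example)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted, [round(p, 3) for p in fused])
```

Because each stream's output is already a probability distribution, the normalized weighted average is one as well, so the fused scores can be compared directly across classes.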

Pages: 85-98

Page count: 13
