Temporal Segment Networks for Action Recognition in Videos

Cited by: 569
Authors
Wang, Limin [1]
Xiong, Yuanjun [2]
Wang, Zhe [3]
Qiao, Yu [4]
Lin, Dahua [5]
Tang, Xiaoou [5]
Van Gool, Luc [6]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China
[2] Amazon Web Serv, Seattle, WA 98101 USA
[3] Univ Calif Irvine, Dept Comp Sci, Irvine, CA 92697 USA
[4] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen 518055, Peoples R China
[5] Chinese Univ Hong Kong, Dept Informat Engn, Shatin, Hong Kong, Peoples R China
[6] Swiss Fed Inst Technol, Comp Vis Lab, CH-8092 Zurich, Switzerland
Funding
National Science Foundation (USA);
Keywords
Action recognition; temporal segment networks; temporal modeling; good practices; ConvNets; REPRESENTATION; VECTOR;
DOI
10.1109/TPAMI.2018.2868668
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structure with a new segment-based sampling and aggregation scheme. This unique design enables the TSN framework to efficiently learn action models by using the whole video. The learned models could be easily deployed for action recognition in both trimmed and untrimmed videos with simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for the implementation of the TSN framework given limited training samples. Our approach obtains state-of-the-art performance on five challenging action recognition benchmarks: HMDB51 (71.0 percent), UCF101 (94.9 percent), THUMOS14 (80.1 percent), ActivityNet v1.2 (89.6 percent), and Kinetics400 (75.7 percent). In addition, using the proposed RGB difference as a simple motion representation, our method can still achieve competitive accuracy on UCF101 (91.0 percent) while running at 340 FPS. Furthermore, based on the proposed TSN framework, we won the video classification track at the ActivityNet challenge 2016 among 24 teams.
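The segment-based sampling and average-pooling aggregation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `segment_indices` and `average_consensus`, the equal-length split, and the uniform random pick within each segment are assumptions made here for illustration, since the abstract does not specify these details.

```python
import random


def segment_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of `num_segments` roughly equal segments.

    Assumption: segments are contiguous, cover the whole video, and one
    index is drawn uniformly at random inside each segment.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    # Segment i covers frame indices [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def average_consensus(snippet_scores):
    """Average per-snippet class scores into one video-level score vector."""
    count = len(snippet_scores)
    return [sum(column) / count for column in zip(*snippet_scores)]
```

In this sketch a video-level prediction would come from scoring the frames chosen by `segment_indices` with some per-snippet classifier (not shown, and hypothetical here) and passing the resulting score vectors to `average_consensus`.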
Pages: 2740-2755
Page count: 16