Recurrent Region Attention and Video Frame Attention Based Video Action Recognition Network Design

Cited by: 0

Authors:

Sang H.-F. [1]

Zhao Z.-Y. [1]

He D.-K. [2]

Affiliations:

[1] School of Information Science & Engineering, Shenyang University of Technology, Shenyang, 110870, Liaoning

[2] College of Information Science & Engineering, Northeastern University, Shenyang, 110819, Liaoning

Source:

Zhao, Zi-Yu (Maikuraky1022@outlook.com) | 2020 / Chinese Institute of Electronics / Vol. 48

Keywords:

Action recognition; Recurrent neural network; Recurrent region attention; Video frame attention

DOI:

10.3969/j.issn.0372-2112.2020.06.002

Chinese Library Classification (CLC) number:

TN94 [Television]

Discipline classification code:

0810; 081001

Abstract:

In video frames, complex backgrounds, lighting conditions, and other visual information unrelated to the action introduce considerable redundancy and noise into the spatial features of the action, which reduces recognition accuracy. To address this, this paper first proposes a recurrent region attention cell that captures the visual information of action-related regions in the spatial features. Based on the sequential nature of video, a recurrent region attention model (RRA) is proposed. Second, this paper proposes a video frame attention model (VFA) that highlights the more important frames in the video sequence of the whole action, reducing the interference caused by similar before-and-after correlations between video sequences of different actions. Finally, this paper presents an end-to-end trainable network: the recurrent region attention and video frame attention based video action recognition network (RFANet). Experiments on two video action recognition benchmarks, UCF101 and HMDB51, show that the proposed RFANet reliably identifies the category of the action in a video. Inspired by the two-stream structure, we also construct a two-modality RFANet. Under the same training conditions, the two-modality RFANet achieved the best performance on both datasets. © 2020, Chinese Institute of Electronics. All rights reserved.
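The abstract does not give the formulas behind RRA or VFA, so the following is only a generic soft-attention sketch of the two-level idea it describes: weight the regions inside each frame, then weight the frames of the sequence. It is not the authors' recurrent region attention cell. The function names (`softmax`, `attend`, `video_descriptor`) and the score inputs are hypothetical; in a trained network the scores would come from learned layers (recurrent ones for the region scores), which this sketch omits.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, scores):
    """Return the softmax-weighted sum of feature vectors and the weights used."""
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


def video_descriptor(video, region_scores, frame_scores):
    """Pool regions within each frame, then pool frames across the video.

    video[t][k] is the feature vector of region k in frame t.
    region_scores[t][k] and frame_scores[t] are assumed relevance scores
    supplied by the caller (stand-ins for learned attention scores).
    """
    # Region-level attention: one pooled feature vector per frame.
    frame_feats = [
        attend(regions, scores)[0]
        for regions, scores in zip(video, region_scores)
    ]
    # Frame-level attention: one pooled feature vector per video.
    return attend(frame_feats, frame_scores)


if __name__ == "__main__":
    # Two frames, two regions per frame, 2-dimensional toy features.
    video = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [0.0, 0.0]]]
    region_scores = [[0.0, 0.0], [0.0, 0.0]]  # equal scores give uniform weights
    frame_scores = [0.0, 0.0]
    descriptor, frame_weights = video_descriptor(video, region_scores, frame_scores)
    print(descriptor, frame_weights)
```

With equal scores the weights are uniform and the result is a plain average. Raising one region's or frame's score shifts weight toward it, which is the sense in which attention "highlights" action-related regions and important frames.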


Pages: 1052-1061

Page count: 9

