Action Recognition Based on Feature Interaction and Clustering

Cited by: 0
Authors
Li K. [1]
Cai P. [1]
Zhou Z. [1]
Affiliations
[1] State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing
Keywords
action recognition; feature clustering; feature interaction; spatiotemporal feature relationship
DOI: 10.3724/SP.J.1089.2023.19493
Abstract
To address the lack of spatiotemporal feature relationship modeling in existing action recognition methods, an action recognition method based on feature interaction and clustering is proposed. First, a mixed multi-scale feature extraction network is designed to extract the spatial and temporal features of continuous frames. Second, a feature interaction module based on the non-local operation is designed to realize spatiotemporal feature interaction. Finally, a hard sample selection strategy based on the triplet loss function is designed to train the recognition network, realizing spatiotemporal feature clustering and improving the robustness and discriminability of the features. Experimental results show that, compared with TSN, the proposed method improves accuracy on the UCF101 dataset by 23.25 percentage points to 94.82%, and on the HMDB51 dataset by 20.27 percentage points to 44.03%. © 2023 Institute of Computing Technology. All rights reserved.
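The "hard sample selection strategy" built on the triplet loss can be read as FaceNet-style hard mining within a batch. The following is a minimal sketch of one plausible variant (batch-hard mining over clip embeddings); the function names, the margin value, and the use of squared Euclidean distance are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of batch-hard triplet mining: for each anchor, the hardest
# positive is the farthest same-class sample and the hardest negative is the
# closest different-class sample. All specifics here are assumptions.
import numpy as np

def pairwise_dist(emb):
    # Squared Euclidean distance matrix between embedding vectors.
    sq = np.sum(emb ** 2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    return np.maximum(d, 0.0)  # clamp tiny negative values from round-off

def batch_hard_triplet_loss(emb, labels, margin=0.3):
    """Mean hinge loss over batch-hard triplets (assumes every anchor
    has at least one positive and one negative in the batch)."""
    d = pairwise_dist(np.asarray(emb, dtype=float))
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    eye = np.eye(len(labels), dtype=bool)

    pos_mask = same & ~eye   # positives: same class, excluding the anchor itself
    neg_mask = ~same         # negatives: any different-class sample

    # Hardest positive: maximum distance among positives.
    hardest_pos = np.where(pos_mask, d, -np.inf).max(axis=1)
    # Hardest negative: minimum distance among negatives.
    hardest_neg = np.where(neg_mask, d, np.inf).min(axis=1)

    loss = np.maximum(hardest_pos - hardest_neg + margin, 0.0)
    return loss.mean()
```

Training on such hard triplets pulls same-action clips together and pushes different-action clips apart, which is the clustering effect the abstract attributes to this stage.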
Pages: 903-914
Page count: 11
References
31 items in total
  • [1] Herath S, Harandi M, Porikli F., Going deeper into action recognition: a survey, Image and Vision Computing, 60, pp. 4-21, (2017)
  • [2] Wang X L, Girshick R, Gupta A, et al., Non-local neural networks, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794-7803, (2018)
  • [3] Schroff F, Kalenichenko D, Philbin J., FaceNet: a unified embedding for face recognition and clustering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815-823, (2015)
  • [4] Wang L M, Xiong Y J, Wang Z, et al., Temporal segment networks: towards good practices for deep action recognition, Proceedings of the European Conference on Computer Vision, pp. 20-36, (2016)
  • [5] Laptev I, Marszalek M, Schmid C, et al., Learning realistic human actions from movies, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, (2008)
  • [6] Wang H, Schmid C., Action recognition with improved trajectories, Proceedings of the IEEE International Conference on Computer Vision, pp. 3551-3558, (2013)
  • [7] Donahue J, Hendricks L A, Guadarrama S, et al., Long-term recurrent convolutional networks for visual recognition and description, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625-2634, (2015)
  • [8] Ji S W, Xu W, Yang M, et al., 3D convolutional neural networks for human action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1, pp. 221-231, (2013)
  • [9] Tran D, Bourdev L, Fergus R, et al., Learning spatiotemporal features with 3D convolutional networks, Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497, (2015)
  • [10] Tran D, Wang H, Torresani L, et al., A closer look at spatiotemporal convolutions for action recognition, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6450-6459, (2018)