Action Keypoint Network for Efficient Video Recognition

Cited by: 2
Authors
Chen, Xu [1 ,2 ]
Han, Yahong [1 ,2 ,3 ]
Wang, Xiaohan [4 ]
Sun, Yifan [5 ]
Yang, Yi [4 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300072, Peoples R China
[2] Tianjin Univ, Tianjin Key Lab Machine Learning, Tianjin 300072, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310000, Peoples R China
[5] Baidu Res, Beijing 100000, Peoples R China
Keywords
Video recognition; space-time interest points; deep learning; point cloud
DOI
10.1109/TIP.2022.3191461
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Reducing redundancy is crucial for improving the efficiency of video recognition models. An effective approach is to select informative content from the holistic video, yielding a popular family of dynamic video recognition methods. However, existing dynamic methods focus on either temporal or spatial selection independently, neglecting the reality that redundancy is usually both spatial and temporal. Moreover, their selected content is usually cropped with fixed shapes (e.g., temporally cropped frames, spatially cropped patches), while the realistic distribution of informative content can be much more diverse. With these two insights, this paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net). From different frames and positions, AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of "action keypoints" and then transforms video recognition into point cloud classification. More concretely, AK-Net has two steps, i.e., keypoint selection and point cloud classification. First, it inputs the video into a baseline network and outputs a feature map from an intermediate layer. We view each pixel on this feature map as a spatial-temporal point and select some informative keypoints using self-attention. Second, AK-Net devises a ranking criterion to arrange the keypoints into an ordered 1D sequence. Since the video is represented as a 1D sequence after the specified layer, AK-Net transforms the subsequent layers into a point cloud classification sub-net by compacting the original 2D convolutional kernels into 1D kernels. Consequently, AK-Net brings two-fold benefits for efficiency: the keypoint selection step collects informative content within arbitrary shapes and increases the efficiency of modeling spatial-temporal dependencies, while the point cloud classification step further reduces the computational cost by compacting the convolutional kernels. Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
Pages: 4980-4993
Page count: 14
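For readers who want a concrete picture of the pipeline, below is a minimal PyTorch sketch of the two steps described in the abstract: scoring spatial-temporal points on an intermediate [B, C, T, H, W] feature map to keep the top-k as "action keypoints", then classifying the ordered keypoint sequence with 1D convolutions. The scoring head (a 1x1x1 convolution standing in for the paper's self-attention), the ranking rule (restoring raster order among the selected indices), and all layer sizes are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn


class KeypointSelector(nn.Module):
    """Select the k most informative points from a [B, C, T, H, W] feature map."""

    def __init__(self, channels: int, num_keypoints: int):
        super().__init__()
        self.num_keypoints = num_keypoints
        # Hypothetical per-point informativeness score; the paper uses
        # self-attention, a 1x1x1 convolution stands in here for brevity.
        self.score = nn.Conv3d(channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = feat.shape
        scores = self.score(feat).flatten(1)      # [B, T*H*W]
        points = feat.flatten(2).transpose(1, 2)  # [B, T*H*W, C]
        # Keep the top-k scoring points; re-sorting the selected indices
        # arranges the keypoints into an ordered 1D sequence (one plausible
        # stand-in for the ranking criterion mentioned in the abstract).
        idx = scores.topk(self.num_keypoints, dim=1).indices.sort(dim=1).values
        keypoints = points.gather(1, idx.unsqueeze(-1).expand(-1, -1, c))
        return keypoints.transpose(1, 2)          # [B, C, k]


class PointCloudHead(nn.Module):
    """Classify the ordered keypoint sequence with compact 1D convolutions."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        return self.fc(self.net(keypoints).squeeze(-1))


if __name__ == "__main__":
    feat = torch.randn(2, 256, 8, 14, 14)  # intermediate backbone feature map
    selector = KeypointSelector(channels=256, num_keypoints=128)
    head = PointCloudHead(channels=256, num_classes=174)
    logits = head(selector(feat))
    print(logits.shape)  # torch.Size([2, 174])

Note that keeping only 128 of the 8 x 14 x 14 = 1568 candidate points is what yields the efficiency gain claimed in the abstract: the subsequent 1D convolutions operate on roughly 8% of the original feature positions.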