Temporal Sequence Distillation: Towards Few-Frame Action Recognition in Videos

Cited by: 9

Authors:

Zhang, Zhaoyang [1,2]

Kuang, Zhanghui [2]

Luo, Ping [3]

Feng, Litong [2]

Zhang, Wei [2]

Affiliations:

[1] Wuhan Univ, Wuhan, Peoples R China

[2] SenseTime Res, Beijing, Peoples R China

[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China

Source:

PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018

Funding:

National Natural Science Foundation of China;

Keywords:

Video Action Recognition; Temporal Sequence Distillation;

DOI:

10.1145/3240508.3240534

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Video Analytics Software as a Service (VA SaaS) has grown rapidly in recent years. Users typically access VA SaaS through a lightweight client. Because the transmission bandwidth between the client and the cloud is usually limited and expensive, there is great benefit in designing cloud video analysis algorithms that require little data transmission. Although considerable research has been devoted to video analysis, to the best of our knowledge little of it has addressed the transmission bandwidth limitation in SaaS. As the first attempt in this direction, this work introduces the problem of few-frame action recognition, which aims to maintain high recognition accuracy while accessing only a few frames during both training and testing. Unlike previous work that processes dense frames, we present Temporal Sequence Distillation (TSD), which distills a long video sequence into a very short one for transmission. Through end-to-end training with 3D CNNs for video action recognition, TSD learns a compact and discriminative temporal and spatial representation of video frames. On the Kinetics dataset, TSD+I3D typically requires only 50% as many frames as I3D [1], a state-of-the-art video action recognition algorithm, to achieve almost the same accuracy. The proposed TSD has three appealing advantages. First, TSD has a lightweight architecture and can be deployed on the client, e.g., mobile devices, to produce compressed representative frames and save transmission bandwidth. Second, TSD significantly reduces the computation needed to run video action recognition on the compressed frames in the cloud, while maintaining high recognition accuracy. Third, TSD can be plugged in as a preprocessing module for any existing 3D CNN. Extensive experiments demonstrate the effectiveness and characteristics of TSD.


Pages: 257-264

Number of pages: 8
