Temporal Sequence Distillation: Towards Few-Frame Action Recognition in Videos

Cited by: 9

Authors:

Zhang, Zhaoyang [1,2]

Kuang, Zhanghui [2]

Luo, Ping [3]

Feng, Litong [2]

Zhang, Wei [2]

Affiliations:

[1] Wuhan Univ, Wuhan, Peoples R China

[2] SenseTime Res, Beijing, Peoples R China

[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China

Source:

PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018

Funding:

National Natural Science Foundation of China;

Keywords:

Video Action Recognition; Temporal Sequence Distillation;

DOI:

10.1145/3240508.3240534

Chinese Library Classification (CLC) number:

TP301 [Theory and methods];

Subject classification code:

081202;

Abstract:

Video Analytics Software as a Service (VA SaaS) has grown rapidly in recent years. Users typically access VA SaaS through a lightweight client. Because the transmission bandwidth between client and cloud is usually limited and expensive, cloud video analysis algorithms that require little data transmission are highly beneficial. Although considerable research has been devoted to video analysis, to the best of our knowledge, little of it has addressed the transmission bandwidth limitation in SaaS. As a first attempt in this direction, this work introduces the problem of few-frame action recognition, which aims to maintain high recognition accuracy while accessing only a few frames during both training and testing. Unlike previous work that processes dense frames, we present Temporal Sequence Distillation (TSD), which distills a long video sequence into a very short one for transmission. Trained end-to-end with 3D CNNs for video action recognition, TSD learns a compact and discriminative temporal and spatial representation of video frames. On the Kinetics dataset, TSD+I3D typically requires only 50% of the frames used by I3D [1], a state-of-the-art video action recognition algorithm, to achieve almost the same accuracy. The proposed TSD has three appealing advantages. First, TSD has a lightweight architecture and can be deployed on the client, e.g., on mobile devices, to produce compressed representative frames and save transmission bandwidth. Second, TSD significantly reduces the computation needed to run video action recognition on the cloud with compressed frames, while maintaining high recognition accuracy. Third, TSD can be plugged in as a preprocessing module for any existing 3D CNN. Extensive experiments show the effectiveness and characteristics of TSD.
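
The idea of distilling a long frame sequence into a short one can be illustrated with a toy sketch. The abstract does not specify the TSD architecture, so the code below is only an assumption-laden illustration: it mixes T input frames into K output frames using softmax weights over time. The names `distill_frames` and `scores` are hypothetical, and in a real system the scores would presumably be produced by a trained network rather than supplied by hand.

```python
import math
from typing import List

Frame = List[float]  # one frame, flattened to a list of pixel values


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distill_frames(frames: List[Frame], scores: List[List[float]]) -> List[Frame]:
    """Mix T input frames into K output frames (illustrative only).

    `scores` is a hypothetical K x T matrix: row k holds the relevance of
    each input frame to output frame k. This is an assumed stand-in, not
    the architecture described in the paper.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    if any(len(row) != len(frames) for row in scores):
        raise ValueError("each score row needs one entry per input frame")
    n_pixels = len(frames[0])
    distilled = []
    for row in scores:
        weights = softmax(row)  # convex weights over the T input frames
        mixed = [
            sum(w * frame[p] for w, frame in zip(weights, frames))
            for p in range(n_pixels)
        ]
        distilled.append(mixed)
    return distilled


# Toy input: T = 4 frames of 2 pixels each, distilled to K = 2 frames.
frames = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
uniform_scores = [[0.0] * 4, [0.0] * 4]  # equal scores reduce to plain averaging
short_clip = distill_frames(frames, uniform_scores)
transmitted_fraction = len(short_clip) / len(frames)
```

With K = 2 and T = 4, only half of the original frame count would need to be sent to the cloud, which mirrors the 50% figure quoted for TSD+I3D on Kinetics, although the toy weights here say nothing about accuracy.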

Pages: 257-264

Page count: 8

50 items in total

[1] Hybrid embedding for multimodal few-frame action recognition
Shafizadegan, Fatemeh
Naghsh-Nilchi, Ahmad Reza
Shabaninia, Elham
MULTIMEDIA SYSTEMS, 2025, 31 (02)
[2] FTAN: Frame-to-frame temporal alignment network with contrastive learning for few-shot action recognition
Yu, Bin
Hou, Yonghong
Guo, Zihui
Gao, Zhiyi
Li, Yueyang
IMAGE AND VISION COMPUTING, 2024, 149
[3] Analysis of Temporal Coherence in Videos for Action Recognition
Saleh, Adel
Abdel-Nasser, Mohamed
Akram, Farhan
Garcia, Miguel Angel
Puig, Domenec
IMAGE ANALYSIS AND RECOGNITION (ICIAR 2016), 2016, 9730 : 325 - 332
[4] Action Recognition in Videos with Temporal Segments Fusions
Fang, Yuanye
Zhang, Rui
Wang, Qiu-Feng
Huang, Kaizhu
ADVANCES IN BRAIN INSPIRED COGNITIVE SYSTEMS, 2020, 11691 : 244 - 253
[5] Temporal Segment Networks for Action Recognition in Videos
Wang, Limin
Xiong, Yuanjun
Wang, Zhe
Qiao, Yu
Lin, Dahua
Tang, Xiaoou
Van Gool, Luc
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019, 41 (11) : 2740 - 2755
[6] Action density based frame sampling for human action recognition in videos
Lin, Jie
Mu, Zekun
Zhao, Tianqing
Zhang, Hanlin
Yang, Xinyu
Zhao, Peng
JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2023, 90
[7] SoccerKDNet: A Knowledge Distillation Framework for Action Recognition in Soccer Videos
Bose, Sarosij
Sarkar, Saikat
Chakrabarti, Amlan
PATTERN RECOGNITION AND MACHINE INTELLIGENCE, PREMI 2023, 2023, 14301 : 457 - 464
[8] Spatial-temporal pooling for action recognition in videos
Wang, Jiaming
Shao, Zhenfeng
Huang, Xiao
Lu, Tao
Zhang, Ruiqian
Lv, Xianwei
NEUROCOMPUTING, 2021, 451 : 265 - 278
[9] Detecting Hands in Egocentric Videos: Towards Action Recognition
Cartas, Alejandro
Dimiccoli, Mariella
Radeva, Petia
COMPUTER AIDED SYSTEMS THEORY - EUROCAST 2017, PT II, 2018, 10672 : 330 - 338
[10] Commonsense Knowledge Prompting for Few-Shot Action Recognition in Videos
Shi, Yuheng
Wu, Xinxiao
Lin, Hanxi
Luo, Jiebo
IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 8395 - 8405
