Temporal Sequence Distillation: Towards Few-Frame Action Recognition in Videos

Cited by: 9
Authors
Zhang, Zhaoyang [1, 2]
Kuang, Zhanghui [2]
Luo, Ping [3]
Feng, Litong [2]
Zhang, Wei [2]
Affiliations
[1] Wuhan University, Wuhan, People's Republic of China
[2] SenseTime Research, Beijing, People's Republic of China
[3] Chinese University of Hong Kong, Hong Kong, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Video Action Recognition; Temporal Sequence Distillation;
DOI
10.1145/3240508.3240534
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Video Analytics Software as a Service (VA SaaS) has grown rapidly in recent years. Users typically access VA SaaS through a lightweight client. Because transmission bandwidth between the client and the cloud is usually limited and expensive, cloud video analysis algorithms that require little data transmission are highly beneficial. Although considerable research has been devoted to video analysis, to the best of our knowledge, little of it has addressed the transmission bandwidth limitation in SaaS. As a first attempt in this direction, this work introduces the problem of few-frame action recognition, which aims to maintain high recognition accuracy when only a few frames are accessed during both training and testing. Unlike previous work that processes dense frames, we present Temporal Sequence Distillation (TSD), which distills a long video sequence into a very short one for transmission. Trained end-to-end with 3D CNNs for video action recognition, TSD learns a compact and discriminative temporal and spatial representation of video frames. On the Kinetics dataset, TSD+I3D typically requires only 50% of the frames used by I3D [1], a state-of-the-art video action recognition algorithm, to achieve almost the same accuracy. The proposed TSD has three appealing advantages. First, TSD has a lightweight architecture and can be deployed on the client, e.g., a mobile device, to produce compressed representative frames and save transmission bandwidth. Second, TSD significantly reduces the computation needed to run video action recognition on the cloud with compressed frames, while maintaining high recognition accuracy. Third, TSD can be plugged in as a preprocessing module for any existing 3D CNN. Extensive experiments demonstrate the effectiveness and characteristics of TSD.
Pages: 257-264
Number of pages: 8