Temporal Sequence Distillation: Towards Few-Frame Action Recognition in Videos

Cited by: 9
Authors
Zhang, Zhaoyang [1, 2]
Kuang, Zhanghui [2]
Luo, Ping [3]
Feng, Litong [2]
Zhang, Wei [2]
Affiliations
[1] Wuhan Univ, Wuhan, Peoples R China
[2] SenseTime Res, Beijing, Peoples R China
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video Action Recognition; Temporal Sequence Distillation
DOI
10.1145/3240508.3240534
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Video Analytics Software as a Service (VA SaaS) has grown rapidly in recent years. Users typically access VA SaaS through a lightweight client. Because transmission bandwidth between the client and the cloud is usually limited and expensive, cloud video analysis algorithms that require little data transmission are highly beneficial. Although considerable research has been devoted to video analysis, to the best of our knowledge, little of it has addressed the transmission bandwidth limitation in SaaS. As the first attempt in this direction, this work introduces the problem of few-frame action recognition, which aims to maintain high recognition accuracy while accessing only a few frames during both training and testing. Unlike previous work that processes dense frames, we present Temporal Sequence Distillation (TSD), which distills a long video sequence into a very short one for transmission. Trained end-to-end with 3D CNNs for video action recognition, TSD learns a compact and discriminative temporal and spatial representation of video frames. On the Kinetics dataset, TSD+I3D typically requires only 50% of the frames used by I3D [1], a state-of-the-art video action recognition algorithm, to achieve almost the same accuracy. The proposed TSD has three appealing advantages. First, TSD has a lightweight architecture and can be deployed on the client, e.g., on mobile devices, to produce compressed representative frames that save transmission bandwidth. Second, TSD significantly reduces the computation needed to run video action recognition on the cloud with compressed frames, while maintaining high recognition accuracy. Third, TSD can be plugged in as a preprocessing module for any existing 3D CNN. Extensive experiments demonstrate the effectiveness and characteristics of TSD.
Pages: 257 - 264
Page count: 8
Related papers
50 in total
  • [11] Temporal Hallucinating for Action Recognition with Few Still Images
    Wang, Yali
    Zhou, Lei
    Qiao, Yu
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 5314 - 5322
  • [12] ProtoGAN: Towards Few Shot Learning for Action Recognition
    Dwivedi, Sai Kumar
    Gupta, Vikram
    Mitra, Rahul
    Ahmed, Shuaib
    Jain, Arjun
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 1308 - 1316
  • [13] A temporal belief filter improving human action recognition in videos
    Ramasso, Emmanuel
    Rombaut, Michele
    Pellerin, Denis
    2006 IEEE International Conference on Acoustics, Speech and Signal Processing, Vols 1-13, 2006, : 1389 - 1392
  • [14] Multi-Temporal Convolutions for Human Action Recognition in Videos
    Stergiou, Alexandros
    Poppe, Ronald
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [15] A Temporal Sequence Learning for Action Recognition and Prediction
    Cho, Sangwoo
    Foroosh, Hassan
    2018 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2018), 2018, : 352 - 361
  • [16] Human action recognition in drone videos using a few aerial training examples
    Sultani, Waqas
    Shah, Mubarak
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2021, 206
  • [17] Cross-domain few-shot action recognition with unlabeled videos
    Wang, Xiang
    Zhang, Shiwei
    Qing, Zhiwu
    Lv, Yiliang
    Gao, Changxin
    Sang, Nong
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 233
  • [18] Zero and few shot action recognition in videos with caption semantic and generative assist
    Thrilokachandran G.
    Hosalli Ramappa M.
    International Journal of Information Technology, 2024, 16 (5) : 3121 - 3133
  • [19] Elastic temporal alignment for few-shot action recognition
    Pan, Fei
    Xu, Chunlei
    Zhang, Hongjie
    Guo, Jie
    Guo, Yanwen
    IET COMPUTER VISION, 2023, 17 (01) : 39 - 50
  • [20] CTC Network with Statistical Language Modeling for Action Sequence Recognition in Videos
    Lin, Mengxi
    Inoue, Nakamasa
    Shinoda, Koichi
    PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17), 2017, : 393 - 401