DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Cited by: 4
Authors
Wu, Wenhao [1]
Zhao, Yuxiang [1, 2]
Xu, Yanwu [3]
Tan, Xiao [1]
He, Dongliang [1]
Zou, Zhikang [1]
Ye, Jin [1]
Li, Yingying [1]
Yao, Mingde [1]
Dong, Zichao [1]
Shi, Yifeng [1]
Affiliations
[1] Baidu Inc, Beijing, Peoples R China
[2] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen, Peoples R China
[3] Univ Pittsburgh, Pittsburgh, PA 15260 USA
Keywords
neural networks; action recognition; video representation learning
DOI
10.1145/3474085.3475344
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal modeling and then average multiple snippet-level predictions to yield the final video-level prediction. Thus, their video-level prediction does not consider spatio-temporal features of how the video evolves along the temporal dimension. In this paper, we introduce a novel Dynamic Segment Aggregation (DSA) module to capture the relationships among snippets. More specifically, we generate a dynamic kernel for a convolutional operation that adaptively aggregates long-range temporal information among adjacent snippets. The DSA module is an efficient plug-and-play module and can be combined with off-the-shelf clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal overhead. We call the final video architecture DSANet. We conduct extensive experiments on several video recognition benchmarks (i.e., Mini-Kinetics-200, Kinetics-400, Something-Something V1 and ActivityNet) to show its superiority. Our proposed DSA module is shown to benefit various video recognition models significantly. For example, equipped with DSA modules, the top-1 accuracy of I3D ResNet-50 is improved from 74.9% to 78.2% on Kinetics-400. Code is available at https://github.com/whwu95/DSANet.
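The idea of aggregating snippet features with a dynamically generated convolution kernel can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (see the repository linked in the abstract for that). The function names, the mean-pooled context, the single linear projection `w_gen`, the softmax normalization, the depthwise layout, and the zero padding are all assumptions made for illustration.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def generate_kernels(feats, w_gen, kernel_size):
    """Build one temporal kernel per channel from the input itself.

    feats: list of T snippet features, each a list of C floats.
    w_gen: hypothetical C x (C * kernel_size) projection weights.
    """
    num_snippets, channels = len(feats), len(feats[0])
    # Global context: average each channel over all snippets.
    context = [sum(f[c] for f in feats) / num_snippets for c in range(channels)]
    kernels = []
    for c in range(channels):
        logits = [
            sum(context[i] * w_gen[i][c * kernel_size + k] for i in range(channels))
            for k in range(kernel_size)
        ]
        # Assumption: normalize each kernel so its weights sum to 1.
        kernels.append(softmax(logits))
    return kernels


def aggregate(feats, kernels):
    """Depthwise 1D convolution along the snippet axis with zero padding."""
    num_snippets, channels = len(feats), len(feats[0])
    kernel_size = len(kernels[0])
    pad = kernel_size // 2
    out = [[0.0] * channels for _ in range(num_snippets)]
    for t in range(num_snippets):
        for c in range(channels):
            for k in range(kernel_size):
                src = t + k - pad  # index of the neighboring snippet
                if 0 <= src < num_snippets:
                    out[t][c] += kernels[c][k] * feats[src][c]
    return out


random.seed(0)
T, C, K = 8, 4, 3  # snippets, channels, kernel size (illustrative values)
feats = [[random.gauss(0, 1) for _ in range(C)] for _ in range(T)]
w_gen = [[random.gauss(0, 0.1) for _ in range(C * K)] for _ in range(C)]
kernels = generate_kernels(feats, w_gen, K)
video_feats = aggregate(feats, kernels)
print(len(video_feats), len(video_feats[0]))
```

Because the kernels are computed from the input video rather than stored as fixed weights, each video mixes its neighboring snippets differently, which is the sense in which the aggregation is "dynamic" in this sketch.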
Pages: 1903-1911
Number of pages: 9