DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Cited by: 4
Authors
Wu, Wenhao [1]
Zhao, Yuxiang [1, 2]
Xu, Yanwu [3]
Tan, Xiao [1]
He, Dongliang [1]
Zou, Zhikang [1]
Ye, Jin [1]
Li, Yingying [1]
Yao, Mingde [1]
Dong, Zichao [1]
Shi, Yifeng [1]
Affiliations
[1] Baidu Inc., Beijing, People's Republic of China
[2] Chinese Academy of Sciences, Shenzhen Institute of Advanced Technology, Shenzhen, People's Republic of China
[3] University of Pittsburgh, Pittsburgh, PA 15260, USA
Keywords
neural networks; action recognition; video representation learning
DOI
10.1145/3474085.3475344
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal modeling and then average multiple snippet-level predictions to yield the final video-level prediction. Thus, their video-level prediction does not consider spatio-temporal features of how the video evolves along the temporal dimension. In this paper, we introduce a novel Dynamic Segment Aggregation (DSA) module to capture the relationships among snippets. More specifically, we generate a dynamic kernel for a convolutional operation that adaptively aggregates long-range temporal information among adjacent snippets. The DSA module is an efficient plug-and-play module and can be combined with off-the-shelf clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal overhead. We call the resulting video architecture DSANet. We conduct extensive experiments on several video recognition benchmarks (i.e., Mini-Kinetics-200, Kinetics-400, Something-Something V1 and ActivityNet) to show its superiority. Our proposed DSA module is shown to benefit various video recognition models significantly. For example, equipped with DSA modules, the top-1 accuracy of I3D ResNet-50 improves from 74.9% to 78.2% on Kinetics-400. Code is available at https://github.com/whwu95/DSANet.
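The contrast the abstract draws, between averaging snippet-level predictions and aggregating snippets with a dynamically generated convolution kernel, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see the repository linked above for that). The function names, the softmax normalization of the kernel, the zero padding, and the way the kernel is derived from a per-channel average over snippets are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def average_snippets(scores):
    """Baseline: average snippet-level class scores into one video-level score."""
    n = len(scores)
    return [sum(col) / n for col in zip(*scores)]


def dynamic_kernels(features, weights):
    """Generate one temporal kernel per channel from the input itself.

    features: U snippets x C channels (assumed already spatially pooled).
    weights:  K x C hypothetical learned parameters. Tap k of channel c gets
              the logit weights[k][c] * (mean of channel c over snippets).
    Assumption: kernels are softmax-normalized so the taps sum to 1.
    """
    num_snippets = len(features)
    context = [sum(col) / num_snippets for col in zip(*features)]
    return [softmax([row[c] * ctx for row in weights])
            for c, ctx in enumerate(context)]


def aggregate(features, kernels):
    """Channel-wise convolution along the snippet axis with zero padding."""
    num_snippets = len(features)
    num_channels = len(kernels)
    k = len(kernels[0])  # kernel size, assumed odd
    pad = k // 2
    out = [[0.0] * num_channels for _ in range(num_snippets)]
    for t in range(num_snippets):
        for c in range(num_channels):
            acc = 0.0
            for j in range(k):
                src = t + j - pad  # neighbouring snippet index
                if 0 <= src < num_snippets:
                    acc += kernels[c][j] * features[src][c]
            out[t][c] = acc
    return out


def dsa_sketch(features, weights):
    """Input-dependent kernel generation followed by snippet aggregation."""
    return aggregate(features, dynamic_kernels(features, weights))
```

Because the kernels are computed from the snippet features, two videos passed through the same `weights` are aggregated with different temporal filters, which is the sense in which the kernel is "dynamic" in this sketch.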
Pages: 1903-1911
Number of pages: 9