DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Cited by: 4
Authors
Wu, Wenhao [1]
Zhao, Yuxiang [1, 2]
Xu, Yanwu [3]
Tan, Xiao [1]
He, Dongliang [1]
Zou, Zhikang [1]
Ye, Jin [1]
Li, Yingying [1]
Yao, Mingde [1]
Dong, Zichao [1]
Shi, Yifeng [1]
Affiliations
[1] Baidu Inc., Beijing, People's Republic of China
[2] Chinese Academy of Sciences, Shenzhen Institute of Advanced Technology, Shenzhen, People's Republic of China
[3] University of Pittsburgh, Pittsburgh, PA 15260, USA
Keywords
neural networks; action recognition; video representation learning
DOI
10.1145/3474085.3475344
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal modeling and then average multiple snippet-level predictions to yield the final video-level prediction. Thus, their video-level prediction does not consider spatio-temporal features of how the video evolves along the temporal dimension. In this paper, we introduce a novel Dynamic Segment Aggregation (DSA) module to capture the relationships among snippets. More specifically, we attempt to generate a dynamic kernel for a convolutional operation that adaptively aggregates long-range temporal information among adjacent snippets. The DSA module is an efficient plug-and-play module that can be combined with off-the-shelf clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal overhead. We call the resulting video architecture DSANet. We conduct extensive experiments on several video recognition benchmarks (Mini-Kinetics-200, Kinetics-400, Something-Something V1, and ActivityNet) to show its superiority. Our proposed DSA module is shown to benefit various video recognition models significantly. For example, equipped with DSA modules, the top-1 accuracy of I3D ResNet-50 is improved from 74.9% to 78.2% on Kinetics-400. Code is available at https://github.com/whwu95/DSANet.
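The aggregation idea described in the abstract (a convolution kernel generated from the input and applied across neighbouring snippets) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which is in the repository linked above. The function name `dynamic_segment_aggregation`, the projection matrix `w_gen`, the mean-pooled context, and the softmax normalisation are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def dynamic_segment_aggregation(feats, w_gen):
    """Illustrative sketch only; names and design are hypothetical.

    feats: (U, C) array, one C-dimensional feature per snippet (U snippets).
    w_gen: (C, K) array, an assumed projection that maps a pooled context
           vector to K kernel logits (K odd).
    Returns the aggregated (U, C) features and the (K,) generated kernel.
    """
    feats = np.asarray(feats, dtype=float)
    num_snippets, _ = feats.shape
    kernel_size = w_gen.shape[1]

    # Pool over snippets to get a video-level context vector of shape (C,).
    context = feats.mean(axis=0)

    # Generate the kernel from the input itself, so it changes per video.
    kernel = softmax(context @ w_gen)

    # Zero-pad along the snippet axis so the output keeps U snippets.
    pad = kernel_size // 2
    padded = np.pad(feats, ((pad, pad), (0, 0)))

    # 1-D convolution along the snippet axis, sharing the kernel over channels.
    out = np.zeros_like(feats)
    for k in range(kernel_size):
        out += kernel[k] * padded[k:k + num_snippets]
    return out, kernel


# Usage: 8 snippets with 16-dimensional features and a 3-tap kernel.
rng = np.random.default_rng(0)
snippet_feats = rng.normal(size=(8, 16))
projection = rng.normal(size=(16, 3))
aggregated, kernel = dynamic_segment_aggregation(snippet_feats, projection)
print(aggregated.shape, kernel.shape)
```

Because the kernel is computed from the pooled features rather than stored as a fixed weight, two videos with different content are mixed across snippets with different weights, which is the sense of "dynamic" assumed in this sketch.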
Pages: 1903-1911
Page count: 9