DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Cited by: 4
Authors
Wu, Wenhao [1]
Zhao, Yuxiang [1, 2]
Xu, Yanwu [3]
Tan, Xiao [1]
He, Dongliang [1]
Zou, Zhikang [1]
Ye, Jin [1]
Li, Yingying [1]
Yao, Mingde [1]
Dong, Zichao [1]
Shi, Yifeng [1]
Affiliations
[1] Baidu Inc., Beijing, People's Republic of China
[2] Chinese Academy of Sciences, Shenzhen Institute of Advanced Technology, Shenzhen, People's Republic of China
[3] University of Pittsburgh, Pittsburgh, PA 15260, USA
Keywords
neural networks; action recognition; video representation learning
DOI
10.1145/3474085.3475344
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal modeling and then average multiple snippet-level predictions to yield the final video-level prediction. Thus, their video-level prediction does not account for how the video's spatio-temporal features evolve along the temporal dimension. In this paper, we introduce a novel Dynamic Segment Aggregation (DSA) module to capture relationships among snippets. More specifically, we generate a dynamic kernel for a convolutional operation that adaptively aggregates long-range temporal information among adjacent snippets. The DSA module is an efficient plug-and-play module and can be combined with off-the-shelf clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal overhead. We call the final video architecture DSANet. We conduct extensive experiments on several video recognition benchmarks (i.e., Mini-Kinetics-200, Kinetics-400, Something-Something V1 and ActivityNet) to show its superiority. Our proposed DSA module is shown to benefit various video recognition models significantly. For example, equipped with DSA modules, the top-1 accuracy of I3D ResNet-50 improves from 74.9% to 78.2% on Kinetics-400. Code is available at https://github.com/whwu95/DSANet.
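The aggregation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it only shows the general pattern of deriving a convolution kernel from the input itself and applying it along the snippet axis. The function name `dynamic_segment_aggregation`, the projection matrix `weights`, the kernel size of 3, the softmax normalization, and the zero padding are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_segment_aggregation(features, weights):
    """Aggregate snippet-level features with an input-dependent kernel.

    features: list of U snippets, each a list of C pooled channel values.
    weights:  hypothetical K x U projection that maps a channel's
              U-long snippet series to K kernel logits.
    Returns a U x C list with the same layout as `features`.
    """
    num_snippets = len(features)
    num_channels = len(features[0])
    kernel_size = len(weights)
    pad = kernel_size // 2
    output = [[0.0] * num_channels for _ in range(num_snippets)]
    for c in range(num_channels):
        # Series of this channel's values across all snippets.
        series = [features[u][c] for u in range(num_snippets)]
        # The kernel is generated from the input, hence "dynamic".
        logits = [sum(w * x for w, x in zip(row, series)) for row in weights]
        kernel = softmax(logits)
        # 1D convolution along the snippet axis with zero padding.
        for u in range(num_snippets):
            acc = 0.0
            for k in range(kernel_size):
                src = u + k - pad
                if 0 <= src < num_snippets:
                    acc += kernel[k] * series[src]
            output[u][c] = acc
    return output


# Toy example: 8 snippets, 4 channels, kernel size 3 (all values arbitrary).
random.seed(0)
toy_features = [[random.random() for _ in range(4)] for _ in range(8)]
toy_weights = [[random.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(3)]
toy_output = dynamic_segment_aggregation(toy_features, toy_weights)
```

Because the kernel depends on the snippet series, two videos with different temporal evolution are mixed with different weights, in contrast to plain averaging of snippet-level predictions, which applies the same fixed weights to every video.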
Pages: 1903-1911
Page count: 9