Video-level Multi-model Fusion for Action Recognition

Cited by: 3
Authors
Wang, Xiaomin [1]
Zhang, Junsan [1]
Wang, Leiquan [1]
Yu, Philip S. [2]
Zhu, Jie [3]
Li, Haisheng [4]
Affiliations
[1] China Univ Petr East China, Coll Comp Sci & Technol, Qingdao, Shandong, Peoples R China
[2] Univ Illinois, Dept Comp Sci, Chicago, IL 60680 USA
[3] Natl Police Univ Criminal Justice, Dept Informat Management, Hangzhou, Peoples R China
[4] Beijing Technol & Business Univ, Beijing Key Lab Big Data Technol Food Safety, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
action recognition; video-level recognition; 3D convolution; multi-model fusion
DOI
10.1145/3357384.3357935
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Approaches based on spatio-temporal features, such as two-stream methods and 3D convolution methods, have emerged for video action recognition. However, current methods suffer from partial observation or are restricted to modeling a single type of information. Segment-level recognition results obtained from dense sampling cannot represent the entire video and therefore lead to partial observation. Moreover, it is hard for a single model to capture the complementary spatial, temporal, and spatio-temporal information in a video at the same time. The challenge, therefore, is to build a video-level representation and capture multiple types of information. In this paper, a video-level multi-model fusion method for action recognition is proposed to solve these problems. First, an efficient video-level 3D convolution model, which assembles segment-level 3D convolution models, is proposed to capture the global information in the video. Second, a multi-model fusion architecture is proposed for video action recognition to capture multiple types of information. The spatial, temporal, and spatio-temporal information is aggregated with an SVM classifier. Experimental results show that this method achieves state-of-the-art performance on the UCF-101 dataset (97.6%) without pre-training on Kinetics.
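The two ideas in the abstract, video-level aggregation of segment-level results and fusion of several information streams, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how segment-level models are assembled, so plain averaging of per-segment class scores is an assumption here, and the function names (`video_level_scores`, `fuse_streams`) and stream keys (`spatial`, `temporal`, `spatio_temporal`) are hypothetical. The SVM step is left out to keep the sketch dependency-free; the fused vector is the kind of input a classifier such as an SVM would then be trained on.

```python
from statistics import fmean
from typing import Dict, List, Sequence


def video_level_scores(segment_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-segment class scores into one video-level score vector.

    Averaging is an assumed aggregation rule, not taken from the paper.
    """
    if not segment_scores:
        raise ValueError("need at least one segment")
    num_classes = len(segment_scores[0])
    if any(len(seg) != num_classes for seg in segment_scores):
        raise ValueError("all segments must score the same number of classes")
    return [fmean(seg[c] for seg in segment_scores) for c in range(num_classes)]


def fuse_streams(
    stream_scores: Dict[str, Sequence[Sequence[float]]],
    order: Sequence[str] = ("spatial", "temporal", "spatio_temporal"),
) -> List[float]:
    """Concatenate the video-level vectors of each stream in a fixed order.

    The result is a single feature vector per video, suitable as input to a
    downstream classifier (the abstract names an SVM for this role).
    """
    fused: List[float] = []
    for name in order:
        fused.extend(video_level_scores(stream_scores[name]))
    return fused


# Toy example: two classes, a few segments per stream (made-up numbers).
example = {
    "spatial": [[0.5, 0.5], [1.0, 0.0]],
    "temporal": [[0.25, 0.75]],
    "spatio_temporal": [[0.0, 1.0], [0.5, 0.5]],
}
fused_vector = fuse_streams(example)
print(fused_vector)
```

Concatenating the streams, rather than averaging them together, keeps each stream's scores visible to the downstream classifier, which can then learn how much weight each one deserves.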
Pages: 159-168
Page count: 10