A component-based video content representation for action recognition

Cited by: 9
Authors
Adeli, Vida [1]
Fazl-Ersi, Ehsan [1]
Harati, Ahad [1]
Affiliations
[1] Ferdowsi Univ Mashhad, Dept Comp Engn, Mashhad 9177948944, Razavi Khorasan, Iran
Keywords
Actionness likelihood; Action recognition; Action components; LSTM; Three-stream convolutional neural network; Motion representation
DOI
10.1016/j.imavis.2019.08.009
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper investigates the challenging problem of action recognition in videos and proposes a new component-based approach for video content representation. Although satisfactory action recognition performance has already been obtained in certain scenarios, many existing solutions require fully annotated video datasets in which the region of the activity in each frame is specified by a bounding box. Another group of methods requires auxiliary techniques to extract human-related areas in the video frames before actions can be recognized accurately. In this paper, a Weakly-Supervised Learning (WSL) framework is introduced that eliminates the need for per-frame annotations and learns video representations that improve recognition accuracy and also highlight the activity-related regions within each frame. To this end, two new representation ideas are proposed: one focuses on representing the main components of an action, i.e., actionness regions, and the other focuses on encoding the background context to represent general and holistic cues. A three-stream CNN is developed, which takes the two proposed representations and combines them with a motion-encoding stream. Temporal cues in each of the three streams are modeled through an LSTM, and finally fully-connected neural network layers are used to fuse the streams and produce the final video representation. Experimental results on four challenging datasets demonstrate that the proposed Component-based Multi-stream CNN model (CM-CNN), trained in a WSL setting, outperforms the state of the art in action recognition, including fully-supervised approaches. © 2019 Elsevier B.V. All rights reserved.
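The data flow described in the abstract (three streams of per-frame features, one LSTM per stream, then fully-connected fusion) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the CNN feature extractors are replaced by random placeholder vectors, and the dimensions, the use of each LSTM's last hidden state, the single ReLU fusion layer, and all names (`LSTM`, `fuse`, `video_representation`) are assumptions made only for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Matrix-vector product on plain Python lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class LSTM:
    """Single-layer LSTM over a sequence of per-frame feature vectors."""

    GATES = ("i", "f", "o", "g")  # input, forget, output, candidate

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # One weight matrix per gate, applied to the concatenation [x_t ; h_{t-1}].
        self.W = {k: rand_matrix(hid_dim, in_dim + hid_dim, rng) for k in self.GATES}
        self.b = {k: [0.0] * hid_dim for k in self.GATES}

    def forward(self, seq):
        h = [0.0] * self.hid_dim
        c = [0.0] * self.hid_dim
        for x in seq:
            z = list(x) + h
            pre = {k: [a + b for a, b in zip(matvec(self.W[k], z), self.b[k])]
                   for k in self.GATES}
            i = [sigmoid(v) for v in pre["i"]]
            f = [sigmoid(v) for v in pre["f"]]
            o = [sigmoid(v) for v in pre["o"]]
            cand = [math.tanh(v) for v in pre["g"]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h  # assumption: the last hidden state summarizes the stream

def fuse(stream_outputs, W_fc, b_fc):
    # Concatenate the per-stream summaries and apply one FC layer with ReLU.
    z = [v for h in stream_outputs for v in h]
    return [max(0.0, s + b) for s, b in zip(matvec(W_fc, z), b_fc)]

def video_representation(streams, lstms, W_fc, b_fc):
    finals = [lstm.forward(seq) for lstm, seq in zip(lstms, streams)]
    return fuse(finals, W_fc, b_fc)

rng = random.Random(0)
T, feat_dim, hid_dim, out_dim = 5, 8, 4, 6  # hypothetical sizes

# Placeholder per-frame features for the three streams
# (action components, background context, motion); a real system would use CNN outputs.
streams = [[[rng.random() for _ in range(feat_dim)] for _ in range(T)]
           for _ in range(3)]
lstms = [LSTM(feat_dim, hid_dim, rng) for _ in range(3)]
W_fc = rand_matrix(out_dim, 3 * hid_dim, rng)
b_fc = [0.0] * out_dim

rep = video_representation(streams, lstms, W_fc, b_fc)
print(len(rep))
```

In this sketch the fused vector `rep` plays the role of the final video representation; a classifier over action labels would sit on top of it.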
Pages: 15
Related papers
50 in total
  • [1] Component-Based Representation in Automated Face Recognition
    Bonnen, Kathryn
    Klare, Brendan F.
    Jain, Anil K.
    [J]. IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2013, 8 (01) : 239 - 253
  • [2] ACTION RECOGNITION BASED ON KINEMATIC REPRESENTATION OF VIDEO DATA
    Sun, Xin
    Huang, Di
    Wang, Yunhong
    Qin, Jie
    [J]. 2014 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2014, : 1530 - 1534
  • [3] An overview of sparse representation based action recognition in video
    Ushapreethi, P.
    Lakshmipriya, G. G.
    [J]. 2018 2ND INTERNATIONAL CONFERENCE ON COMPUTER, COMMUNICATION, AND SIGNAL PROCESSING (ICCCSP): SPECIAL FOCUS ON TECHNOLOGY AND INNOVATION FOR SMART ENVIRONMENT, 2018, : 63 - 67
  • [4] Video action recognition based on visual rhythm representation
    Moreira, Thierry Pinheiro
    Menotti, David
    Pedrini, Helio
    [J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 71 (71)
  • [5] A DISTRIBUTION BASED VIDEO REPRESENTATION FOR HUMAN ACTION RECOGNITION
    Song, Yan
    Tang, Sheng
    Zheng, Yan-Tao
    Chua, Tat-Seng
    Zhang, Yongdong
    Lin, Shouxun
    [J]. 2010 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME 2010), 2010, : 772 - 777
  • [6] Component-based feature extraction and representation schemes for vehicle make and model recognition
    Lu, Lei
    Huang, Hua
    [J]. NEUROCOMPUTING, 2020, 372 : 92 - 99
  • [7] An Enhanced Independent Component-Based Human Facial Expression Recognition from Video
    Uddin, Md. Zia
    Lee, J. J.
    Kim, T. -S.
    [J]. IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2009, 55 (04) : 2216 - 2224
  • [8] A component-based framework for recognition systems
    Middendorf, M
    Peust, C
    Schacht, J
    [J]. READING AND LEARNING, 2004, 2956 : 153 - 165
  • [9] Learning hierarchical video representation for action recognition
    Li Q.
    Qiu Z.
    Yao T.
    Mei T.
    Rui Y.
    Luo J.
    [J]. International Journal of Multimedia Information Retrieval, 2017, 6 (1) : 85 - 98
  • [10] A Robust and Efficient Video Representation for Action Recognition
    Heng Wang
    Dan Oneata
    Jakob Verbeek
    Cordelia Schmid
    [J]. International Journal of Computer Vision, 2016, 119 : 219 - 238