Spatial and Temporal Features Aggregation Convolutional Network Model Based on Locality-Constrained Affine Subspace Coding

Authors
Zhang, Bing-Bing [1]
Li, Pei-Hua [1]
Sun, Qiu-Le [1]
Affiliation
[1] School of Information and Communication Engineering, Dalian University of Technology, Dalian, Liaoning, 116033, China
Funding
National Natural Science Foundation of China
Keywords
Virtual reality - Image classification - Image coding - Semantics - Video signal processing - Security systems - Image retrieval - Human computer interaction - Network architecture - Computer hardware description languages
DOI
10.11897/SP.J.1016.2020.01589
Abstract
Video-based human action recognition is an important task in computer vision, widely used in video surveillance, virtual reality and human-computer interaction. It is also very challenging because of the large volume of video data and the high demands it places on computing hardware. An effective video representation needs to consider spatial and temporal cues simultaneously. In recent years, researchers have worked to develop general network architectures for video classification. 3D spatial-temporal convolutions can potentially learn complicated spatial-temporal dependencies, but the large number of parameters in 3D CNNs makes them hard to train in practice. Two-stream architectures decompose the video into motion and appearance streams, train a separate CNN for each stream, and fuse the outputs at the end. Among these successful architectures, the two-stream convolutional network has been highly influential in academic research: it can model both the appearance and motion information in videos, and it has become an important baseline model for action recognition. Two-stream architectures essentially learn a classifier that operates on individual frames or short clips of a few frames, possibly enforcing consensus of classification scores over different segments of the video. At test time, 25 uniformly sampled frames are classified independently and the classification scores are averaged to obtain the final prediction. However, such architectures mainly learn the spatial information of a single frame and the temporal information of a few frames, and thus fail to effectively model long-term information across the whole video. To overcome this problem, we propose a spatial and temporal features aggregation convolutional network model based on locality-constrained affine subspace coding (LASC). The LASC coding method has achieved excellent performance in image classification and image retrieval tasks.
LASC leverages the semantic probabilities of local patches to learn the aggregation weights and construct a semantic affine subspace dictionary, which produces more semantic and discriminative global image representations. Inspired by the classical LASC coding method, we design a LASC-based structure layer and insert it at the last convolutional layer of the spatial stream and the temporal stream to obtain a more robust high-dimensional representation; the two fully-connected layers of the two-stream architecture are replaced entirely. The core of the network is a locality-constrained affine subspace coding layer, which can be embedded in the two-stream convolutional network to aggregate the spatial and temporal features of the whole video into a global spatial and temporal video representation. This layer consists of two sub-layers, one computing weight coefficients and one performing affine subspace coding, and its parameters can be optimized jointly with the other parameters of the convolutional network for end-to-end learning. In addition, three regularizations in the cost function, i.e., soft orthogonality regularization, infinity-norm regularization and spectral-norm regularization, are studied to ensure the orthogonality of the affine subspace bases during training. The proposed method is evaluated on the commonly used UCF101, HMDB51 and Something-V1 datasets, where its accuracy is 1.7%, 8.7% and 4.3% higher, respectively, than that of the classical two-stream convolutional network. It also achieves superior or competitive performance in comparison to state-of-the-art methods. © 2020, Science Press. All rights reserved.
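The abstract does not give the exact form of the orthogonality terms added to the cost function. As a rough illustration only, the sketch below computes one commonly used soft orthogonality penalty: the squared Frobenius norm of B^T B - I for a basis matrix B whose columns span a subspace. The function names, the `weight` parameter and the list-of-rows layout are assumptions made for this example, not details taken from the paper.

```python
# Minimal sketch, assuming the penalty is weight * ||B^T B - I||_F^2.
# All names here are hypothetical; the paper's own formulation may differ.

def gram_minus_identity(basis):
    """Return B^T B - I, where `basis` is a d x k matrix given as a list of rows."""
    d = len(basis)
    k = len(basis[0])
    residual = []
    for i in range(k):
        row = []
        for j in range(k):
            # Inner product of column i and column j of the basis.
            dot = sum(basis[r][i] * basis[r][j] for r in range(d))
            row.append(dot - (1.0 if i == j else 0.0))
        residual.append(row)
    return residual


def soft_orthogonality_penalty(basis, weight=1.0):
    """Squared Frobenius norm of B^T B - I, scaled by `weight`.

    The value is zero exactly when the columns of `basis` are orthonormal,
    and grows as the columns become correlated or lose unit length.
    """
    residual = gram_minus_identity(basis)
    return weight * sum(v * v for row in residual for v in row)


if __name__ == "__main__":
    orthonormal = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    duplicated = [[1.0, 1.0], [0.0, 0.0]]  # two identical columns
    print(soft_orthogonality_penalty(orthonormal))
    print(soft_orthogonality_penalty(duplicated))
```

In a training setup, a term of this kind would typically be added to the classification loss once per subspace, so that gradient descent pulls each basis back toward orthonormality instead of enforcing it as a hard constraint.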
Pages: 1589-1603