Video-based human action recognition is an important task in computer vision, with wide applications in video surveillance, virtual reality and human-computer interaction. It is also a challenging task because of the large amount of video data and the high demands it places on computing hardware. An effective video representation needs to consider spatial and temporal cues simultaneously. In recent years, researchers have been working to develop general network architectures for video classification. 3D spatial-temporal convolutions can potentially learn complicated spatial-temporal dependencies, but the large number of parameters in 3D CNNs makes them hard to train in practice. Two-stream architectures decompose the video into motion and appearance streams, train a separate CNN for each stream, and fuse the outputs at the end. Among these successful network architectures, the two-stream convolutional network has had a great influence on academic research. It can model the appearance and motion information in videos, and has become an important basic network model in action recognition. Two-stream architectures essentially learn a classifier that operates on individual frames or short clips of a few frames, possibly enforcing consensus of classification scores over different segments of the video. At test time, 25 uniformly sampled frames are classified independently and the classification scores are averaged to obtain the final prediction. However, such architectures mainly focus on learning the spatial information of a single frame and the temporal information of a few frames, and thus fail to effectively model the long-term information in the whole video. To overcome this problem, we propose a spatial and temporal feature aggregation convolutional network model based on locality-constrained affine subspace coding (LASC). The LASC coding method has achieved excellent performance in image classification and image retrieval tasks.
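The test-time late-fusion protocol described above can be sketched as follows; this is a minimal illustration (the fusion weight `alpha` and the toy scores are assumptions, not values from the paper):

```python
import numpy as np

def two_stream_prediction(spatial_scores, temporal_scores, alpha=0.5):
    # spatial_scores, temporal_scores: arrays of shape (num_frames, num_classes),
    # holding per-frame class scores from the appearance and motion streams.
    # Average the scores over the sampled frames within each stream,
    # then fuse the two streams with a weighted sum.
    spatial_avg = spatial_scores.mean(axis=0)
    temporal_avg = temporal_scores.mean(axis=0)
    fused = alpha * spatial_avg + (1.0 - alpha) * temporal_avg
    return int(np.argmax(fused))

# Toy example: 25 uniformly sampled frames, 3 classes; the spatial stream
# votes for class 1 and the temporal stream for class 2, spatial vote stronger.
spatial = np.zeros((25, 3)); spatial[:, 1] = 1.0
temporal = np.zeros((25, 3)); temporal[:, 2] = 0.8
pred = two_stream_prediction(spatial, temporal)  # fused scores [0, 0.5, 0.4] -> class 1
```

Because each frame (or short clip) is scored independently before averaging, no interaction across distant frames is modeled, which is exactly the long-term limitation the proposed aggregation layer addresses.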
LASC leverages the semantic probabilities of local patches to learn the aggregation weights and construct the semantic affine subspace dictionary, which produces more semantic and discriminative global image representations. Inspired by the classical LASC coding method, we design a LASC-based structural layer that is inserted after the last convolutional layer of the spatial stream and the temporal stream to acquire more robust high-dimensional representations; the two fully-connected layers of the two-stream architecture are replaced entirely. The core of the network is a locality-constrained affine subspace coding layer, which can be embedded in the two-stream convolutional network to aggregate the spatial and temporal features of the whole video into a global spatio-temporal video representation. This layer consists of two sub-layers, one computing the weight coefficients and one performing affine subspace coding, and its parameters can be optimized jointly with the other parameters of the convolutional network for end-to-end learning. Besides, three regularizations of the cost function, i.e., soft orthogonality regularization, infinity-norm regularization and spectral-norm regularization, are further studied to ensure the orthogonality of the affine subspace bases during training. The proposed method is evaluated on the commonly used UCF101, HMDB51 and Something-V1 datasets, where its accuracy is 1.7%, 8.7% and 4.3% higher than that of the classical two-stream convolutional network, respectively. At the same time, it achieves superior or competitive performance in comparison with state-of-the-art methods. © 2020, Science Press. All rights reserved.
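As a rough illustration of the soft orthogonality regularization mentioned above (a sketch under common conventions, not necessarily the paper's exact formulation), one can penalize each subspace basis D with ||DᵀD − I||²_F, which vanishes exactly when the columns of D are orthonormal:

```python
import numpy as np

def soft_orthogonality_penalty(bases):
    # bases: list of (d, p) basis matrices, one per affine subspace.
    # Penalize the deviation of the Gram matrix D^T D from the identity;
    # the infinity-norm and spectral-norm variants studied in the paper
    # would replace the Frobenius norm with the corresponding matrix norm.
    total = 0.0
    for D in bases:
        G = D.T @ D                       # (p, p) Gram matrix
        R = G - np.eye(D.shape[1])        # deviation from orthonormality
        total += np.linalg.norm(R, 'fro') ** 2
    return total

# An orthonormal basis (e.g., from a QR decomposition) incurs ~zero penalty,
# so minimizing this term during training drives the bases toward orthogonality.
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((8, 3)))
penalty = soft_orthogonality_penalty([Q])
```

Adding such a differentiable penalty to the cost function lets the basis constraint be enforced with ordinary gradient descent, consistent with the end-to-end training described above.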