Video-based human action recognition is an important task in computer vision, with wide applications in video surveillance, virtual reality and human-computer interaction. It is also a challenging task because of the large amount of video data and the high demands it places on computing hardware. An effective video representation needs to consider spatial and temporal cues simultaneously. In recent years, researchers have been working to develop general network architectures for video classification. 3D spatial-temporal convolutions can potentially learn complicated spatial-temporal dependencies, but the large number of parameters in 3D CNNs makes them hard to train in practice. Two-stream architectures decompose the video into motion and appearance streams, train a separate CNN for each stream, and fuse the outputs at the end. Among these successful network architectures, the two-stream convolutional network has had a great influence on academic research. It can model the appearance and motion information in videos, and has become an important basic network model in action recognition. Two-stream architectures essentially learn a classifier that operates on individual frames or short clips of a few frames, possibly enforcing consensus of classification scores over different segments of the video. At test time, 25 uniformly sampled frames are classified independently and the classification scores are averaged to obtain the final prediction. However, such architectures mainly focus on learning the spatial information of a single frame and the temporal information of a few frames, and thus fail to effectively model the long-term information in the whole video. To overcome this problem, we propose a spatial and temporal feature aggregation convolutional network model based on locality-constrained affine subspace coding (LASC). The LASC coding method has achieved excellent performance in image classification and image retrieval tasks.
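The test-time late-fusion protocol described above can be sketched as follows; this is a minimal illustration (the fusion weight `alpha` and the toy scores are assumptions, not values from the paper):

```python
import numpy as np

def two_stream_prediction(spatial_scores, temporal_scores, alpha=0.5):
    # spatial_scores, temporal_scores: arrays of shape (num_frames, num_classes),
    # holding per-frame class scores from the appearance and motion streams.
    # Average the scores over the sampled frames within each stream,
    # then fuse the two streams with a weighted sum.
    spatial_avg = spatial_scores.mean(axis=0)
    temporal_avg = temporal_scores.mean(axis=0)
    fused = alpha * spatial_avg + (1.0 - alpha) * temporal_avg
    return int(np.argmax(fused))

# Toy example: 25 uniformly sampled frames, 3 classes; the spatial stream
# votes for class 1 and the temporal stream for class 2, spatial vote stronger.
spatial = np.zeros((25, 3)); spatial[:, 1] = 1.0
temporal = np.zeros((25, 3)); temporal[:, 2] = 0.8
pred = two_stream_prediction(spatial, temporal)  # fused scores [0, 0.5, 0.4] -> class 1
```

Because each frame (or short clip) is scored independently before averaging, no interaction across distant frames is modeled, which is exactly the long-term limitation the proposed aggregation layer addresses.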
LASC leverages the semantic probabilities of local patches to learn the aggregation weights and construct the semantic affine subspace dictionary, which produces more semantic and discriminative global image representations. Inspired by the classical LASC coding method, we design a LASC-based structural layer that is inserted after the last convolutional layer of the spatial stream and the temporal stream to acquire more robust high-dimensional representations; the two fully-connected layers of the two-stream architecture are replaced entirely. The core of the network is a locality-constrained affine subspace coding layer, which can be embedded in the two-stream convolutional network to aggregate the spatial and temporal features of the whole video into a global spatio-temporal video representation. This layer consists of two sub-layers, one computing the weight coefficients and one performing affine subspace coding, and its parameters can be optimized jointly with the other parameters of the convolutional network for end-to-end learning. Besides, three regularizations of the cost function, i.e., soft orthogonality regularization, infinity-norm regularization and spectral-norm regularization, are further studied to ensure the orthogonality of the affine subspace bases during training. The proposed method is evaluated on the commonly used UCF101, HMDB51 and Something-V1 datasets, where its accuracy is 1.7%, 8.7% and 4.3% higher than that of the classical two-stream convolutional network, respectively. At the same time, it achieves superior or competitive performance in comparison with state-of-the-art methods. © 2020, Science Press. All rights reserved.
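As a rough illustration of the soft orthogonality regularization mentioned above (a sketch under common conventions, not necessarily the paper's exact formulation), one can penalize each subspace basis D with ||DᵀD − I||²_F, which vanishes exactly when the columns of D are orthonormal:

```python
import numpy as np

def soft_orthogonality_penalty(bases):
    # bases: list of (d, p) basis matrices, one per affine subspace.
    # Penalize the deviation of the Gram matrix D^T D from the identity;
    # the infinity-norm and spectral-norm variants studied in the paper
    # would replace the Frobenius norm with the corresponding matrix norm.
    total = 0.0
    for D in bases:
        G = D.T @ D                       # (p, p) Gram matrix
        R = G - np.eye(D.shape[1])        # deviation from orthonormality
        total += np.linalg.norm(R, 'fro') ** 2
    return total

# An orthonormal basis (e.g., from a QR decomposition) incurs ~zero penalty,
# so minimizing this term during training drives the bases toward orthogonality.
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((8, 3)))
penalty = soft_orthogonality_penalty([Q])
```

Adding such a differentiable penalty to the cost function lets the basis constraint be enforced with ordinary gradient descent, consistent with the end-to-end training described above.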