The convolutional neural network (CNN) is one of the best-known deep learning models and is widely used in computer vision tasks. A CNN models the feedforward connectivity of the visual cortex; however, the biological visual system also contains a large number of recurrent connections and a distinctive attention mechanism. Inspired by this fact, we take the most important structural and functional characteristics of the visual cortex and propose the hierarchical attention recurrent CNN (HARCNN) to model processing in the ventral visual pathway. Its four blocks are mapped to the ventral visual regions V1, V2, V4, and IT. Multi-scale feature fusion is used in the V1 block to enlarge the receptive field, and recurrent circuits in the V2, V4, and IT blocks are employed to explore neural dynamics. An attention module enables the model to concentrate on important information. HARCNN is evaluated by Top-1 accuracy on three image classification benchmarks: CIFAR-10, CIFAR-100, and MNIST. The experimental results indicate that HARCNN is effective for modeling visual information processing and that its performance is comparable to that of current deep convolutional networks.
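The block layout described above can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the class names, kernel sizes (3, 5, 7), channel widths, number of recurrent steps, and the squeeze-and-excitation style channel gate standing in for the attention module are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import torch
import torch.nn as nn


class MultiScaleV1(nn.Module):
    """V1-like block: parallel convolutions at several scales, fused by a 1x1 conv."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Kernel sizes are an illustrative assumption, not taken from the paper.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (3, 5, 7)]
        )
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        feats = [torch.relu(branch(x)) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))


class ChannelAttention(nn.Module):
    """Channel gate in the squeeze-and-excitation style (assumed stand-in)."""

    def __init__(self, ch, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)  # reweight channels by values in (0, 1)


class RecurrentBlock(nn.Module):
    """V2/V4/IT-like block: feedforward conv plus a lateral conv unrolled in time."""

    def __init__(self, in_ch, out_ch, steps=3):
        super().__init__()
        self.steps = steps  # number of unrolled recurrent steps (assumed)
        self.feedforward = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
        self.lateral = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.norm = nn.BatchNorm2d(out_ch)
        self.attention = ChannelAttention(out_ch)

    def forward(self, x):
        drive = self.feedforward(x)  # bottom-up input, held fixed over time
        state = torch.relu(drive)
        for _ in range(self.steps):
            # Each step combines the bottom-up drive with the previous state.
            state = torch.relu(self.norm(drive + self.lateral(state)))
        return self.attention(state)


class HARCNNSketch(nn.Module):
    """Four blocks named after V1, V2, V4, and IT, followed by a linear classifier."""

    def __init__(self, in_ch=3, num_classes=10, widths=(16, 32, 64, 128)):
        super().__init__()
        self.v1 = MultiScaleV1(in_ch, widths[0])
        self.v2 = RecurrentBlock(widths[0], widths[1])
        self.v4 = RecurrentBlock(widths[1], widths[2])
        self.it = RecurrentBlock(widths[2], widths[3])
        self.head = nn.Linear(widths[3], num_classes)

    def forward(self, x):
        x = self.it(self.v4(self.v2(self.v1(x))))
        return self.head(x.mean(dim=(2, 3)))  # global average pooling, then logits
```

Under these assumptions, `HARCNNSketch()` maps a batch of CIFAR-sized inputs of shape `(N, 3, 32, 32)` to logits of shape `(N, 10)`, and `HARCNNSketch(in_ch=1)` accepts MNIST-sized inputs of shape `(N, 1, 28, 28)`. Unrolling the lateral convolution for a fixed number of steps is one common way to express recurrence in a feedforward framework; the paper may use a different scheme.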