Person identification using ear-based biometric systems has become popular in recent years owing to increasing demand from security and surveillance applications. When training data and computing resources are limited, run-time complexity plays an important role in such a system. With the continuous advancement of deep convolutional neural networks, deep learning-based biometric systems have made substantial progress on challenges that were previously unsolved or only partially solved. Although ear-based biometric systems achieve high accuracy with pre-trained deep learning models such as VGG16, VGG19, and Xception, training these models is cumbersome and time-consuming. Most ear recognition systems built on deep learning models such as VGG19, Xception, and ResNet101 also require a large amount of memory because of the models' large parameter counts, and they impose computational overhead on the system. One of the major challenges in ear recognition is identifying people with electronic devices over time and space. When developing electronic approaches to person identification, it is important to consider factors such as simplicity, cost-effectiveness, and portability. With these motivations, the authors developed three simple, lightweight CNN models and combined them in an ensemble to improve recognition accuracy. The work is validated on the IITD-II ear dataset, which contains only 793 sample images for training. To overcome the limited size of the dataset, the authors applied data augmentation, which produces a variety of images from different perspectives. Stacking the three CNN models yields an architecture that achieves a best accuracy of 98.74%, an improvement over each individual model. The proposed CNN models can also be ensembled with other pre-trained models such as VGG16, VGG19, ResNet, and Xception for a more effective solution.
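The stacking step can be illustrated with a minimal sketch. The abstract does not specify the meta-learner, so this sketch assumes a softmax-regression meta-learner trained on the concatenated class probabilities of three base models. The base CNNs are replaced by simulated outputs, and every name (`fake_probs`, `make_toy_data`, `train_meta`), the three-class toy setup, and the hyperparameters are hypothetical choices for illustration, not the authors' implementation.

```python
import math

N_CLASSES = 3  # toy value (assumption); the real task has one class per subject


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def fake_probs(predicted):
    """Stand-in for one CNN's softmax output: 0.8 on its predicted class."""
    return [0.8 if c == predicted else 0.1 for c in range(N_CLASSES)]


def stack(preds):
    """Concatenate the probability vectors of all base models into one feature vector."""
    return [p for pred in preds for p in fake_probs(pred)]


def make_toy_data(n=60):
    """Simulate three base models that make mistakes on different samples."""
    X, y = [], []
    for i in range(n):
        t = i % N_CLASSES
        wrong = (t + 1) % N_CLASSES
        a = wrong if i % 5 == 0 else t  # simulated model A errs on every 5th sample
        b = wrong if i % 4 == 1 else t  # simulated model B errs on a different subset
        c = wrong if i % 7 == 3 else t  # simulated model C errs on yet another subset
        X.append(stack([a, b, c]))
        y.append(t)
    return X, y


def logits(W, b, x):
    """Linear scores of the meta-learner for one stacked feature vector."""
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(N_CLASSES)]


def train_meta(X, y, lr=0.5, epochs=200):
    """Fit the meta-learner by full-batch gradient descent on cross-entropy."""
    n_feat = len(X[0])
    W = [[0.0] * n_feat for _ in range(N_CLASSES)]
    b = [0.0] * N_CLASSES
    for _ in range(epochs):
        gW = [[0.0] * n_feat for _ in range(N_CLASSES)]
        gb = [0.0] * N_CLASSES
        for x, t in zip(X, y):
            p = softmax(logits(W, b, x))
            for c in range(N_CLASSES):
                err = p[c] - (1.0 if c == t else 0.0)
                gb[c] += err
                for j in range(n_feat):
                    gW[c][j] += err * x[j]
        for c in range(N_CLASSES):
            b[c] -= lr * gb[c] / len(X)
            for j in range(n_feat):
                W[c][j] -= lr * gW[c][j] / len(X)
    return W, b


def predict(W, b, x):
    """Return the predicted class index and the meta-learner's class probabilities."""
    p = softmax(logits(W, b, x))
    return max(range(N_CLASSES), key=lambda c: p[c]), p


X, y = make_toy_data()
W, b = train_meta(X, y)
label, probs = predict(W, b, stack([2, 2, 2]))  # all three base models vote for class 2
print(label, [round(p, 3) for p in probs])
```

In a real pipeline, the stacked features would come from the softmax outputs of the three trained CNNs on held-out images, and the meta-learner could equally be a small dense layer; the sketch only shows how base-model outputs are combined.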