Speech Emotion Recognition (SER) identifies human emotion from short speech signals, enabling natural Human-Computer Interaction (HCI). Accurate emotion prediction is essential for real-time interactive applications, where inaccurate predictions can create frustrating user experiences. In addition to linguistic cues, human speech carries numerous hidden features, such as cepstral, prosodic, and spectrogram features, that help determine emotion. Relying on only one set of features, whether cepstral, prosodic, or spectrogram, does not classify emotion accurately. Moreover, the machine learning models and artificial neural network architectures used in existing SER work can individually handle either temporal cues or spatial cues, but not both. This research proposes a Neural Network-based Blended Ensemble Learning (NNBEL) model, which stacks the predictions of individual neural networks that together cover both temporal and spatial cues. The proposed model ensembles state-of-the-art neural network architectures, namely a 1-Dimensional Convolutional Neural Network (1D-CNN), Long Short-Term Memory (LSTM), and CapsuleNets. The first two architectures are especially suitable for time-series data such as speech, while CapsuleNets are well suited to capturing spatial speech cues. The log mel-spectrogram is fed to the LSTM, and Mel-Frequency Cepstral Coefficients (MFCCs) are fed to the 1D-CNN and CapsuleNets. The emotions predicted by each of these networks are then fed to a Multi-Layer Perceptron (MLP) at the next level to predict the final emotion. The rationale for blended ensemble learning is that the first layer of NNBEL, comprising the 1D-CNN, LSTM, and CapsuleNets, performs coarse classification, while the second layer, the MLP meta-classifier, fine-tunes the classification. The NNBEL model and the individual base models are evaluated on the RAVDESS and IEMOCAP datasets.
The proposed model achieves a classification accuracy of 95.3% on RAVDESS and 94% on IEMOCAP, outperforming both the base models and existing models in the literature. The confusion matrices also show a clear improvement in distinguishing between emotions.
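The two-level stacking scheme described above can be sketched as follows. This is a minimal NumPy illustration of the data flow only, not the authors' implementation: random weights stand in for the trained base learners and the MLP meta-classifier, and four emotion classes and eight utterances are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SAMPLES, N_CLASSES = 8, 4  # hypothetical: 8 utterances, 4 emotion classes

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Stand-ins for the three base learners' class-probability outputs
# (in the paper: 1D-CNN and CapsuleNets on MFCCs, LSTM on log mel-spectrograms).
p_cnn = softmax(rng.normal(size=(N_SAMPLES, N_CLASSES)))
p_lstm = softmax(rng.normal(size=(N_SAMPLES, N_CLASSES)))
p_caps = softmax(rng.normal(size=(N_SAMPLES, N_CLASSES)))

# Level 1 (coarse classification): the base predictions are stacked
# into a meta-feature vector per utterance.
meta_features = np.concatenate([p_cnn, p_lstm, p_caps], axis=1)  # (8, 12)

# Level 2 (fine-tuning): a small MLP meta-classifier maps the stacked
# predictions to the final emotion (random weights stand in for training).
W1 = rng.normal(size=(3 * N_CLASSES, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, N_CLASSES)); b2 = np.zeros(N_CLASSES)
hidden = np.maximum(meta_features @ W1 + b1, 0.0)  # ReLU hidden layer
final_probs = softmax(hidden @ W2 + b2)
final_emotion = final_probs.argmax(axis=1)
```

In a trained system, the base networks would first be fit on the speech features and the MLP would then be fit on their held-out predictions; the sketch shows only how the level-1 outputs become the level-2 inputs.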