Convolutional Neural Networks (CNNs) are the most commonly used models for classifying speech segments represented as images; however, training a CNN-based classifier is usually time-consuming, and the choice of network topology is often uncertain. In this paper, a novel approach to viseme classification for automatic lip-reading is proposed. The main idea of the approach is to first map the original high-dimensional imagery data into a two-dimensional space using Supervised t-Distributed Stochastic Neighbour Embedding, and then to conduct classification in the low-dimensional space. The effectiveness of the proposed approach has been demonstrated by classifying visemes of three different frame widths, achieving average accuracies of 98.5%, 94.0%, and 82.1%, respectively. In comparison, CNN-based classifiers achieved average accuracies of 66.2%, 75.2%, and 84.4%, respectively. In addition, the new approach requires far less CPU time for training. The main contribution of this paper is the application of Supervised t-Distributed Stochastic Neighbour Embedding to feature extraction for viseme classification in lip-reading, together with a performance comparison against a spatio-temporal CNN and an analysis of how both approaches perform as viseme duration varies.
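To make the described pipeline concrete, the following is a minimal sketch of the embed-then-classify idea. It is not the paper's implementation: scikit-learn ships only the unsupervised TSNE, so supervision is approximated here by shrinking pairwise distances between same-class points before embedding; the function name supervised_tsne_embed, the shrinkage factor alpha, and the synthetic data shapes are all illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsClassifier

def supervised_tsne_embed(X, y, alpha=0.3, random_state=0):
    """Embed X into 2-D with a label-aware variant of t-SNE.

    Distances between same-class points are scaled down by `alpha`,
    which encourages class-coherent clusters in the low-dimensional
    embedding (a simple stand-in for a supervised t-SNE objective).
    """
    D = pairwise_distances(X)             # (n, n) Euclidean distance matrix
    same = (y[:, None] == y[None, :])     # mask of same-class pairs
    D = np.where(same, alpha * D, D)      # shrink within-class distances
    tsne = TSNE(n_components=2, metric="precomputed",
                init="random", random_state=random_state)
    return tsne.fit_transform(D)

# Illustrative usage on synthetic data standing in for flattened
# mouth-region image frames (e.g. 64x64 pixels -> 4096 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4096))         # hypothetical viseme frames
y = rng.integers(0, 10, size=300)        # hypothetical viseme labels

Z = supervised_tsne_embed(X, y)          # map to 2-D
clf = KNeighborsClassifier(n_neighbors=5).fit(Z, y)  # classify in 2-D
```

One caveat worth noting: standard t-SNE has no out-of-sample transform, so a practical pipeline must either embed training and test points jointly or use a parametric variant; the simple classifier-in-2-D step above illustrates the structure of the approach rather than a deployable system.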