Agriculture is the bedrock of India's economy: it contributes substantially to national development and sustains the majority of the population. Plant stress, and biotic stress in particular, threatens agricultural sustainability by causing substantial losses in crop production. Biotic stress is caused by living organisms such as bacteria, fungi, and viruses, which damage plant tissues and weaken overall plant health; controlling it is therefore pivotal to agricultural sustainability. This study proposes a novel deep learning approach to the early detection of plant biotic stress. A hybrid Deep Convolutional Neural Network (DCNN), termed "DCNN-MCViT", is developed; it employs a multi-scale vision transformer with cross-attention to detect and classify plant diseases efficiently. The approach departs from traditional Convolutional Neural Networks (CNNs) by leveraging Vision Transformers, a recent development in computer vision that has demonstrated superior performance in image classification tasks. In evaluation, the DCNN-MCViT model outperforms other state-of-the-art techniques, achieving an average accuracy of 99.51% in stress classification and 99.78% accuracy on the PlantVillage dataset. The model also achieves 99.82% accuracy in estimating the degree of severity and classifying various forms of plant biotic stress. These findings underscore the potential of DCNN-MCViT to improve agricultural sustainability through early detection of, and intervention against, plant biotic stress, and they point to future applications in plant health monitoring and disease management.
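The abstract names cross-attention between scales but does not spell out the operation. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention, in which a single summary token from one scale attends over the patch tokens of another scale. It is not the DCNN-MCViT implementation: the function names, the toy 2-dimensional tokens, and the omission of learned query, key, and value projections are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query vector over key/value tokens.

    Assumption: no learned projections are applied; a real transformer
    would first map tokens through trained weight matrices.
    """
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(d)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    # Weighted sum of the value tokens, one output dimension at a time
    dim_v = len(values[0])
    fused = [
        sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)
    ]
    return fused, weights


# Hypothetical toy tokens: a summary token from a fine-scale branch
# attends over three patch tokens from a coarse-scale branch.
cls_small = [1.0, 0.0]
patches_large = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
fused, weights = cross_attention(cls_small, patches_large, patches_large)
```

In this toy setup the attention weights sum to one, and the patch token most aligned with the query receives the largest weight, which is the mechanism that lets one scale selectively pull information from another.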