Glaucoma, a major cause of irreversible blindness globally, often progresses without early symptoms, making prompt and precise detection vital. This paper introduces a multi-modal glaucoma detection system that combines advanced deep learning architectures to analyze retinal images and clinical biomarkers. We developed three hybrid models: the first blends a Vision Transformer (ViT) with a Convolutional Neural Network (CNN), specifically a Residual Network (ResNet), for comprehensive feature extraction; the second pairs the Vision Transformer for Open-World Localization (OWL-ViT) with a Residual Network for enhanced global contextual insight; and the third combines a Hierarchical Vision Transformer using Shifted Windows (Swin Transformer) with a Residual Network and delivered the best performance. The complementary strengths of these models (broad contextual capture by the ViT, localized detail extraction by the CNN, and refined multi-scale granularity by the Swin Transformer) improve both feature representation and computational efficiency, making them well suited for clinical use. The best-optimized system, built on the Swin Transformer hybrid, achieved an F1-score of 0.993 for glaucoma and 0.995 for non-glaucoma, with an overall accuracy of 99.4% on a dataset of 2874 new cases, of which 2857 were correctly classified, confirming its efficacy for early-stage glaucoma detection and marking a significant advance over existing methods.
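
To make the hybrid design concrete, the following is a minimal PyTorch sketch of a Swin Transformer plus ResNet feature-fusion classifier of the kind described above; the specific backbones (swin_tiny_patch4_window7_224 via timm, ResNet-50 via torchvision), the concatenation-based fusion, and the classifier head are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import timm
from torchvision import models


class SwinResNetHybrid(nn.Module):
    """Illustrative Swin Transformer + ResNet hybrid for glaucoma vs. non-glaucoma.

    Fuses global context features from a Swin Transformer backbone with
    local detail features from a ResNet backbone. Backbone choices and the
    fusion strategy are assumptions for demonstration purposes only.
    """

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Swin backbone: num_classes=0 makes timm return pooled features.
        self.swin = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=True, num_classes=0
        )
        # ResNet backbone: drop the final fully connected layer, keep avgpool.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.resnet = nn.Sequential(*list(resnet.children())[:-1])

        swin_dim = self.swin.num_features  # 768 for swin_tiny
        resnet_dim = 2048                  # pooled feature size of ResNet-50
        self.classifier = nn.Sequential(
            nn.Linear(swin_dim + resnet_dim, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        global_feats = self.swin(x)              # (B, swin_dim) global context
        local_feats = self.resnet(x).flatten(1)  # (B, 2048) local detail
        fused = torch.cat([global_feats, local_feats], dim=1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = SwinResNetHybrid()
    logits = model(torch.randn(1, 3, 224, 224))  # one 224x224 RGB fundus image
    print(logits.shape)                          # torch.Size([1, 2])
```

In this sketch the two backbones process the same fundus image in parallel and their pooled feature vectors are concatenated before classification; other fusion schemes (e.g., attention-based weighting) are equally plausible and the paper's body should be consulted for the actual design.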