Deep learning models have recently come to dominate hyperspectral image (HSI) classification, and deep learning itself is undergoing a paradigm shift with the rise of transformer-based foundation models. In this study, the potential of transformer-based foundation models, including vision foundation models (VFMs) and language foundation models (LFMs), for HSI classification is investigated. First, to improve performance on the traditional HSI classification task, a spectral-spatial VFM-based transformer (SS-VFMT) is proposed, which injects spectral-spatial information into a pretrained foundation transformer. Specifically, the pretrained transformer receives HSI patch tokens and performs long-range feature extraction, benefiting from its prelearned weights, while two enhancement modules, the spatial and spectral enhancement modules (SpaEMs and SpeEMs), exploit spatial and spectral information to steer the transformer's behavior. In addition, a patch-relationship distillation strategy is designed for SS-VFMT to better exploit the pretrained knowledge, yielding the proposed SS-VFMT-D. Second, building on SS-VFMT, a spectral-spatial vision-language foundation model-based transformer (SS-VLFMT) is proposed to address a new HSI classification task, generalized zero-shot classification. This task requires recognizing novel classes unseen during training and is particularly meaningful because the real world is usually open. SS-VLFMT leverages SS-VFMT to extract spectral-spatial features and corresponding hash codes while integrating a pretrained language model to extract text features from class names. Experimental results on HSI datasets show that the proposed methods are competitive with state-of-the-art methods. Moreover, foundation model-based methods open a new window for HSI classification tasks, especially HSI zero-shot classification.
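One way to read the enhancement-module idea described above is as lightweight adapters added alongside frozen pretrained blocks: the pretrained weights stay fixed, and small trainable modules inject spectral and spatial cues additively. The sketch below is a minimal illustration of that general pattern under stated assumptions, not the paper's actual architecture; the names `Adapter`, `enhanced_block`, and the additive-correction mechanism are hypothetical stand-ins for SpaEM/SpeEM, and plain NumPy matrices stand in for a real transformer block.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8   # token embedding dimension (illustrative)
N = 16  # number of HSI patch tokens (illustrative)

# "Pretrained" block weights: kept frozen; only adapters would train.
W_frozen = rng.standard_normal((D, D)) / np.sqrt(D)

class Adapter:
    """Hypothetical lightweight module standing in for SpaEM/SpeEM:
    a small ReLU bottleneck whose output is added to the frozen
    block's output, steering it with spatial/spectral information."""
    def __init__(self, dim, bottleneck=2):
        # Small init so adapters start as a near-identity perturbation.
        self.down = rng.standard_normal((dim, bottleneck)) * 0.01
        self.up = rng.standard_normal((bottleneck, dim)) * 0.01

    def __call__(self, x):
        return np.maximum(x @ self.down, 0.0) @ self.up

spatial_adapter = Adapter(D)
spectral_adapter = Adapter(D)

def enhanced_block(tokens, spatial_cue, spectral_cue):
    """Frozen transformer block plus additive adapter corrections."""
    base = tokens @ W_frozen
    return base + spatial_adapter(spatial_cue) + spectral_adapter(spectral_cue)

tokens = rng.standard_normal((N, D))
out = enhanced_block(tokens, tokens, tokens)
print(out.shape)  # (16, 8): same token grid, steered features
```

The design point this sketch tries to convey is that the frozen pathway preserves the foundation model's prelearned representations, while the adapters contribute a small, trainable correction carrying the spectral-spatial information the pretrained model never saw.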