The fusion of hyperspectral image (HSI) and light detection and ranging (LiDAR) data offers a powerful approach for land cover classification; however, challenges remain in effectively integrating their complementary information. Existing methods often overlook the importance of spatial information and fail to fully exploit the synergy between HSI and LiDAR data. To address these limitations, this paper proposes M2SSCENet, a multi-branch multi-scale joint learning and spatial-spectral cross-enhancement network. M2SSCENet employs a three-branch architecture that extracts HSI spectral features, HSI spatial features, and LiDAR features, respectively. For cross-modal fusion, two novel modules are introduced: a cross-modality bilateral attention feature fusion module that enhances the interaction between HSI spectral features and LiDAR features, and a spatial attention-guided cross-modality fusion module that dynamically adjusts spatial attention to capture key elevation information. Additionally, a pixel distance-based proximal feature selection module is proposed to strengthen spatial feature representation by emphasizing neighboring pixels with higher contributions. Experimental results on the Trento and Houston2013 datasets demonstrate the superiority of M2SSCENet, which achieves overall accuracies (OA) of 98.44% and 94.33%, respectively. Compared with the second-best method on each dataset, M2SSCENet improves classification accuracy by 0.27% on the Trento dataset and by 2.03% on the Houston2013 dataset. Notably, for categories with similar spectral distributions but significant elevation differences, such as "Highway" and "Parking Lot 1," the proposed method achieves accuracy improvements of 2.18% and 4.75%, respectively. These results highlight the effectiveness of M2SSCENet in leveraging the complementary strengths of HSI and LiDAR data for improved land cover classification.
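The three-branch fusion idea can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the feature dimensions, the softmax-based cross-weighting, and the final concatenation are all illustrative assumptions standing in for the bilateral attention and fusion modules described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-pixel feature vectors from the three branches
# (the 64-dim size is illustrative, not taken from the paper).
hsi_spectral = rng.standard_normal(64)
hsi_spatial = rng.standard_normal(64)
lidar_feat = rng.standard_normal(64)

def softmax(x):
    # Numerically stable softmax over a 1-D feature vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def bilateral_attention_fusion(a, b):
    """Toy stand-in for cross-modality bilateral attention:
    each modality's features are re-weighted by attention
    derived from the partner modality, then summed."""
    attn_on_a = softmax(b)   # b decides which entries of a matter
    attn_on_b = softmax(a)   # a decides which entries of b matter
    return attn_on_a * a + attn_on_b * b

# Fuse the spectral and LiDAR branches, then concatenate the
# spatial branch, mirroring the three-branch design.
fused_spec_lidar = bilateral_attention_fusion(hsi_spectral, lidar_feat)
joint = np.concatenate([fused_spec_lidar, hsi_spatial])
print(joint.shape)  # (128,)
```

In the actual network, the fused representation would feed a classifier head; the sketch only shows how features from one modality can gate the other before the branches are merged.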