With the continuous evolution of remote sensing technology, the range of available data sources has expanded, and effectively exploiting complementary information from multiple sources for improved land surface observation has become an intriguing and challenging problem. However, the complexity of urban areas and their surrounding structures makes it extremely difficult to capture correlations between features. This article proposes a novel multiscale attention feature fusion network, composed of hierarchical convolutional neural networks and a transformer, to enhance the joint classification accuracy of hyperspectral image (HSI) and light detection and ranging (LiDAR) data. First, a multiscale fusion Swin transformer module is employed to mitigate information loss during feature propagation; it explores deep spatial-spectral features of the HSI while extracting height information from the LiDAR data. This structure inherits the advantages of the Swin transformer: it achieves nonlocal receptive field fusion by progressively expanding the window's receptive field layer by layer while preserving the spatial structure of the image, and it exhibits excellent robustness against spatial misalignment. Second, for the dual hyperspectral and LiDAR branches, a dual-source feature interactor is designed, which establishes a dynamic attention mechanism between the two branches to capture correlated information across the modalities and fuse it into a unified feature representation. The efficacy of the proposed approach is validated on three standard datasets (Houston2013, Trento, and MUUFL). The classification results indicate that, by fully utilizing spatial context information and effectively integrating feature information, the proposed framework significantly outperforms state-of-the-art classification methods.
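As a rough illustration of the hierarchical windowed-attention idea behind the multiscale fusion Swin transformer module, the PyTorch sketch below applies self-attention within local windows and then merges 2x2 patches, so each window at the next stage covers a larger image region. This is a minimal sketch of the general Swin-style mechanism, not the paper's actual implementation; every module name, dimension, and the single-stage structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    # Split a (B, H, W, C) feature map into non-overlapping ws x ws windows,
    # flattened to token sequences of shape (B * num_windows, ws * ws, C).
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

class WindowAttentionStage(nn.Module):
    # One hierarchical stage (hypothetical simplification): self-attention
    # inside local windows, a residual connection, then 2x2 patch merging so
    # the next stage's windows see a progressively larger receptive field.
    def __init__(self, dim, window_size=4, num_heads=4):
        super().__init__()
        self.ws = window_size
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.merge = nn.Linear(4 * dim, 2 * dim)  # halve H and W, double C

    def forward(self, x):                          # x: (B, H, W, C)
        B, H, W, C = x.shape
        win = window_partition(self.norm(x), self.ws)
        out, _ = self.attn(win, win, win)          # attention within each window
        out = out.view(B, H // self.ws, W // self.ws, self.ws, self.ws, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C) + x
        # Patch merging: concatenate each 2x2 neighborhood, project to 2*C.
        out = out.view(B, H // 2, 2, W // 2, 2, C).permute(0, 1, 3, 2, 4, 5)
        return self.merge(out.reshape(B, H // 2, W // 2, 4 * C))

# Toy usage: an 8x8 map of 32-dim tokens (e.g., embedded HSI patches).
x = torch.randn(2, 8, 8, 32)
y = WindowAttentionStage(dim=32)(x)
print(y.shape)  # torch.Size([2, 4, 4, 64])
```

Stacking such stages yields the hierarchy the abstract describes: attention stays local (and cheap) at each level, while patch merging lets later levels aggregate increasingly nonlocal context without discarding spatial layout.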
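Similarly, the dual-source feature interactor can be pictured as bidirectional cross-attention between the HSI and LiDAR token streams, followed by a projection that fuses the two enhanced streams into one representation. The sketch below is one plausible reading of that description under stated assumptions; all names, dimensions, and layer choices are hypothetical rather than the authors' design.

```python
import torch
import torch.nn as nn

class DualSourceFeatureInteractor(nn.Module):
    # Hypothetical interactor: each modality queries the other via
    # cross-attention, residual connections and LayerNorm stabilize the
    # exchange, and a linear layer fuses the two streams into one.
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.hsi_from_lidar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lidar_from_hsi = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_hsi = nn.LayerNorm(dim)
        self.norm_lidar = nn.LayerNorm(dim)
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())

    def forward(self, hsi_tokens, lidar_tokens):   # both: (B, N, dim)
        # HSI queries attend to LiDAR keys/values, and vice versa.
        hsi_enh, _ = self.hsi_from_lidar(hsi_tokens, lidar_tokens, lidar_tokens)
        lidar_enh, _ = self.lidar_from_hsi(lidar_tokens, hsi_tokens, hsi_tokens)
        hsi_out = self.norm_hsi(hsi_tokens + hsi_enh)
        lidar_out = self.norm_lidar(lidar_tokens + lidar_enh)
        # Concatenate and project into a unified feature representation.
        return self.fuse(torch.cat([hsi_out, lidar_out], dim=-1))

# Toy usage: a 7x7 patch per modality, flattened to 49 tokens of width 64.
hsi = torch.randn(8, 49, 64)    # spectral branch features
lidar = torch.randn(8, 49, 64)  # elevation branch features
print(DualSourceFeatureInteractor()(hsi, lidar).shape)  # torch.Size([8, 49, 64])
```

The attention weights here are recomputed for every input pair, which is one natural way to realize the "dynamic" attention the abstract mentions: the strength of cross-modal exchange adapts to how correlated the spectral and elevation features actually are at each location.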