Lung cancer is a malignant tumor that poses a serious threat to human health. Existing lung cancer diagnostic techniques suffer from high cost and slow turnaround, yet early and rapid diagnosis and treatment are essential to improving patient outcomes. In this study, a deep learning-based multi-modal spectral information fusion (MSIF) network is proposed for lung adenocarcinoma cell detection. First, multi-modal data consisting of Fourier transform infrared spectra, UV-vis absorbance spectra, and fluorescence spectra of normal and patient-derived cells were collected. The one-dimensional spectral (text) data are processed efficiently by a one-dimensional convolutional neural network, while a hybrid ResNet-Transformer model mines both the global and local features of the spectral images. An adaptive depth-wise convolution (ADConv) is introduced into feature extraction to overcome the shortcomings of conventional convolution. To enable feature learning across modalities, a cross-modal interaction fusion (CMIF) module is designed; it fuses the extracted spectral image and text features through multi-faceted interactions, making full use of the multi-modal features via feature sharing. The method demonstrated excellent performance on the test sets of Fourier transform infrared, UV-vis absorbance, and fluorescence spectra, achieving accuracies of 95.83 %, 97.92 %, and 100 %, respectively. In addition, experiments validated the advantage of multi-modal spectral data over single-modality inputs and the robustness of the model's generalization capability. This study not only provides strong
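The abstract does not specify the internals of ADConv. Purely as an illustrative sketch, one common way to make a depth-wise convolution "adaptive" is to re-weight its output channels with input-conditioned gates (squeeze-and-excitation style); the class name `AdaptiveDepthwiseConv`, the `reduction` parameter, and the gating design below are assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn

class AdaptiveDepthwiseConv(nn.Module):
    """Hypothetical sketch of an "adaptive" depth-wise convolution: a
    standard depth-wise conv whose output channels are re-weighted by
    gates computed from the input. This is one plausible realization,
    not the ADConv defined in the paper."""

    def __init__(self, channels: int, kernel_size: int = 3, reduction: int = 4):
        super().__init__()
        # Depth-wise convolution: one filter per input channel.
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        # Input-conditioned channel gates (squeeze-and-excitation style).
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.dw(x)
        return y * self.gate(x)  # per-channel re-weighting, broadcast over H, W

x = torch.randn(2, 64, 56, 56)
print(AdaptiveDepthwiseConv(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```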
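Similarly, the CMIF module is described only at a high level (multi-faceted interaction and feature sharing between the two modalities). A minimal sketch, assuming bidirectional cross-attention between the ResNet-Transformer image tokens and the 1D-CNN spectral features, might look like the following; `CMIFBlock` and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class CMIFBlock(nn.Module):
    """Illustrative cross-modal interaction fusion: each modality attends
    to the other via multi-head cross-attention, and the two streams are
    then merged by a shared projection (the "feature sharing" step)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Image tokens query text tokens, and vice versa.
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)  # shared fusion projection

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, dim), e.g. from the image branch
        # txt_feats: (B, N_txt, dim), e.g. from the 1D-CNN branch
        img_ctx, _ = self.img_to_txt(img_feats, txt_feats, txt_feats)
        txt_ctx, _ = self.txt_to_img(txt_feats, img_feats, img_feats)
        img_feats = self.norm_img(img_feats + img_ctx)  # residual update
        txt_feats = self.norm_txt(txt_feats + txt_ctx)
        # Pool each stream and concatenate into a joint representation.
        joint = torch.cat([img_feats.mean(dim=1), txt_feats.mean(dim=1)], dim=-1)
        return self.fuse(joint)  # (B, dim) fused multi-modal feature

# Example: fuse 49 image tokens with 32 spectral-sequence tokens.
fusion = CMIFBlock(dim=256, num_heads=4)
img = torch.randn(8, 49, 256)
txt = torch.randn(8, 32, 256)
print(fusion(img, txt).shape)  # torch.Size([8, 256])
```

The fused vector would then feed a classification head for the normal-versus-adenocarcinoma decision; the pooling and concatenation choices here are design assumptions rather than details taken from the paper.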