[Objective] To effectively utilize information containing audio and video and to fully capture the multi-modal interactions among text, image, and audio, this study proposes a multi-modal sentiment analysis model for online users (TIsA) that incorporates text, image, and STFT-CNN audio feature extraction. [Methods] First, we separated the video data into audio and image data. Then, we used BERT and BiLSTM to obtain text feature representations and applied the short-time Fourier transform (STFT) to convert audio time-domain signals to the frequency domain. We also utilized CNNs to extract audio and image features. Finally, we fused the features from the three modalities. [Results] We conducted empirical research using the "9.5" Luding Earthquake public sentiment data from Sina Weibo. The proposed TIsA model achieved an accuracy, macro-averaged recall, and macro-averaged F1 score of 96.10%, 96.20%, and 96.10%, respectively, outperforming related baseline models. [Limitations] This study did not explore in depth the effects of different fusion strategies on sentiment recognition results. [Conclusions] The proposed TIsA model demonstrates high accuracy in processing audio-containing videos, effectively supporting online public opinion analysis. © 2024 Chinese Academy of Sciences. All rights reserved.
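
To illustrate the STFT-CNN audio branch described in [Methods], the following is a minimal sketch, not the authors' implementation: it converts a time-domain waveform to a log-magnitude spectrogram via STFT and extracts an audio feature vector with a small CNN. The function names, network layout, and all hyperparameters (sample rate, window size, feature dimension) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of an STFT-CNN audio feature extractor (illustrative only;
# architecture and hyperparameters are assumptions, not the paper's values).
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import stft


def audio_to_spectrogram(waveform: np.ndarray, sample_rate: int = 16000,
                         nperseg: int = 512) -> np.ndarray:
    """STFT: time-domain signal -> log-magnitude frequency-domain matrix."""
    _, _, Z = stft(waveform, fs=sample_rate, nperseg=nperseg)
    return np.log1p(np.abs(Z)).astype(np.float32)  # (freq_bins, time_frames)


class AudioCNN(nn.Module):
    """Small CNN over the spectrogram; its output is the audio feature vector
    that would be fused with the text (BERT-BiLSTM) and image (CNN) features."""

    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Linear(32 * 4 * 4, feature_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(spec).flatten(1))


# Example: one second of dummy audio -> a 128-dimensional audio feature.
wave = np.random.randn(16000)
spec = torch.from_numpy(audio_to_spectrogram(wave)).unsqueeze(0).unsqueeze(0)
audio_feat = AudioCNN()(spec)  # shape: (1, 128)
```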