With the rapid progress of artificial intelligence technology, machines need to recognize users' emotions to provide a better human-computer interaction experience; emotion recognition has therefore become one of the most active fields of artificial intelligence. Traditional emotion recognition is mostly based on the text modality alone. Compared with a single modality, multi-modal emotion recognition offers data complementarity and greater model robustness. In multi-modal emotion recognition, feature fusion between modalities largely determines recognition performance. Recently, graph-based inter-modality fusion, which uses graphs of binary relations between two modalities, has attracted considerable research attention. However, when processing data from three or more modalities, ordinary graphs can hardly model feature fusion among all modalities without introducing redundant information, which limits the performance of multi-modal emotion recognition. A more effective method for modeling and fusing multi-modal emotion features is therefore needed. To solve this problem, this paper proposes an emotion recognition model, Multi-modal Emotion Recognition Based on Hypergraph (MORAH), which introduces hypergraphs to establish multivariate relations among multi-modal data instead of binary relations, achieving efficient multi-modal feature fusion. Specifically, the model divides multi-modal feature fusion into two stages: hyperedge construction and hypergraph learning. In the hyperedge construction stage, we aggregate the information of each time step in a sequence through a capsule network and build a graph for each single modality; we then apply graph convolution for a second round of aggregation, which serves as the basis for constructing the hypergraph in the next stage. Benefiting from this graph-capsule aggregation, the model can handle both aligned and unaligned data, without manual alignment of the unaligned data. In the hypergraph learning stage, we establish associations not only between nodes of different modalities within the same sample but also among all modalities of the same sample. At the same time, we use hierarchical multi-level hyperedges to avoid over-smoothing of node embeddings, and a simplified hypergraph convolution to fuse high-level features across modalities, ensuring that node features are updated only when necessary during hypergraph convolution. The simplified hypergraph convolution dispenses with nonlinear activations and convolution filter matrices, preserving recognition accuracy while improving training speed. Comprehensive experiments on two benchmark datasets show that the proposed model makes full use of the multivariate relations among multi-modal data through the hypergraph. Compared with existing state-of-the-art methods, MORAH improves binary accuracy by 1.3% and F1-score by 1.1% on the unaligned data of the CMU-MOSI dataset, and improves both binary accuracy and F1-score by 0.2% on the unaligned data of the CMU-MOSEI dataset. To demonstrate the generality of the hypergraph learning stage across multimodal tasks, we also apply the hierarchical multi-level hyperedges to emotion recognition in conversation (ERC). The experimental results indicate that MORAH improves ERC performance to a certain extent, suggesting that MORAH can serve as a general tool to assist downstream natural language processing tasks.
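The abstract describes the hierarchical multi-level hyperedges only at a high level. As an illustrative sketch (not the paper's actual implementation), one way to encode them is an incidence matrix with one node per (sample, modality) pair, pairwise hyperedges between modalities of the same sample, and one sample-level hyperedge spanning all of its modalities; the node layout and all names below are our assumptions.

```python
import itertools
import numpy as np

def build_multilevel_incidence(n_samples, n_modalities):
    """Incidence matrix H for hierarchical multi-level hyperedges (sketch).

    Node layout (our assumption): node index = sample * n_modalities + modality.
    Level 1: one hyperedge per pair of modalities within each sample.
    Level 2: one hyperedge spanning all modalities of each sample.
    """
    n_nodes = n_samples * n_modalities
    edges = []
    for s in range(n_samples):
        nodes = [s * n_modalities + m for m in range(n_modalities)]
        # Level 1: pairwise cross-modal hyperedges within the sample.
        edges.extend(list(pair) for pair in itertools.combinations(nodes, 2))
        # Level 2: one hyperedge connecting all modalities of the sample.
        edges.append(nodes)
    H = np.zeros((n_nodes, len(edges)))
    for e, members in enumerate(edges):
        H[members, e] = 1.0  # node members[i] is incident to hyperedge e
    return H

# 3 samples x 3 modalities -> 9 nodes; 3 pairwise + 1 sample-level hyperedge per sample.
H = build_multilevel_incidence(n_samples=3, n_modalities=3)
print(H.shape)  # (9, 12)
```

Keeping the pairwise and sample-level hyperedges as separate columns makes the hierarchy explicit: convolution can propagate along fine-grained cross-modal relations and coarser sample-level relations within the same operator.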
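The abstract also states that the hypergraph convolution drops nonlinear activations and convolution filter matrices. Below is a minimal sketch of such a simplified operator, assuming the standard normalized incidence propagation Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} used by hypergraph neural networks (the paper's exact operator may differ); it reuses build_multilevel_incidence from the sketch above.

```python
import numpy as np

def simplified_hypergraph_conv(X, H, k=2):
    """k-step hypergraph propagation with no activation or filter matrices (sketch).

    X: (n_nodes, d) node features; H: (n_nodes, n_edges) incidence matrix.
    Assumes every node lies in at least one hyperedge (true for the H above).
    """
    W = np.ones(H.shape[1])            # uniform hyperedge weights (an assumption)
    Dv = H @ W                         # node degrees
    De = H.sum(axis=0)                 # hyperedge degrees
    # Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}
    Theta = (np.diag(Dv ** -0.5) @ H @ np.diag(W) @ np.diag(1.0 / De)
             @ H.T @ np.diag(Dv ** -0.5))
    for _ in range(k):                 # linear propagation only: no nonlinearity,
        X = Theta @ X                  # so the k steps compose into one linear map
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 4))            # 9 nodes (3 samples x 3 modalities), 4-dim features
Z = simplified_hypergraph_conv(X, build_multilevel_incidence(3, 3), k=2)
print(Z.shape)                         # (9, 4)
```

Because no nonlinearity separates the k propagation steps, they collapse into a single linear map over the node features, which is consistent with the training-speed gain claimed above.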
© 2023 Science Press. All rights reserved.