Knowing which factors are significant in credit rating assessments leads to better decision-making. However, the literature has so far focused mostly on structured data, and fewer studies have addressed unstructured or multimodal datasets. In this paper, we present an analysis of the most effective architectures for the fusion of deep learning models to predict company credit rating classes, using structured and unstructured datasets of different types. In these models, we tested various combinations of fusion strategies with selected deep learning models, including convolutional neural networks (CNNs), variants of recurrent neural networks (RNNs), and pre-trained language models (BERT). We study data fusion strategies in terms of level (including early and intermediate fusion) and technique (including concatenation and cross-attention). Our results show that a CNN-based multimodal model with a hybrid fusion strategy outperformed other multimodal techniques. In addition, by comparing simple architectures with more complex ones, we found that more sophisticated deep learning models do not necessarily produce the highest performance. Furthermore, we found that the text channel plays a more significant role than numeric data: the text channel achieved an AUC of 0.91, while the maximum AUC of the numeric channels was 0.808. Finally, a comparison of rating agencies on short-, medium-, and long-term performance shows that Moody's credit ratings outperform those of other agencies such as Standard & Poor's and Fitch Ratings.