In real-world network scenarios, modality absence can be caused by various factors, such as sensor damage, data corruption, and human errors during recording. Effectively integrating multimodal data with missing modalities therefore remains a significant challenge. Different combinations of missing modalities produce feature sets with inconsistent dimensions and quantities. Moreover, effective multimodal fusion requires a thorough understanding of both modality-specific information and intermodal interactions. The prevalence of missing data can sharply reduce the usable sample set, so interaction features must be learned from only a few samples. In addition, there is no clear correspondence between heterogeneous data from different sources.

To address these issues, we focus on multimodal knowledge graph scenarios with heterogeneous structures and content and develop a new knowledge graph embedding method. First, we use three embedding components to automatically extract feature vector representations of items from the structural, textual, and visual content of the knowledge graph. Then, we divide the dataset into several modal groups and model them with a multilayer network structure, where each multilayer network corresponds to a specific combination of modalities. Subsequently, we construct the corresponding multilayer network projection layers and propose a two-stage GAT-based transfer learning framework for these projection layers, in which the extracted incomplete multimodal information and intermodal interaction information are integrated and mapped into a low-dimensional space. Finally, we not only theoretically prove the feasibility of the proposed method but also validate its effectiveness through extensive comparative experiments on multiple datasets.
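To make the projection idea concrete, the following is a minimal sketch (not the authors' implementation) of a GAT-based projection for one modal group and a simple two-stage transfer between a complete-modality group and an incomplete one. The modality dimensions, the choice of what is transferred and what is fine-tuned in each stage, and all class and parameter names are illustrative assumptions.

```python
# Illustrative sketch: per-group GAT projection into a shared low-dimensional space,
# with a two-stage transfer between modal groups. Assumed design, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGATLayer(nn.Module):
    """Single-head graph attention layer over a dense adjacency matrix."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        # x: (N, in_dim); adj: (N, N) binary adjacency including self-loops
        h = self.W(x)                                      # (N, out_dim)
        N = h.size(0)
        h_i = h.unsqueeze(1).expand(N, N, -1)              # h_i[i, j] = h[i]
        h_j = h.unsqueeze(0).expand(N, N, -1)              # h_j[i, j] = h[j]
        e = F.leaky_relu(self.a(torch.cat([h_i, h_j], dim=-1)).squeeze(-1), 0.2)
        e = e.masked_fill(adj == 0, float("-inf"))         # attend only along edges
        alpha = torch.softmax(e, dim=-1)                   # attention coefficients
        return F.elu(alpha @ h)                            # aggregated node features


class ModalGroupProjector(nn.Module):
    """Projects one modal group (a fixed combination of available modalities)
    into a shared low-dimensional space via a GAT over the group's item graph."""
    def __init__(self, modal_dims, hidden_dim=64, out_dim=32):
        super().__init__()
        self.gat = SimpleGATLayer(sum(modal_dims), hidden_dim)
        self.proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, modal_feats, adj):
        x = torch.cat(modal_feats, dim=-1)                 # fuse available modalities
        return self.proj(self.gat(x, adj))


# Stage 1 (assumed): pre-train on the group where all modalities are present
# (e.g., structural 128-d, textual 300-d, visual 512-d features).
full = ModalGroupProjector(modal_dims=[128, 300, 512])

# Stage 2 (assumed): reuse the projection head, which defines the shared output
# space, for an incomplete group (e.g., no visual modality), and fine-tune only
# the group-specific GAT on that group's smaller sample set.
partial = ModalGroupProjector(modal_dims=[128, 300])
partial.proj.load_state_dict(full.proj.state_dict())
for p in partial.proj.parameters():
    p.requires_grad = False  # freeze the transferred head; train the GAT only
```

In this sketch the group-specific GAT absorbs the inconsistent input dimensions of different modality combinations, while the shared projection head keeps all groups aligned in one low-dimensional space; this mirrors the role of the projection layers described above, under the stated assumptions.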