Machine learning-based glare prediction has greatly improved the efficiency of performance feedback, but limited generalizability and the absence of intuitive predictive indicators have constrained its practical application. In response, this study proposes a prediction model for luminance distribution images based on a multimodal learning approach. The model focuses on objects within the field of view, integrating spatial and material features through images, and employs semantic feature mapping together with multimodal data integration to represent building information flexibly, so that changes in the design scenario no longer limit the model's validity. The study further proposes a multimodal Generative Adversarial Network tailored to these inputs, equipped with dedicated feature fusion and reinforcement blocks and advanced up-sampling techniques to distill pertinent information from the inputs efficiently. The model's efficacy is verified on cases of residential building luminance distribution, achieving a 97% improvement in computational speed over simulation methods. Combining speed and accuracy, the model offers designers a rapid, flexible, and intuitive approach to supporting daylight performance optimization, particularly in the early design stage.
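To make the fusion idea concrete, the sketch below illustrates one plausible reading of the multimodal input pipeline: a per-scene semantic feature vector (e.g. encoding materials) is broadcast across the spatial grid and concatenated with image features, and the fused map is then spatially up-sampled toward the output luminance resolution. This is a minimal NumPy sketch under assumptions of our own (the function names, tensor shapes, and nearest-neighbour up-sampling are illustrative, not the paper's actual network).

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_features(img_feat, sem_feat):
    # Hypothetical fusion block: broadcast the per-scene semantic vector
    # over all spatial locations, then concatenate along the channel axis.
    h, w, _ = img_feat.shape
    sem_map = np.broadcast_to(sem_feat, (h, w, sem_feat.shape[0]))
    return np.concatenate([img_feat, sem_map], axis=-1)

def upsample_nearest(x, factor=2):
    # Nearest-neighbour up-sampling along both spatial axes; GAN generators
    # typically use learned up-sampling, but the shape arithmetic is the same.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

img_feat = rng.standard_normal((8, 8, 16))  # spatial/material image features
sem_feat = rng.standard_normal(4)           # semantic feature vector (assumed size)
fused = fuse_features(img_feat, sem_feat)   # shape (8, 8, 20)
up = upsample_nearest(fused)                # shape (16, 16, 20)
print(fused.shape, up.shape)
```

Broadcasting the semantic vector spatially is one common way to condition an image-to-image generator on non-image metadata; the actual reinforcement blocks in the paper would add learned weighting on top of this concatenation.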