Energy-efficient and grid-interactive buildings are a critical enabler of the ongoing energy transition. In recent years, data-driven methods have emerged as a popular option both for improving energy efficiency and for leveraging existing flexibility to offer it as a grid service. However, these methods suffer from several shortcomings: (1) their black-box nature and lack of interpretability limit adoption in practice, (2) limited data availability and a lack of counterfactuals adversely affect generalization, and (3) data-driven learning of (potentially) non-causal relationships precludes their use in downstream tasks such as active control. In this paper, using two simple case studies, we demonstrate how these issues affect data-driven techniques in practice, and how causal machine learning techniques can help obtain debiased thermal models and predictions that are both reliable and accurate. Our results show key limitations of commonly used methods, the conditions under which they fail, where causal techniques can improve on them, and how their use can accelerate the exploitation of thermal energy flexibility in buildings.