Contact-measured vibration signals are the most commonly used means of health monitoring for rotating equipment. At industrial sites, however, a centrifugal fan blade produces abnormal sounds before it breaks, while the vibration signal remains stable. Acoustic signals are measured without contact but contain strong background noise, which makes data-driven networks trained on them less trustworthy. Therefore, to enhance the interpretability of acoustic signal-driven models and to evaluate the credibility of their decisions, an interpretable ensemble selection framework for blade crack detection is proposed. First, a multi-view base network is constructed by combining deep features extracted by AlexNet with interpretable auxiliary statistical features through an attention weighting mechanism. Second, the second-order tensor of the time-frequency image of the acoustic signal is used as the network input, and the deep features of each trained base network are used with Grad-CAM to determine the activation mapping area and, in conjunction with the simulation results, to calculate the trustworthiness of the network. Next, a Diversity Pick LIME (DP-LIME) interpretable module is constructed for the embedded auxiliary features and combined with the feature weight distribution to visualize the decision logic of the base network. Finally, the blade crack detection results are selectively fused for decision-making according to the composite trustworthiness index of each base network. Validation on measured data from a centrifugal fan shows that the proposed interpretable framework achieves high blade crack detection accuracy and effectively improves the credibility of the model. © 2024 Chinese Mechanical Engineering Society. All rights reserved.
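
As a rough illustration of the final selective fusion step, the following minimal Python sketch assumes that each base network outputs a vector of class probabilities and has a scalar composite trustworthiness index. The thresholding rule, the trust-weighted averaging, and all names (`selective_fusion`, `trust_threshold`) are hypothetical choices made for illustration; the abstract does not specify the actual fusion rule.

```python
def selective_fusion(probs, trust, trust_threshold=0.5):
    """Fuse base-network outputs, keeping only sufficiently trusted networks.

    probs: list of per-network class-probability lists (same length each).
    trust: list of composite trustworthiness indices, one per network.
    trust_threshold: hypothetical cut-off; networks below it are excluded.
    Returns (predicted_label, fused_probabilities, selection_mask).
    """
    if not probs or len(probs) != len(trust):
        raise ValueError("probs and trust must be non-empty and equally long")

    # Select the networks whose trust index reaches the threshold.
    selected = [t >= trust_threshold for t in trust]
    if not any(selected):
        # Fallback (an assumption): keep the most trusted network(s).
        best = max(trust)
        selected = [t == best for t in trust]

    # Weight selected networks by their trust index; excluded ones get zero.
    weights = [t if keep else 0.0 for t, keep in zip(trust, selected)]
    total = sum(weights)
    if total == 0:
        # All selected trust values are zero: fall back to equal weights.
        weights = [1.0 if keep else 0.0 for keep in selected]
        total = sum(weights)
    weights = [w / total for w in weights]

    # Trust-weighted average of the class probabilities.
    n_classes = len(probs[0])
    fused = [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused, selected


# Illustrative values only: three base networks, classes (0 = healthy, 1 = crack).
label, fused, selected = selective_fusion(
    probs=[[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]],
    trust=[0.2, 0.9, 0.7],
)
```

In this made-up example the first network falls below the threshold and is excluded, so the fused decision rests only on the two more trusted networks.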