Defective insulators in substations pose a major risk to the safe and stable operation of the power grid. Efficient and accurate insulator defect detection algorithms are therefore of great significance for the intelligent operation and maintenance of substations. To address the problem that insulator defect regions contain little pixel information and vary widely in shape and size, a multi-scale defect detection network (MSD2Net) is proposed. First, the main challenge currently faced in insulator defect detection is analyzed. Second, to compensate for the insufficient pixel information of insulator defects, the model is built on the SSD detector, with ResNet replaced by an attentional feature extraction network. Third, to detect targets at different scales, a feature fusion network is designed, in which a deconvolution structure is used to enhance its automatic learning ability. In addition, MSD2Net uses Focal Loss as the classification loss and Gaussian non-maximum suppression as the post-processing method, which further improves detection performance. For the experiments, a dataset of defective insulators in substation scenarios is produced by image processing methods. To increase the diversity of the dataset, data augmentation operations such as color transformation, random cropping, and random flipping are applied. On this dataset, MSD2Net achieves a mean average precision (mAP) of 94.3%. Compared with the baseline network SSD and the classic single-stage network RetinaNet, MSD2Net improves the mAP by 4.5% and 3.9%, respectively. In addition, when tested on the public Chinese Power Line Insulator Dataset (CPLID), MSD2Net reaches an mAP of 91.2%, which is 2.7% and 7.9% higher than the SSD and VFNet models, respectively. The results show that the proposed model can effectively identify insulators and their defects in power inspection images.
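To illustrate the classification loss named above, the following is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), compared with plain cross-entropy. The settings `gamma=2.0` and `alpha=0.25` are commonly used defaults and are an assumption here, since the abstract does not state the values used in MSD2Net. The function names are hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for a single prediction.

    p: predicted probability of the positive class, in (0, 1)
    y: ground-truth label, 1 (positive) or 0 (negative)
    """
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for a single prediction (illustrative sketch).

    gamma and alpha are assumed defaults, not values from the paper.
    """
    # p_t is the probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    # alpha_t balances the contribution of positives and negatives.
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0) for extremely confident wrong predictions.
    p_t = min(max(p_t, eps), 1.0)
    # (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    easy_negative = (0.1, 0)  # background scored low: well classified
    hard_negative = (0.9, 0)  # background scored high: misclassified
    for name, (p, y) in [("easy", easy_negative), ("hard", hard_negative)]:
        print(name, "CE:", cross_entropy(p, y), "FL:", focal_loss(p, y))
```

In this sketch, the well-classified negative contributes far less to the focal loss than to cross-entropy, so the large number of easy background boxes in a single-stage detector has less influence on training.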
The following conclusions can be drawn from the experimental analysis: ① The attention-based backbone network reduces information loss and enhances the interaction of information between feature map groups, thereby extracting more critical information. ② The deconvolution fusion module fuses deep and shallow features, thereby providing more complete feature information to the detection module. ③ Focal Loss down-weights the many easily classified samples so that training concentrates on hard samples, which alleviates the imbalance between positive and negative samples. At the same time, Gaussian non-maximum suppression reduces missed detections of overlapping targets. © 2023 Chinese Machine Press. All rights reserved.
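The effect of Gaussian non-maximum suppression on overlapping targets can be sketched as follows. This example assumes that the method corresponds to the Gaussian variant of Soft-NMS, in which the score of each remaining box is multiplied by exp(-IoU^2 / sigma) instead of the box being discarded outright. The values `sigma=0.5` and `score_thresh=0.001` are illustrative assumptions, and the function names are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def gaussian_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS sketch: decay overlapping scores, do not delete.

    Returns a list of (box, score) pairs in the order they were selected.
    sigma and score_thresh are assumed values, not taken from the paper.
    """
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Select the highest-scoring remaining box.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best)
        kept.append((best_box, best_score))
        # Decay the remaining scores according to their overlap with it.
        rescored = []
        for box, score in candidates:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_thresh:
                rescored.append((box, score))
        candidates = rescored
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    scores = [0.9, 0.8, 0.7]
    for box, score in gaussian_nms(boxes, scores):
        print(box, round(score, 4))
```

In this sketch, the second box overlaps heavily with the first, so its score is reduced but the box is retained. Hard NMS with a typical IoU threshold would remove it entirely, which is how detections of closely spaced or overlapping insulators can be lost.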