Predictive maintenance uses artificial intelligence to analyze big data, supporting health-condition monitoring and maintenance planning in smart manufacturing systems. However, deep networks with massive numbers of parameters incur high computational costs, which hinders their practical application. Knowledge distillation (KD) transfers knowledge from a deep network to a lightweight network, but at the cost of an evident decline in diagnostic accuracy. Instead of accepting this accuracy-size trade-off, this paper proposes a framework for creating a lightweight residual network with higher diagnostic accuracy. First, the generalization ability of the teacher network is enhanced by building it on a domain-adversarial neural network (DANN) with a normalization attention mechanism (NAM) and ResNet18. Then, a small student network, ResNet6, efficiently learns from this well-trained teacher network through improved probability KD (IPKD) and uniform quantization. IPKD is designed to extract as much knowledge as possible, represented as probability distributions over both the labels and the feature space, and to narrow the gap between the teacher and student networks. Meanwhile, uniform quantization is incorporated into the distillation process so that quantization and distillation are co-optimized, further compressing the student network. Experiments on two open-source bearing datasets and one wheelset bearing dataset are conducted for performance verification and comparison. The results demonstrate that the resulting lightweight student network achieves accuracy similar to that of the deep network for cross-domain fault diagnosis of bearings with various damage types and speeds, and even across different bearing sources. Its quantized version is much smaller while retaining comparably similar accuracy, which makes it more practical for deployment on actual industrial equipment.
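
As a rough illustration of the two compression ingredients named above, the following minimal NumPy sketch shows a generic soft-label distillation loss and a generic min-max uniform quantizer. It is not the IPKD formulation or the quantization scheme of this paper, whose details are not given here. The function names (`kd_loss`, `uniform_quantize`), the temperature value, and the bit width are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis, softened by a temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Generic soft-label KD loss: KL(teacher || student) scaled by T^2.

    Assumption: this is the standard label-distribution loss only; the
    feature-space term described for IPKD is not modeled here.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    eps = 1e-12  # avoids log(0)
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    return float(temperature ** 2 * kl.mean())

def uniform_quantize(weights, num_bits=8):
    """Min-max uniform quantization; returns the dequantized (float) weights.

    Assumption: a single scale per tensor, with 2**num_bits evenly spaced levels.
    """
    levels = 2 ** num_bits - 1
    w_min, w_max = float(weights.min()), float(weights.max())
    if w_max == w_min:
        return weights.copy()  # constant tensor: nothing to quantize
    scale = (w_max - w_min) / levels
    q = np.round((weights - w_min) / scale)  # integer codes in [0, levels]
    return q * scale + w_min

# Hypothetical usage with random stand-ins for network outputs and weights.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 4))
student_logits = rng.normal(size=(8, 4))
loss = kd_loss(student_logits, teacher_logits)
quantized = uniform_quantize(rng.normal(size=(64,)), num_bits=4)
```

In a co-optimization setting such as the one described, a quantizer like this would be applied to the student's weights during the forward pass so that the distillation loss is computed on the quantized student. How gradients are propagated through the rounding step is a further design choice that this sketch leaves out.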