Although various algorithms based on convolutional neural networks (CNNs) have been proposed for coal-gangue detection under complex production conditions, the Transformer has rarely been applied to coal-gangue detection networks so far. Here, a lightweight CNN- and Transformer-based coal-gangue detection network is proposed, in which Swin Transformer blocks are introduced to promote feature fusion and achieve accurate localization and identification. The Transformer blocks enable interaction between long-range semantic information and inject more semantic information into low-level features. The alpha-IoU loss is further used to make bounding-box regression more accurate. A comparison with the output heatmap of the original network shows that the modified network accurately captures the region containing the target rather than irrelevant background. Images acquired under three illuminance levels served as test datasets (A(1), A(2), and A(3)) to evaluate the model's robustness to illumination. The results indicate that YOLOv5-Swin has the best illumination adaptability in coal-gangue detection. Compared with the original YOLOv5s, mAP on A(1), A(2), and A(3) increases by 2.53%, 2.4%, and 2.84%, respectively, while detection runs at 147 FPS, twice as fast as YOLOv3. The method therefore meets the requirements of real-time detection and can detect coal and gangue both accurately and quickly.
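As a rough illustration of the alpha-IoU loss mentioned above, the basic form of the loss raises the IoU to a power alpha, i.e. L = 1 - IoU^alpha. The sketch below is a minimal pure-Python version for a single pair of axis-aligned boxes; it is not the network's actual implementation. The function names `iou` and `alpha_iou_loss`, the `(x1, y1, x2, y2)` box format, and the value `alpha=3.0` are assumptions made for illustration, since the abstract does not state which alpha-IoU variant or which alpha was used.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def alpha_iou_loss(pred, target, alpha=3.0):
    """Basic alpha-IoU loss, 1 - IoU**alpha (alpha=3.0 is an assumed value)."""
    return 1.0 - iou(pred, target) ** alpha


# Example: a predicted box shifted half its width away from the target.
pred_box = (1.0, 0.0, 3.0, 2.0)
true_box = (0.0, 0.0, 2.0, 2.0)
loss_plain = alpha_iou_loss(pred_box, true_box, alpha=1.0)  # ordinary IoU loss
loss_alpha = alpha_iou_loss(pred_box, true_box, alpha=3.0)  # power-weighted loss
```

With `alpha=1.0` the expression reduces to the ordinary IoU loss. Because the IoU never exceeds 1, any alpha above 1 makes the loss at least as large as the ordinary IoU loss for the same pair of boxes.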